How to transform a DataFrame to add a column listing which strings from a list another column contains

yduiuuwa  posted on 2021-07-09 in Spark

Suppose I have a list of keywords in Scala, val keywords = List("pineapple", "lemon"), and a DataFrame like this:

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+

How can I transform this DataFrame to add a new column listing the keywords that Body contains? The result I want is:

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+
niknxzdl  #1

Check the code below.
First, create a DataFrame with the required sample data.

scala> val df = Seq(
      (123,"I contain both keywords pineapple and lemon"),
      (456,"I sadly don't contain anything"),
      (789,"Pineapple's are delicious")).toDF("id","body")

df: org.apache.spark.sql.DataFrame = [id: int, body: string]
scala> val keywords = List("pineapple", "lemon")
keywords: List[String] = List(pineapple, lemon)
Use `typedLit` to add `keywords` to the DataFrame as an array column, then use the `filter` higher-order function to keep each `keyword` that occurs in the `body` column.

scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)

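Final output (the helper keywords column can be removed afterwards with .drop("keywords") if it is not needed):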

+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
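
To make the per-row logic of the `filter` expression easier to follow, here is a minimal plain-Scala sketch that needs no Spark. It only models what the lambda does for one `body` value: lower-case the text and keep each keyword that occurs as a substring. The object name `SubstringKeywordFilter` and the helper `containedKeywords` are hypothetical names introduced for illustration; they are not part of the answer or of the Spark API.

```scala
// Plain-Scala model of the per-row logic; this is not Spark code.
object SubstringKeywordFilter {
  // Same keyword list as in the question.
  val keywords: List[String] = List("pineapple", "lemon")

  // Roughly mirrors: filter(keywords, keyword -> instr(lower(body), keyword) > 0)
  // Keeps each keyword that occurs as a substring of the lower-cased body.
  def containedKeywords(body: String): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter(keyword => lowered.contains(keyword))
  }

  def main(args: Array[String]): Unit = {
    val bodies = List(
      "I contain both keywords pineapple and lemon",
      "I sadly don't contain anything",
      "Pineapple's are delicious"
    )
    // Print each body next to the keywords found in it.
    bodies.foreach(body => println(s"$body -> ${containedKeywords(body)}"))
  }
}
```

Because this is substring matching, a keyword such as apple would also be reported for a body that only contains pineapple.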

9jyewag0  #2

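You can convert the keyword list to a DataFrame and then join it to the original DataFrame on an rlike condition. It is a good idea to add \\\\b (the escaped form of the regex word boundary \b) before and after each keyword, which prevents partial matches such as apple matching pineapple.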

val result = df.as("df")
    .join(keywords.toDF("keywords").as("keywords"), 
          expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"), 
          "left"
         )
    .groupBy("id", "body")
    .agg(collect_list("keywords").as("Contains_keywords"))

result.show(false)
+---+-------------------------------------------+------------------+
|id |body                                       |Contains_keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|789|Pineapple's are delicious                  |[pineapple]       |
|456|I sadly don't contain anything             |[]                |
+---+-------------------------------------------+------------------+
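
The effect of the word boundaries can be checked without Spark. The following is a minimal plain-Scala sketch, assuming a hypothetical keyword list in which one keyword (apple) is a substring of another (pineapple). The names `WordBoundaryKeywordFilter`, `substringMatches`, and `wholeWordMatches` are made up for illustration, and `Pattern.quote` is an extra safeguard against regex metacharacters in keywords that the join above does not apply.

```scala
import java.util.regex.Pattern

// Plain-Scala comparison of substring matching and word-boundary matching.
object WordBoundaryKeywordFilter {
  // Hypothetical keywords: "apple" is a substring of "pineapple".
  val keywords: List[String] = List("apple", "pineapple", "lemon")

  // Substring matching: keeps every keyword that occurs anywhere in the body.
  def substringMatches(body: String): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter(keyword => lowered.contains(keyword))
  }

  // Whole-word matching: \b on both sides requires a word boundary around the keyword.
  def wholeWordMatches(body: String): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter { keyword =>
      // Pattern.quote treats the keyword as literal text, not as a regex.
      val pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b")
      pattern.matcher(lowered).find()
    }
  }

  def main(args: Array[String]): Unit = {
    val body = "Pineapple's are delicious"
    println(s"substring:  ${substringMatches(body)}")
    println(s"whole word: ${wholeWordMatches(body)}")
  }
}
```

For this body, substring matching reports both apple and pineapple, while the word-boundary version reports only pineapple, because the apple inside pineapple is preceded by a word character.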
