Suppose I have a list of keywords in Scala, val keywords = List("pineapple", "lemon"),
and a DataFrame like this:
+---+-------------------------------------------+
|ID |Body |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything... |
|789|Pineapple's are delicious |
+---+-------------------------------------------+
How can I add a new column to this DataFrame that lists the keywords contained in Body?
My desired result is:
+---+-------------------------------------------+------------------+
|ID |Body |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything... |[] |
|789|Pineapple's are delicious |[pineapple] |
+---+-------------------------------------------+------------------+
2 Answers
niknxzdl1#
Check the code below.
Creating the DataFrame with the required sample data.
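The setup code itself is not shown in this answer. A minimal sketch of what it might look like follows, assuming a local SparkSession and the lowercase column names id and body that appear in the answer's output. The app name and the master setting are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session named `spark` already exists and getOrCreate() simply returns it;
// the builder settings below only matter in a standalone run (assumed values).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("keywords-demo")
  .getOrCreate()
import spark.implicits._ // enables .toDF on local Scala collections

val keywords = List("pineapple", "lemon")

// Sample rows copied from the answer's output table.
val df = Seq(
  (123, "I contain both keywords pineapple and lemon"),
  (456, "I sadly don't contain anything"),
  (789, "Pineapple's are delicious")
).toDF("id", "body")
```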
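scala> import org.apache.spark.sql.functions.{expr, typedLit}
scala> df
.withColumn("keywords", typedLit(keywords)) // attach the keyword list to every row as an array column
.withColumn("Contains_Keywords", expr("filter(keywords, keyword -> instr(lower(body), keyword) > 0)")) // keep keywords found in the lowercased body (filter needs Spark 2.4+)
.show(false)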
+---+-------------------------------------------+------------------+------------------+
|id |body |keywords |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything |[pineapple, lemon]|[] |
|789|Pineapple's are delicious |[pineapple, lemon]|[pineapple] |
+---+-------------------------------------------+------------------+------------------+
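To make the per-row logic of the filter expression easier to follow, here is a rough plain-Scala equivalent that does not need Spark. The object and method names are hypothetical and used only for illustration. Like instr, it does a plain substring check on the lowercased text, so it assumes the keywords are already lowercase and it would also report a keyword that occurs only inside a longer word.

```scala
import java.util.Locale

object FilterInstrSketch {
  // Hypothetical helper: mirrors filter(keywords, keyword -> instr(lower(body), keyword) > 0)
  // for a single row. instr returns a 1-based position, or 0 when the substring is absent,
  // so "instr(...) > 0" is equivalent to a substring containment check.
  def matchedKeywords(body: String, keywords: Seq[String]): Seq[String] = {
    val lowered = body.toLowerCase(Locale.ROOT)
    keywords.filter(keyword => lowered.contains(keyword))
  }
}
```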
9jyewag02#
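You can convert the keyword list to a DataFrame and then join it with the original DataFrame on an rlike condition. It is a good idea to add \\\\b before and after the keyword to mark word boundaries, which prevents partial matches such as apple matching pineapple.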
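The approach described above might be sketched as follows. This is an illustration under stated assumptions, not necessarily the answerer's exact code: the object name, the method name, and the helper columns keyword and kw_regex are hypothetical. To sidestep SQL string escaping, the sketch builds the pattern with lit("\\b"), which in Scala source is the two-character regex token \b, instead of embedding the backslashes in the expr string. It also assumes the keywords are lowercase and contain no regex metacharacters.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat, expr, lit}

object KeywordJoinSketch {
  // Hypothetical helper: returns ID, Body and an array of the keywords
  // that occur in Body as whole words (case-insensitive on the Body side).
  def withMatchedKeywords(spark: SparkSession, df: DataFrame, keywords: Seq[String]): DataFrame = {
    import spark.implicits._

    // One row per keyword, plus a word-bounded regex such as \bpineapple\b.
    val keywordDf = keywords.toDF("keyword")
      .withColumn("kw_regex", concat(lit("\\b"), col("keyword"), lit("\\b")))

    // The left join keeps rows without any match; collect_list skips the resulting
    // nulls, so those rows end up with an empty array.
    df.join(keywordDf, expr("lower(Body) rlike kw_regex"), "left")
      .groupBy("ID", "Body")
      .agg(collect_list("keyword").as("Contains_Keywords"))
  }
}

// Usage in spark-shell, assuming df has columns ID and Body:
//   KeywordJoinSketch.withMatchedKeywords(spark, df, keywords).show(false)
```

The order of elements inside Contains_Keywords is not guaranteed after the aggregation, so compare results as sets if the order matters.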