Split a string into words, check whether a word matches a list item, and return that word as the value of a new column

Posted by e5nszbig on 2021-07-12 in Spark


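I have a DataFrame with a column `text` that contains strings (or null).
If a word in the `text` column has a length >= 6 and <= 11, I want to compare it against `word_list`.
If the word matches, it becomes the value of the new column `match`.

```
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuses the active session if one exists (for example in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ["This is line one"],
    ["This is line two"],
    ["bla coroner foo bar"],
    ["This is line three"],
    ["foo bar shakespeare"],
    [None]
]).toDF("text")

word_list = ["one", "two", "shakespeare", "three", "coroner"]
```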

Expected result

I know how to split the string into words, but after that I am stuck.

apache-spark pyspark apache-spark-sql List split

Source: https://stackoverflow.com/questions/66470199/split-a-string-in-words-and-check-if-a-word-matches-a-list-item-and-return-that

2 Answers


#1 ekqde3dh

You can use `regexp_extract` to get the matching string:

```
import pyspark.sql.functions as F

# Build an alternation from the list words whose length is between 6 and 11
pattern = '|'.join(word for word in word_list if 6 <= len(word) <= 11)

df2 = df.withColumn(
    'match',
    F.regexp_extract(
        'text',
        rf"\b({pattern})\b",
        1
    )
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))    # replace no match with null
)
df2.show(truncate=False)
# +----------------------------------+------------+
# |text                              |match       |
# +----------------------------------+------------+
# |This is line one                  |null        |
# |This is line two                  |null        |
# |bla coroner foo bar               |coroner     |
# |This is line three                |null        |
# |foo bar shakespeare               |shakespeare |
# |null                              |null        |
# +----------------------------------+------------+
```

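The pattern has the form `\b(word1|word2|word3)\b`, where `\b` denotes a word boundary (the position between a word character and a non-word character, such as a space or the start/end of the line) and `|` means "or".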
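To see how the word boundaries behave, the same pattern can be tried outside Spark. The sketch below uses Python's `re` module instead of Spark's regex engine, on the assumption that both treat this simple pattern the same way. The helper `first_match`, the use of `re.escape`, and the extra sample `"coroners inquest"` are illustrative additions and not part of the answer.

```python
import re

word_list = ["one", "two", "shakespeare", "three", "coroner"]

# Same length filter as the answer; re.escape is an extra precaution
# (an assumption here) in case a list word contains regex metacharacters.
pattern = "|".join(re.escape(w) for w in word_list if 6 <= len(w) <= 11)
regex = re.compile(rf"\b({pattern})\b")


def first_match(text):
    """Return the first list word found in text, or None (hypothetical helper)."""
    if text is None:
        return None
    m = regex.search(text)
    return m.group(1) if m else None


samples = [
    "This is line one",      # "one" is too short to be in the pattern
    "bla coroner foo bar",
    "foo bar shakespeare",
    "coroners inquest",      # no boundary after "coroner", so no match
    None,
]
results = [first_match(s) for s in samples]
print(results)
```

Because of `\b`, a list word only matches as a whole word, so `"coroners"` does not produce `"coroner"`.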


Answered 2021-07-12

#2 nsc4cvqm

You can turn `word_list` into an array literal and check its intersection with the words of the `text` column:

```
from pyspark.sql import functions as F

# Array literal holding only the list words whose length is between 6 and 11
word_list_arr = F.array(*[F.lit(w) for w in word_list if 6 <= len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))  # empty string means no match
df1.show(truncate=False)
# +----------------------------------+------------+
# |text                              |match       |
# +----------------------------------+------------+
# |This is line one                  |null        |
# |This is line two                  |null        |
# |bla coroner foo bar               |coroner     |
# |This is line three                |null        |
# |foo bar shakespeare               |shakespeare |
# |null                              |null        |
# +----------------------------------+------------+
```
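The split-and-intersect approach differs from a first-match regex when a row contains more than one list word: every matching word is kept and the results are joined with a space. A rough plain-Python sketch of that logic follows. It is not Spark code, the helper `match_words` is hypothetical, and it assumes the intersection keeps words in the order they appear in the text and drops duplicates.

```python
word_list = ["one", "two", "shakespeare", "three", "coroner"]

# Same length filter as in the answer
allowed = [w for w in word_list if 6 <= len(w) <= 11]


def match_words(text):
    """Mimic split + intersect + join + nullif for one row (hypothetical helper)."""
    if text is None:
        return None
    kept = []
    for token in text.split(" "):
        # Keep tokens that are in the allowed list, without duplicates
        if token in allowed and token not in kept:
            kept.append(token)
    # An empty join result is turned into None, like nullif(match, '')
    return " ".join(kept) or None


rows = ["This is line one", "bla coroner foo bar", "coroner meets shakespeare", None]
matches = [match_words(r) for r in rows]
print(matches)
```

If only a single word per row is wanted, the join step would have to be replaced by picking one element of the intersection.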


Answered 2021-07-12
