Split a string into words, check whether a word matches a list item, and return that word as the value of a new column

e5nszbig  posted on 2021-07-12  in Spark

I have a DataFrame with a column `text` that contains strings (or null).
If a word in `text` has a length >= 6 and <= 11, I want to compare it against `word_list`.
If the word matches, it should become the value of the new column `match`.

```
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["This is line one"],
    ["This is line two"],
    ["bla coroner foo bar"],
    ["This is line three"],
    ["foo bar shakespeare"],
    [None]
]).toDF("text")

word_list = ["one", "two", "shakespeare", "three", "coroner"]
```

Expected result:

+----------------------------------+------------+
|text                              |match       |
+----------------------------------+------------+
|This is line one                  |Null        |
|This is line two                  |Null        |
|bla coroner foo bar               |coroner     |
|This is line three                |Null        |
|foo bar shakespeare               |shakespeare |
|Null                              |Null        |
+----------------------------------+------------+

I know how to split the string into words, but I can't get anything to work after that step. I'm lost.
ekqde3dh  #1

You can use `regexp_extract` to pull out the matching word:

import pyspark.sql.functions as F

# keep only words of length 6 to 11 and join them into a regex alternation
pattern = '|'.join(word for word in word_list if 6 <= len(word) <= 11)

df2 = df.withColumn(
    'match',
    F.regexp_extract(
        'text',
        rf"\b({pattern})\b",
        1
    )
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))    # replace no match with null
)

df2.show(truncate=False)
+----------------------------------+------------+
|text                              |match       |
+----------------------------------+------------+
|This is line one                  |Null        |
|This is line two                  |Null        |
|bla coroner foo bar               |coroner     |
|This is line three                |Null        |
|foo bar shakespeare               |shakespeare |
|Null                              |Null        |
+----------------------------------+------------+

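The resulting pattern looks like `\b(word1|word2|word3)\b`, where `\b` marks a word boundary (for example at a space, or at the start or end of the line) and `|` means "or". If nothing matches, `regexp_extract` returns an empty string, which the `F.when` step turns into null.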
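To see how the word-boundary pattern behaves without a Spark session, here is a minimal sketch in plain Python using the standard `re` module. It is only an illustration: Spark evaluates `regexp_extract` with Java regular expressions, and the helper name `first_match` and the use of `re.escape` are assumptions that are not part of the answer above.

```python
import re

word_list = ["one", "two", "shakespeare", "three", "coroner"]

# Keep only the candidate words whose length is between 6 and 11 characters.
candidates = [w for w in word_list if 6 <= len(w) <= 11]

# Builds \b(shakespeare|coroner)\b; re.escape guards against regex metacharacters.
pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in candidates) + r")\b")


def first_match(text):
    """Return the first whole-word match in text, or None (hypothetical helper)."""
    if text is None:
        return None
    found = pattern.search(text)
    return found.group(1) if found else None


for line in ["This is line one", "bla coroner foo bar", "foo bar shakespeare", None]:
    print(repr(line), "->", first_match(line))
```

Because of the `\b` anchors, a longer word such as "coroners" does not count as a match for "coroner".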

nsc4cvqm  #2

You can turn `word_list` into an array literal and check the intersection of that array with the words of the `text` column:

from pyspark.sql import functions as F

# array literal holding only the words of length 6 to 11
word_list_arr = F.array(*[F.lit(w) for w in word_list if 6 <= len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))

df1.show(truncate=False)

# +----------------------------------+------------+
# |text                              |match       |
# +----------------------------------+------------+
# |This is line one                  |Null        |
# |This is line two                  |Null        |
# |bla coroner foo bar               |coroner     |
# |This is line three                |Null        |
# |foo bar shakespeare               |shakespeare |
# |Null                              |Null        |
# +----------------------------------+------------+
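
The split / intersect / join / nullif chain can be mimicked in plain Python to make the row-level logic easier to follow. This is a rough sketch rather than Spark code: the helper name `intersect_match` is hypothetical, and it assumes the matched words come back de-duplicated in the order they appear in the text.

```python
word_list = ["one", "two", "shakespeare", "three", "coroner"]

# Same length filter as the array literal: only words of length 6 to 11.
allowed = [w for w in word_list if 6 <= len(w) <= 11]


def intersect_match(text):
    """Mimic split -> array_intersect -> array_join -> nullif for one row (hypothetical helper)."""
    if text is None:
        return None  # a null input stays null
    matched = []
    for word in text.split(" "):  # split on single spaces only
        if word in allowed and word not in matched:  # intersection without duplicates
            matched.append(word)
    joined = " ".join(matched)
    return joined or None  # empty string becomes None, like nullif(match, '')


for line in ["This is line one", "bla coroner foo bar", "coroner meets shakespeare", None]:
    print(repr(line), "->", intersect_match(line))
```

Two differences from the regex approach are visible here: a row containing several listed words yields all of them joined by a space, and a word with attached punctuation (such as "coroner,") is not matched because the text is split on spaces only.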
