Split a string into words, check whether a word matches a list item, and return that word as the value of a new column

Posted by e5nszbig on 2021-07-12 in Spark

I have a DataFrame with a column `text` that contains strings (or null).
If a word in the `text` column has a length >= 6 and <= 11, I want to compare it against `word_list`.
If the word matches, it becomes the value of the new column `match`.

```
import pyspark.sql.functions as F

df = spark.createDataFrame([
    ["This is line one"],
    ["This is line two"],
    ["bla coroner foo bar"],
    ["This is line three"],
    ["foo bar shakespeare"],
    [None]
]).toDF("text")

word_list = ["one", "two", "shakespeare", "three", "coroner"]
```

Expected result:

```
+----------------------------------+------------+
|text                              |match       |
+----------------------------------+------------+
|This is line one                  |null        |
|This is line two                  |null        |
|bla coroner foo bar               |coroner     |
|This is line three                |null        |
|foo bar shakespeare               |shakespeare |
|null                              |null        |
+----------------------------------+------------+
```

I know how to split the string into words, but I am lost after that.

ekqde3dh #1

You can use `regexp_extract` to pull out the matching word:

```
import pyspark.sql.functions as F

# Keep only the list words whose length is between 6 and 11, joined as regex alternatives
pattern = '|'.join([word for word in word_list if 6 <= len(word) <= 11])

df2 = df.withColumn(
    'match',
    F.regexp_extract(
        'text',
        rf"\b({pattern})\b",
        1
    )
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))  # replace no match with null
)

df2.show(truncate=False)
# +----------------------------------+------------+
# |text                              |match       |
# +----------------------------------+------------+
# |This is line one                  |null        |
# |This is line two                  |null        |
# |bla coroner foo bar               |coroner     |
# |This is line three                |null        |
# |foo bar shakespeare               |shakespeare |
# |null                              |null        |
# +----------------------------------+------------+
```

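The pattern looks like `\b(word1|word2|word3)\b`, where `\b` denotes a word boundary (for example a space, the start of the line, or the end of the line) and `|` means "or". Because a match has to equal one of the listed words exactly, filtering `word_list` by length has the same effect as filtering the words in `text` by length.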
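To see how the alternation and the word boundaries behave without a Spark session, here is a minimal plain-Python sketch using the standard `re` module. It assumes the same `word_list` as in the question; `build_pattern` and `first_match` are hypothetical helper names, and the `re.escape` call is an extra safeguard that is not part of the answer above. For plain ASCII words like these, Python's `re` and the Java regex engine behind Spark treat `\b` and `|` the same way, but the two engines can differ in other cases.

```python
import re

word_list = ["one", "two", "shakespeare", "three", "coroner"]

def build_pattern(words, min_len=6, max_len=11):
    """Build a word-boundary alternation from the words whose length is in range."""
    # re.escape is an added precaution (assumption: list words may contain regex metacharacters)
    kept = [re.escape(w) for w in words if min_len <= len(w) <= max_len]
    alternation = "|".join(kept)
    return rf"\b({alternation})\b"

def first_match(text, pattern):
    """Return the first listed word found in text, or None (also for None input)."""
    if text is None:
        return None
    m = re.search(pattern, text)
    return m.group(1) if m else None

pattern = build_pattern(word_list)

rows = [
    "This is line one",
    "This is line two",
    "bla coroner foo bar",
    "This is line three",
    "foo bar shakespeare",
    None,
]
results = [first_match(t, pattern) for t in rows]

print(pattern)
print(results)
```

Note that "three" is dropped from the pattern because it has only five characters, and that the trailing `\b` prevents a hit inside a longer word such as "coroners".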


nsc4cvqm #2

You can turn `word_list` into an array literal and intersect it with the array of words obtained by splitting the `text` column:

```
from pyspark.sql import functions as F

# Array literal holding only the list words whose length is between 6 and 11
word_list_arr = F.array(*[F.lit(w) for w in word_list if 6 <= len(w) <= 11])

df1 = df.withColumn(
    "match",
    F.array_join(F.array_intersect(F.split("text", " "), word_list_arr), " ")
).withColumn("match", F.expr("nullif(match, '')"))  # empty string means no match

df1.show(truncate=False)
# +----------------------------------+------------+
# |text                              |match       |
# +----------------------------------+------------+
# |This is line one                  |null        |
# |This is line two                  |null        |
# |bla coroner foo bar               |coroner     |
# |This is line three                |null        |
# |foo bar shakespeare               |shakespeare |
# |null                              |null        |
# +----------------------------------+------------+
```
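The split-and-intersect logic can also be sketched in plain Python, which makes two differences from the regex approach easier to see: tokens are produced by splitting on a single space, so a word with attached punctuation does not match, and every matching word is kept rather than only the first. `intersect_match` is a hypothetical helper written for illustration, and the assumption that matches come back de-duplicated in the order they appear in the text reflects my understanding of `array_intersect`, not something stated in the answer.

```python
word_list = ["one", "two", "shakespeare", "three", "coroner"]

def intersect_match(text, words, min_len=6, max_len=11):
    """Return the listed words found in text, space-joined, or None if there are none."""
    if text is None:
        return None
    allowed = {w for w in words if min_len <= len(w) <= max_len}
    seen = set()
    hits = []
    for token in text.split(" "):
        # keep each matching token once, in order of first appearance
        if token in allowed and token not in seen:
            seen.add(token)
            hits.append(token)
    # an empty join plays the role of nullif(match, '')
    return " ".join(hits) or None

print(intersect_match("bla coroner foo bar", word_list))
print(intersect_match("shakespeare met coroner shakespeare", word_list))
print(intersect_match("coroner, shakespeare", word_list))
```

If only one word per row is wanted, or if the text contains punctuation, the `regexp_extract` approach is likely the better fit.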