pyspark: check whether any value from an array appears in a column

Posted by tvokkenx on 2021-07-12 in Spark

I want to check whether any value from this array:

list = ['dog', 'mouse', 'horse', 'bird']

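appears in a PySpark DataFrame column, giving a flag like this:

text                                 | isList
I like my two dogs                   | True
I don't know if I want to have a cat | False
Anna sings like a bird               | True
Horseland is a good place            | True

I found that for multiple words people tend to use dog|mouse|horse|bird, but I have a lot of words and would like to use an array instead. Can you help me?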
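For reference, the alternation pattern mentioned above does not have to be typed by hand. The following is a minimal sketch in plain Python; build_pattern is a hypothetical helper name, not something from the original post, and it only shows how such a pattern could be derived from the list.

```python
import re

# Hypothetical helper (not from the original post): join a word list into
# a single alternation pattern such as "dog|mouse|horse|bird".
def build_pattern(words):
    # re.escape guards against words that contain regex metacharacters.
    return "|".join(re.escape(w) for w in words)

words = ['dog', 'mouse', 'horse', 'bird']
pattern = build_pattern(words)
print(pattern)

# Quick local check with Python's own regex engine.
print(re.search(pattern, "i like my two dogs") is not None)
```

The resulting string could then be passed to rlike on the lowercased text column. Spark evaluates rlike with Java regular expressions, so the assumption here is that Python's escaping is compatible for the words in question, which holds for plain alphabetic words like these.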


r7xajy2e  #1

For Spark 3+, you can use the any function. Create a literal array from the list and explode it, then group by the text column and apply any:

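from pyspark.sql import functions as F

# Explode a literal array of the words so every row is paired with every word,
# then collapse back to one row per text and check whether any word matched.
# Note: groupBy("text") also merges rows that share the same text.
df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)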

df1.show(truncate=False)

# +------------------------------------+------+
# |text                                |isList|
# +------------------------------------+------+
# |I like my two dogs                  |true  |
# |Anna sings like a bird              |true  |
# |I don't know if I want to have a cat|false |
# |Horseland is a good place           |true  |
# +------------------------------------+------+
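To make the matching rule behind this output easier to see, here is a rough plain-Python analogy (no Spark involved). The function name matches_any and the sample list are illustrative assumptions; the point is that rlike looks for the word anywhere in the lowercased text, which is why "Horseland" and "dogs" both count as matches.

```python
import re

words = ['dog', 'mouse', 'horse', 'bird']
texts = [
    "I like my two dogs",
    "I don't know if I want to have a cat",
    "Anna sings like a bird",
    "Horseland is a good place",
]

def matches_any(text, words):
    # Mirrors any(lower(text) rlike word): True if at least one word
    # is found anywhere inside the lowercased text.
    lowered = text.lower()
    return any(re.search(word, lowered) is not None for word in words)

result = {text: matches_any(text, words) for text in texts}
for text, flag in result.items():
    print(f"{text!r}: {flag}")
```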

The same with max:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)

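If you want to check for exact matches, you can use the arrays_overlap function. This compares whole space-separated tokens and is case-sensitive, so "dogs" and "Horseland" no longer match: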

words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])

df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
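As a rough plain-Python analogy of the exact-match variant (the helper name has_exact_word is an assumption for illustration), splitting on a single space and testing for a shared element behaves like this on the sample sentences:

```python
words = ['dog', 'mouse', 'horse', 'bird']
texts = [
    "I like my two dogs",
    "I don't know if I want to have a cat",
    "Anna sings like a bird",
    "Horseland is a good place",
]

def has_exact_word(text, words):
    # Mirrors arrays_overlap(split(text, " "), words): True only if a
    # whole token equals one of the words (case-sensitive).
    tokens = text.split(" ")
    return any(token in words for token in tokens)

exact = {text: has_exact_word(text, words) for text in texts}
for text, flag in exact.items():
    print(f"{text!r}: {flag}")
```

Only the sentence containing the standalone token "bird" is flagged here, which differs from the rlike-based results shown earlier.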

anauzrmj  #2

If you want to use an array, you need transform with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')

df2.show(20, truncate=False)

# +------------------------------------+------+
# |text                                |isList|
# +------------------------------------+------+
# |I like my two dogs                  |true  |
# |I don't know if I want to have a cat|false |
# |Anna sings like a bird              |true  |
# |Horseland is a good place           |true  |
# +------------------------------------+------+

filter also works on the array: keep only the words that match, then test whether the filtered array is non-empty:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')

If you prefer aggregate, that is possible too:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')

Note that all three of these higher-order functions require Spark >= 2.4.
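The three variants in this answer compute the same flag in different ways. A rough plain-Python analogy (no Spark involved; the function names hit, via_transform, via_filter and via_aggregate are illustrative assumptions) might look like this:

```python
import re
from functools import reduce

word_list = ['dog', 'mouse', 'horse', 'bird']
texts = [
    "I like my two dogs",
    "I don't know if I want to have a cat",
    "Anna sings like a bird",
    "Horseland is a good place",
]

def hit(text, word):
    # Stand-in for "lower(text) rlike x".
    return re.search(word, text.lower()) is not None

def via_transform(text, words):
    # array_max(transform(...)): map each word to a boolean, take the max
    # (True sorts above False).
    return max(hit(text, w) for w in words)

def via_filter(text, words):
    # size(filter(...)) > 0: keep the matching words, check for non-empty.
    return len([w for w in words if hit(text, w)]) > 0

def via_aggregate(text, words):
    # aggregate(words, false, (acc, x) -> acc or ...): fold with OR.
    return reduce(lambda acc, w: acc or hit(text, w), words, False)

flags = {
    t: (via_transform(t, word_list), via_filter(t, word_list), via_aggregate(t, word_list))
    for t in texts
}
for t, f in flags.items():
    print(f"{t!r}: {f}")
```

All three functions agree on every sample sentence, so the choice between the Spark variants is mostly a matter of readability.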
