For Spark 3+, you can use the any aggregate function: build an array column of literal words, explode it so there is one row per (text, word) pair, then group by the text column and aggregate with any:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.expr("any(lower(text) rlike word)").alias("isList")
)
df1.show(truncate=False)
# +------------------------------------+------+
# |text |isList|
# +------------------------------------+------+
# |I like my two dogs |true |
# |Anna sings like a bird |true |
# |I don't know if I want to have a cat|false |
# |Horseland is a good place |true |
# +------------------------------------+------+
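All of the snippets here assume a DataFrame df with a single text column. A minimal sketch to reproduce the sample data, with rows taken from the output above (the SparkSession setup itself is an assumption, not part of the original answers):

from pyspark.sql import SparkSession

# assumed setup, reconstructed from the sample output shown above
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("I like my two dogs",),
        ("Anna sings like a bird",),
        ("I don't know if I want to have a cat",),
        ("Horseland is a good place",),
    ],
    ["text"],
)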
The same result can be obtained with max instead of any:
df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    F.max(F.expr("lower(text) rlike word")).alias("isList")
)
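The max works because true sorts above false in Spark SQL, so it behaves like a logical OR over the exploded rows. On Spark 3+ the bool_or aggregate states the same thing more explicitly; a small sketch, not part of the original answer:

df1 = df.withColumn(
    "word",
    F.explode(F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']]))
).groupBy("text").agg(
    # bool_or is true if the predicate holds for at least one exploded word
    F.expr("bool_or(lower(text) rlike word)").alias("isList")
)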
If you want to check for exact (whole-word) matches, you can use the arrays_overlap function:
words_expr = F.array(*[F.lit(w) for w in ['dog', 'mouse', 'horse', 'bird']])
df1 = df.withColumn(
    'isList',
    F.arrays_overlap(F.split("text", " "), words_expr)
)
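Keep in mind that split("text", " ") keeps the original casing and punctuation, so whole-word matching gives different results than the rlike variants (on the sample data only "Anna sings like a bird" would match, since "dogs" and "Horseland" are not exact matches). A sketch that at least lower-cases the sentence before splitting, as an assumption on top of the original answer:

df1 = df.withColumn(
    'isList',
    # lower-case the sentence so 'Bird' and 'bird' compare equal
    F.arrays_overlap(F.split(F.lower(F.col("text")), " "), words_expr)
)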
If you want to work with an array column instead, you need transform with an rlike comparison:

import pyspark.sql.functions as F

word_list = ['dog', 'mouse', 'horse', 'bird']

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("array_max(transform(words, x -> lower(text) rlike x))")
).drop('words')
df2.show(20, 0)
# +------------------------------------+------+
# |text                                |isList|
# +------------------------------------+------+
# |I like my two dogs                  |true  |
# |I don't know if I want to have a cat|false |
# |Anna sings like a bird              |true  |
# |Horseland is a good place           |true  |
# +------------------------------------+------+
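array_max over the resulting boolean array evaluates to true as soon as at least one element is true, i.e. as soon as at least one word of the list matches the text.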
A filter over the array works as well, testing the size of the filtered array of matching words:
df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("size(filter(words, x -> lower(text) rlike x)) > 0")
).drop('words')
If you want to use aggregate, that is also possible:
df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    F.expr("aggregate(words, false, (acc, x) -> acc or lower(text) rlike x)")
).drop('words')
Note that all three of these higher-order functions (transform, filter and aggregate) require Spark >= 2.4.
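As an aside that is not part of the original answers, Spark 2.4+ also provides the exists higher-order function, which expresses the same check without filter/size or aggregate; a minimal sketch under the same word_list assumption:

df2 = df.withColumn(
    'words',
    F.array(*[F.lit(w) for w in word_list])
).withColumn(
    'isList',
    # exists is true if the lambda holds for at least one array element
    F.expr("exists(words, x -> lower(text) rlike x)")
).drop('words')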