PySpark DataFrame: checking whether a column contains multiple strings at the same time

Posted by ukxgm1gy on 2022-12-03 in Spark


1. I have the following PySpark DataFrame:

message,type,object
"they are one, two, three, four, five, six",typeA,objectA
"they are one, two",typeB,objectB
"they are four,five",typeC,objectC
"they are six, five, four, three, two, one",typeD,objectD
"they are six, one, five, three, two, four",typeE,objectE

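2. Now I want to return only the rows whose message column contains all six words: one, two, three, four, five, six. The relationship among the six words is AND, not OR.
So the expected result is: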

message,type,object
"they are one, two, three, four, five, six",typeA,objectA
"they are six, five, four, three, two, one",typeD,objectD
"they are six, one, five, three, two, four",typeE,objectE

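3. The code I used, based on the rlike function, failed to return the expected result in 2.
I know I can chain six contains calls to get the expected result, but with many conditions the code becomes too long: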

df.message.contains("one") & df.message.contains("two")...&df.message.contains("six")
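
The chained contains calls above can be folded into a single condition, so the code does not grow with the number of words. The following is a minimal sketch, not the asker's code: the helper name contains_all and the words list are hypothetical, and it assumes the column object supports .contains() and the & operator, as a PySpark Column does.

```python
from functools import reduce
from operator import and_


def contains_all(column, words):
    """AND together column.contains(w) for every w in words.

    `column` is assumed to behave like a PySpark Column:
    it must offer .contains(...) and support the & operator.
    """
    return reduce(and_, (column.contains(w) for w in words))


# Hypothetical usage with the DataFrame from the question:
# words = ["one", "two", "three", "four", "five", "six"]
# df.filter(contains_all(df.message, words))
```

Because reduce applies & pairwise from left to right, the result is equivalent to writing out every contains call by hand, and adding a condition only means extending the list.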

Could an expert help me understand why the rlike function did not give me the expected result?

pyspark

Source: https://stackoverflow.com/questions/74640035/pyspark-dataframe-issue-of-judging-if-a-column-contains-multiple-strings-at-th

1 Answer


n53p2ov01#

I found a solution.
Write a regular expression that matches only if the column contains all of the required strings:

regex_pattern = "^(?=.*one)(?=.*two)(?=.*three)(?=.*four)(?=.*five)(?=.*six).*$"
df.filter(df.message.rlike(regex_pattern))
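
Each (?=.*word) group in the pattern above is a lookahead: it checks that the word occurs somewhere in the string without consuming any characters, so all six checks must hold at once, which gives AND semantics. As a rough sketch, the same pattern can be generated from a list instead of being typed by hand. The helper name all_words_pattern is hypothetical, and the check below uses Python's re module on the sample messages only to illustrate the logic; Spark's rlike uses Java regular expressions, which also support lookaheads.

```python
import re


def all_words_pattern(words):
    """Build a regex that matches only strings containing every word."""
    # One lookahead per word; re.escape guards against regex metacharacters.
    lookaheads = "".join("(?=.*%s)" % re.escape(w) for w in words)
    return "^" + lookaheads + ".*$"


words = ["one", "two", "three", "four", "five", "six"]
regex_pattern = all_words_pattern(words)

# Sample messages taken from the question's DataFrame.
messages = [
    "they are one, two, three, four, five, six",
    "they are one, two",
    "they are four,five",
    "they are six, five, four, three, two, one",
    "they are six, one, five, three, two, four",
]

# Keep only the messages that contain all six words, in any order.
matched = [m for m in messages if re.search(regex_pattern, m)]
print(matched)
```

The generated string is identical to the hand-written regex_pattern, so it can be passed to df.message.rlike in the same way. Note that these lookaheads match substrings, so a word such as "someone" would also satisfy the check for "one"; wrapping each word in \b word boundaries is one possible way to tighten this if needed.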

Answered on 2022-12-03
