Check whether two array columns have overlapping elements

kd3sttzy, posted on 2021-07-12 in Spark

I have a DataFrame with two array columns, as shown below:

Arrayed_Column_1
[{"ID":222222,"No":2},{"ID":333333,"No":1}]
[{"ID":555555,"No":2},{"ID":333333,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555555,"No":2},{"ID":333333,"No":1}]

Arrayed_Column_2
[{"ID":333333,"No":2},{"ID":666663,"No":1}]
[{"ID":333333,"No":2},{"ID":666666,"No":1},{"ID":333333,"No":3}]
[{"ID":222222,"No":2},{"ID":555555,"No":1},{"ID":333333,"No":3}]
[{"ID":555333,"No":2},{"ID":66666,"No":1}]

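How can I check whether any combination of ID and No from column 1 also appears in column 2, without using the explode function?
I know about array_contains, but that only checks for one specific value.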

6pp0gazn #1

Try using arrays_overlap:

import pyspark.sql.functions as F

# Rebuild each element as struct(ID, No) so both arrays share the same element type
col1 = F.expr('transform(Arrayed_Column_1, x -> struct(x.ID as ID, x.No as No))')
col2 = F.expr('transform(Arrayed_Column_2, x -> struct(x.ID as ID, x.No as No))')

# Keep rows where the two arrays share at least one (ID, No) pair
df2 = df.filter(F.arrays_overlap(col1, col2))

Another approach is to check whether array_intersect returns a non-empty array:

df2 = df.filter(F.size(F.array_intersect(col1, col2)) != 0)

j7dteeu8 #2

You can also use exists + array_contains:

df1 = df.filter(
    "exists(Arrayed_Column_1, x -> array_contains(Arrayed_Column_2, x))"
)
