Left anti join does not treat null as a duplicate value in Spark

yqkkidmi posted on 2021-05-29 in Spark

I have two tables, and I want to read only the unique records from the source table. Both tables contain null values.

source table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null
ccc  | 30  | mech   | null

target table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null

source_df.join(target_df, Seq("name", "age", "degree"), "leftanti")          -> works
source_df.join(target_df, Seq("name", "age", "degree", "dept"), "leftanti")  -> does not work

Now I need to pick only the 3rd record from the source table.

If I use name, age, degree as my joining keys, it works as expected.

But when I include dept, it picks all the records from the source table.

Please help me.
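
For reference, a minimal sketch that reproduces the two joins above (the rows are taken from the tables in the question; a local SparkSession is assumed):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample rows from the question; dept is null in every row.
val source_df = Seq(
  ("aaa", 20, "ece",  null.asInstanceOf[String]),
  ("bbb", 20, "it",   null.asInstanceOf[String]),
  ("ccc", 30, "mech", null.asInstanceOf[String])
).toDF("name", "age", "degree", "dept")

val target_df = Seq(
  ("aaa", 20, "ece", null.asInstanceOf[String]),
  ("bbb", 20, "it",  null.asInstanceOf[String])
).toDF("name", "age", "degree", "dept")

// 3-column key: only ccc is left unmatched, as expected.
source_df.join(target_df, Seq("name", "age", "degree"), "leftanti").show(false)

// 4-column key: dept is null on both sides and null = null is not true,
// so no source row matches and all three source rows come back.
source_df.join(target_df, Seq("name", "age", "degree", "dept"), "leftanti").show(false)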

ikfrs5lh #1

With a regular equi-join key, dept = dept is not true when both sides are null, so those rows never match the target and the anti join returns all of them. Use a null-safe equality test instead:

source_df.join(target_df,
  source_df("name") <=> target_df("name") && source_df("age") <=> target_df("age") &&
  source_df("degree") <=> target_df("degree") && source_df("dept") <=> target_df("dept"),
  "leftanti").show(false)

    /**
      * +----+---+------+----+
      * |name|age|degree|dept|
      * +----+---+------+----+
      * |ccc |30 |mech  |null|
      * +----+---+------+----+
      */
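
If there are many join columns, one way to avoid writing each <=> comparison by hand (a sketch, assuming the same source_df and target_df as above) is to fold the shared column names into a single null-safe condition:

// Build one null-safe condition over all shared columns, then anti join on it.
val joinCols = Seq("name", "age", "degree", "dept")
val condition = joinCols
  .map(c => source_df(c) <=> target_df(c))
  .reduce(_ && _)

source_df.join(target_df, condition, "leftanti").show(false)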

In Python, replace <=> with a call to the eqNullSafe method, as in the sample below:

df1.join(df2, df1["value"].eqNullSafe(df2["value"]))
