PySpark: fill values with null where the corresponding flag column is zero

sqougxex asked on 2021-07-09 in Spark

I have two DataFrames, as follows.
df1:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|    abc|    021| abc456|
|    def|    456| xyz098|
+-------+-------+-------+
df2:
+---+-------+-------+-------+
|ref|column1|column2|column3|
+---+-------+-------+-------+
|  A|      1|      0|      1|
|  B|      0|      0|      1|
+---+-------+-------+-------+
I want to replace df1's column values with null wherever the corresponding flag in df2's row with ref value A is zero.
Expected output for ref value A:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|    abc|   null| abc456|
|    def|   null| xyz098|
+-------+-------+-------+
And the same for ref value B in df2.
Expected output for ref value B:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|   null|   null| abc456|
|   null|   null| xyz098|
+-------+-------+-------+
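
For anyone who wants to reproduce this, here is a minimal sketch that builds the two sample DataFrames, assuming a local SparkSession (the literal values are copied from the tables above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the tables above; df1's column2 is kept as a
# string so the leading zero in '021' survives.
df1 = spark.createDataFrame(
    [('abc', '021', 'abc456'), ('def', '456', 'xyz098')],
    ['column1', 'column2', 'column3'],
)
df2 = spark.createDataFrame(
    [('A', 1, 0, 1), ('B', 0, 0, 1)],
    ['ref', 'column1', 'column2', 'column3'],
)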


ukxgm1gy 1#

You can cross-join df1 to the filtered df2 and use when to keep a df1 value only where the corresponding flag is not equal to 0.

import pyspark.sql.functions as F

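# Keep each df1 value only where the matching flag in the ref = 'A' row
# is non-zero; when() without otherwise() yields null for the zero flags.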
out_df_refA = (df1.alias('df1')
    .crossJoin(df2.filter("ref = 'A'").drop('ref').alias('df2'))
    .select(*[F.when(F.col('df2.' + c) != 0, F.col('df1.' + c)).alias(c) for c in df1.columns])
)

out_df_refA.show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|    abc|   null| abc456|
|    def|   null| xyz098|
+-------+-------+-------+
import pyspark.sql.functions as F

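# Same pattern, this time masking with the flags from the ref = 'B' row.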
out_df_refB = (df1.alias('df1')
    .crossJoin(df2.filter("ref = 'B'").drop('ref').alias('df2'))
    .select(*[F.when(F.col('df2.' + c) != 0, F.col('df1.' + c)).alias(c) for c in df1.columns])
)
out_df_refB.show()
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|   null|   null| abc456|
|   null|   null| xyz098|
+-------+-------+-------+
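
If df2 holds more flag rows than just A and B, the same pattern can be wrapped in a loop over the distinct ref values. A minimal sketch, assuming the same df1/df2 as above (the masked dict name is mine, not from the answer):

import pyspark.sql.functions as F

# One masked DataFrame per ref value, keyed by that value.
masked = {}
for ref in [row['ref'] for row in df2.select('ref').distinct().collect()]:
    masked[ref] = (df1.alias('df1')
        .crossJoin(df2.filter(F.col('ref') == ref).drop('ref').alias('df2'))
        .select(*[F.when(F.col('df2.' + c) != 0, F.col('df1.' + c)).alias(c)
                  for c in df1.columns])
    )

masked['A'].show()
masked['B'].show()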
