python: How to apply a condition in PySpark that keeps null only when an ID has no other scores, and removes nulls otherwise

dddzy1tm posted on 2021-07-13 in Spark

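Conditions:
If an ID has a score of "High" or "Mid" -> remove its None rows
If an ID has only None scores -> keep only the None row
Input:
ID   Score
AAA  High
AAA  Mid
AAA  None
BBB  None
Expected output:
ID   Score
AAA  High
AAA  Mid
BBB  None
I am having trouble writing the if condition in PySpark. Or is there another way to solve this problem?
Thanks for your help. Much appreciated!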

izj3ouym #1

You can count the non-null Score values over a window partitioned by ID, then keep the rows where Score is not null or that count is 0:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "count_scores",
    F.count("Score").over(Window.partitionBy("ID"))
).where("Score IS NOT NULL OR count_scores = 0")\
 .drop("count_scores")

df1.show()

# +---+-----+
# | ID|Score|
# +---+-----+
# |BBB| null|
# |AAA| High|
# |AAA|  Mid|
# +---+-----+
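
To make the rule easier to follow outside Spark, here is a minimal plain-Python sketch of the same logic. It is only an illustration, not part of the answer's code: the helper name keep_null_only_if_alone and the list-of-tuples layout are assumptions, and None stands in for a null Score.

```python
from collections import Counter

# Sample rows mirroring the question's input; None stands in for a null Score.
rows = [("AAA", "High"), ("AAA", "Mid"), ("AAA", None), ("BBB", None)]


def keep_null_only_if_alone(rows):
    """Drop null scores for IDs that have a real score; keep them otherwise."""
    # Count non-null scores per ID, similar to F.count("Score") over a window on ID.
    non_null = Counter(id_ for id_, score in rows if score is not None)
    # Keep a row if its score is non-null, or if its ID has no non-null score at all.
    return [
        (id_, score)
        for id_, score in rows
        if score is not None or non_null[id_] == 0
    ]


result = keep_null_only_if_alone(rows)
print(result)  # [('AAA', 'High'), ('AAA', 'Mid'), ('BBB', None)]
```

The count per ID plays the role of the count_scores column: it is 0 only for IDs whose scores are all null, which is exactly the case where the null row has to survive.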

hgqdbh6s #2

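You can add a flag that tells whether all scores for the ID are null, then keep the rows where the score is not null or the flag is true (all scores are null):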

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'flag', 
    F.min(F.col('Score').isNull()).over(Window.partitionBy('ID'))
).filter('flag or Score is not null').drop('flag')

df2.show()
+---+-----+
| ID|Score|
+---+-----+
|BBB| null|
|AAA| High|
|AAA|  Mid|
+---+-----+
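
The flag relies on the minimum of a set of booleans being true only when every one of them is true. A small plain-Python sketch of that idea follows, assuming Spark orders booleans with false below true as Python does; the helper name all_null_flag is made up for illustration, and None stands in for a null Score.

```python
def all_null_flag(scores):
    """Return True only if every score in one ID's partition is null (None)."""
    # False sorts below True, so a single non-null score pulls the minimum to False.
    return min(score is None for score in scores)


aaa_flag = all_null_flag(["High", "Mid", None])  # AAA has real scores
bbb_flag = all_null_flag([None])                 # BBB has only a null score
print(aaa_flag, bbb_flag)  # False True
```

With this flag, a null row is kept only for an ID like BBB, while rows with a non-null score pass the filter regardless of the flag.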
