Python: summing PySpark columns while ignoring NaN values

Asked by ego6inou on 2021-07-13 in Spark

I have a PySpark DataFrame that looks like this:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|   1|   3|
|  2| NaN|   4|
|  3|   3|   5|
+---+----+----+

I want to add col1 and col2 so that the result looks like this:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|  4|
|  3|   3|   5|  8|
+---+----+----+---+

Here is what I have tried:

import pandas as pd
import pyspark.sql.functions as F

# assumes an active SparkSession named `spark`
test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],  # None becomes NaN in this float column
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()

This code returns:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|NaN| # <-- I want a 4 here, not this NaN
|  3|   3|   5|  8|
+---+----+----+---+

Can anyone help?

Answer 1 (by um6iljoc):

Use F.nanvl to replace NaN with a given value (here 0). Note that nanvl handles floating-point NaN, which is distinct from SQL NULL; for NULL values you would use coalesce instead:

import pyspark.sql.functions as F

result = test.withColumn('sum', F.nanvl(F.col('col1'), F.lit(0)) + F.col('col2'))
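For reference, running result.show() on the example DataFrame should give a 4 in row 2, since the NaN in col1 is treated as 0 (exact rendering may differ slightly; because the frame was built from pandas, col1 and sum display as doubles):

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1| 1.0|   3|4.0|
|  2| NaN|   4|4.0|
|  3| 3.0|   5|8.0|
+---+----+----+---+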

Per the comment, if the sum should remain NaN when both columns are NaN:

result = test.withColumn('sum', 
    F.when(
        # both inputs are NaN -> keep NaN as the result
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')), 
        F.lit(float('nan'))
    ).otherwise(
        # otherwise treat any NaN as 0 and add
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
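If more than two columns need to be summed, the same idea generalizes. Below is a minimal sketch (my own addition, not part of the original answer) that folds nanvl over a list of column names, assuming the same test DataFrame; the cols list is hypothetical:

from functools import reduce
import pyspark.sql.functions as F

cols = ['col1', 'col2']  # hypothetical list of columns to sum
sum_expr = reduce(
    lambda acc, c: acc + F.nanvl(F.col(c), F.lit(0.0)),
    cols,
    F.lit(0.0)  # start from 0 so an empty list still yields a valid column
)
result = test.withColumn('sum', sum_expr)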
