不同列忽略空值的平均值,spark scala

af7jpaap  于 2021-07-12  发布在  Spark
关注(0)|答案(1)|浏览(321)

我有一个不同列的Dataframe,我要做的是忽略空值的diff列的平均值。例如:

+--------+-------+---------+-------+
| Baller | Power | Vision  | KXD   |
+--------+-------+---------+-------+
| John   |   5   |    null |   10  |
| Bilbo  |   5   |    3    |    2  |
+--------+-------+---------+-------+

输出必须是:

+--------+-------+---------+-------+-----------+
| Baller | Power | Vision  | KXD   | MEAN      |
+--------+-------+---------+-------+-----------+
| John   |   5   |    null |   10  |    7.5    |
| Bilbo  |   5   |    3    |    2  |    3,33   |
+--------+-------+---------+-------+-----------+

我在做什么:

val a_cols = Array(col("Power"), col("Vision"), col("KXD"))

val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length

val avg_calc = df.withColumn("MEAN", avgFunc)

但是我得到空值:

+--------+-------+---------+-------+-----------+
| Baller | Power | Vision  | KXD   | MEAN      |
+--------+-------+---------+-------+-----------+
| John   |   5   |    null |   10  |    null   |
| Bilbo  |   5   |    3    |    2  |    3,33   |
+--------+-------+---------+-------+-----------+
lyr7nygr

lyr7nygr1#

您可以分解列并按+平均值进行分组,然后使用baller列连接回原始Dataframe:

val result = df.join(
    df.select(
        col("Baller"), 
        explode(array(col("Power"), col("Vision"), col("KXD")))
    ).groupBy("Baller").agg(mean("col").as("MEAN")), 
    Seq("Baller")
)

result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD|              MEAN|
+------+-----+------+---+------------------+
|  John|    5|  null| 10|               7.5|
| Bilbo|    5|     3|  2|3.3333333333333335|
+------+-----+------+---+------------------+

相关问题