PySpark DataFrame: how to group by with an aggregation function such as standard deviation

dnph8jn4 · asked on 2022-11-01 · in Spark

PySpark & pandas DataFrames: how do I group by with an aggregation function such as standard deviation?

from pyspark.sql import SparkSession

# create (or reuse) the active SparkSession
spark = SparkSession.builder.getOrCreate()

data = [['_1','S1',12, 112, 14],
        ['_2','S1',120, 112, 114],
        ['_3','S2',88, 92, 74],
        ['_4','S2',17, 118, 133],
        ['_5','S2',19, 19, 14],
        ['_6','S2',11, 12, 14]]
columns = ['RowNum','School','Subject_1', 'Subject_2', 'Subject_3']
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
pandasDF = dataframe.toPandas()

+------+------+---------+---------+---------+
|RowNum|School|Subject_1|Subject_2|Subject_3|
+------+------+---------+---------+---------+
|    _1|    S1|       12|      112|       14|
|    _2|    S1|      120|      112|      114|
|    _3|    S2|       88|       92|       74|
|    _4|    S2|       17|      118|      133|
|    _5|    S2|       19|       19|       14|
|    _6|    S2|       11|       12|       14|
+------+------+---------+---------+---------+

Given this data, how do I group by School and compute the standard deviation of each subject in PySpark, the same way I would in pandas?

import numpy as np

def std(x):
    return np.std(x)

# drop the RowNum column (axis=1), then aggregate per school
pandasDF.drop(['RowNum'], axis=1).groupby('School').agg(['mean', 'max', std])
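A side note (my addition, not part of the original question): np.std defaults to the population standard deviation (ddof=0), while pandas' built-in 'std' aggregation and Spark's stddev both compute the sample standard deviation (ddof=1). If the pandas result should later match Spark, passing ddof explicitly keeps the two consistent; a minimal sketch, with sample_std as a hypothetical helper name:

import numpy as np

# sample standard deviation (ddof=1), matching Spark's stddev / stddev_samp
def sample_std(x):
    return np.std(x, ddof=1)

result = (
    pandasDF
    .drop(['RowNum'], axis=1)
    .groupby('School')
    .agg(['mean', 'max', sample_std])
)
print(result)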

4uqofj5v (answer 1#)

I think what you are looking for is something like this:

import pyspark.sql.functions as f

column_list = ['Subject_1', 'Subject_2', 'Subject_3']
df = (
    dataframe
    .groupBy('School')
    .agg(
        # one aggregate column per subject and per statistic
        *[f.stddev(f.col(element)).alias(f'stddev_{element}') for element in column_list],
        *[f.mean(f.col(element)).alias(f'mean_{element}') for element in column_list],
        *[f.max(f.col(element)).alias(f'max_{element}') for element in column_list]
    )
)
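One more detail worth noting (my addition, not part of the original answer): pyspark.sql.functions.stddev is an alias for stddev_samp, i.e. the sample standard deviation. If you need the population standard deviation (np.std's default), stddev_pop is available; a minimal sketch against the dataframe defined in the question:

import pyspark.sql.functions as f

column_list = ['Subject_1', 'Subject_2', 'Subject_3']

# population standard deviation per school, matching np.std's default (ddof=0)
pop_df = (
    dataframe
    .groupBy('School')
    .agg(*[f.stddev_pop(f.col(c)).alias(f'stddev_pop_{c}') for c in column_list])
)
pop_df.show()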
