PySpark & pandas DataFrames: how to group by with an aggregate function (e.g. standard deviation)
data = [['_1', 'S1', 12, 112, 14],
        ['_2', 'S1', 120, 112, 114],
        ['_3', 'S2', 88, 92, 74],
        ['_4', 'S2', 17, 118, 133],
        ['_5', 'S2', 19, 19, 14],
        ['_6', 'S2', 11, 12, 14]]
columns = ['RowNum', 'School', 'Subject_1', 'Subject_2', 'Subject_3']
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
pandasDF = dataframe.toPandas()
+------+------+---------+---------+---------+
|RowNum|School|Subject_1|Subject_2|Subject_3|
+------+------+---------+---------+---------+
| _1| S1| 12| 112| 14|
| _2| S1| 120| 112| 114|
| _3| S2| 88| 92| 74|
| _4| S2| 17| 118| 133|
| _5| S2| 19| 19| 14|
| _6| S2| 11| 12| 14|
+------+------+---------+---------+---------+
Given this data, how do I group by School and compute the standard deviation of each subject in PySpark, the way I would with pandas?
import numpy as np

def std(x):
    # np.std defaults to the population standard deviation (ddof=0)
    return np.std(x)

# drop the RowNum column (axis=1, not axis=0), then aggregate per school
pandasDF.drop(['RowNum'], axis=1).groupby('School').agg(['mean', 'max', std])
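One subtlety worth noting before porting this to PySpark: the string `'std'` in pandas `.agg()` computes the *sample* standard deviation (ddof=1), while the custom `np.std` wrapper above computes the *population* standard deviation (ddof=0), so the two give different numbers. A minimal sketch illustrating the difference (the `np_std` helper name is mine, not from the question):

```python
import numpy as np
import pandas as pd

def np_std(x):
    # population standard deviation (ddof=0), as in the question's wrapper
    return np.std(x)

pdf = pd.DataFrame({'School': ['S1', 'S1'], 'Subject_1': [12, 120]})
out = pdf.groupby('School')['Subject_1'].agg(['std', np_std])
# 'std' is the sample std (ddof=1): 54*sqrt(2) ≈ 76.37
# 'np_std' is the population std (ddof=0): 54.0
```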
1 Answer
I think you're looking for something like this: