I have a Spark DataFrame, for example:
+---+---------+------+
| id| timeSlot| ratio|
+---+---------+------+
|100| lunch| 0.2|
|100| morning| 0.3|
|100| dinner| 0.5|
|101| morning| 0.1|
|101| lunch| 0.2|
|101| dinner| 0.7|
|102| lunch| 0.4|
|102| morning| 0.3|
|102| dinner| 0.3|
+---+---------+------+
I want to convert it to the following format: pivot by timeSlot into columns, with id as the row index and ratio as the values:
+---+--------+------+-------+
| id| morning| lunch| dinner|
+---+--------+------+-------+
|100| 0.3| 0.2| 0.5|
|101| 0.1| 0.2| 0.7|
|102| 0.3| 0.4| 0.3|
+---+--------+------+-------+
For operations like this in pandas, I use the multi-index methods set_index() and unstack():
import pandas as pd
ids = pd.Series([100, 100, 100, 101, 101, 101, 102, 102, 102])
timeSlots = pd.Series(["lunch", "morning", "dinner", "morning",
                       "lunch", "dinner", "lunch", "morning", "dinner"])
ratios = pd.Series([0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.4, 0.3, 0.3])
df = pd.DataFrame({"id": ids, "timeSlot": timeSlots, "ratio": ratios})  # create the initial DataFrame
df = df.set_index(['id', 'timeSlot'])  # main goal 1: set 'id' and 'timeSlot' as a MultiIndex
df = df.unstack()                      # main goal 2: move 'timeSlot' into the column index
df = df['ratio']                       # drop one layer of the column index
df = df[['morning', 'lunch', 'dinner']]  # reorder the columns
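(As an aside, the same pandas result can also be produced in a single step with pivot, which may be a closer analogue to what I'm after; a minimal, self-contained sketch of that variant:)

```python
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 100, 101, 101, 101, 102, 102, 102],
    "timeSlot": ["lunch", "morning", "dinner", "morning",
                 "lunch", "dinner", "lunch", "morning", "dinner"],
    "ratio": [0.2, 0.3, 0.5, 0.1, 0.2, 0.7, 0.4, 0.3, 0.3],
})

# pivot does the set_index + unstack + layer-drop in one call
wide = df.pivot(index="id", columns="timeSlot", values="ratio")
wide = wide[["morning", "lunch", "dinner"]]  # reorder the columns
print(wide)
```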
Is there a Spark equivalent for multi-index operations like set_index() or unstack()? If not, is there another way to achieve the same result?
1 Answer
You can use groupby and pivot in PySpark:
```python
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# your initial df
df = pd.DataFrame({"id": ids, "timeSlot": timeSlots, "ratio": ratios})

# convert the pandas df to a Spark df
spark_df = spark.createDataFrame(df)

# combine groupby, pivot and agg to get your result
spark_df.groupby("id") \
    .pivot("timeSlot") \
    .agg(F.sum("ratio")) \
    .show()
```
+---+------+-----+-------+
| id|dinner|lunch|morning|
+---+------+-----+-------+
|100| 0.5| 0.2| 0.3|
|101| 0.7| 0.2| 0.1|
|102| 0.3| 0.4| 0.3|
+---+------+-----+-------+