有没有类似set_index()或unstack()这样的多索引操作的spark等价物?

jqjz2hbq  于 2021-05-29  发布在  Spark
关注(0)|答案(1)|浏览(363)

我有一个sparkDataframe,比如:

+---+---------+------+
| id| timeSlot| ratio|
+---+---------+------+
|100|    lunch|   0.2|
|100|  morning|   0.3|
|100|   dinner|   0.5|
|101|  morning|   0.1|
|101|    lunch|   0.2|
|101|   dinner|   0.7|
|102|    lunch|   0.4|
|102|  morning|   0.3|
|102|   dinner|   0.3|
+---+---------+------+

我想将其转换为以下格式:pivot by timeSlot 在列中, id 作为行索引,以及 ratio :

+---+--------+------+-------+
| id| morning| lunch| dinner|
+---+--------+------+-------+
|100|     0.3|   0.2|    0.5|
|101|     0.1|   0.2|    0.7|
|102|     0.3|   0.4|    0.3|
+---+--------+------+-------+

对于这样的操作,在pandas中我使用多索引方法,例如 set_index() 以及 unstack() :

import pandas as pd

ids = pd.Series([100, 100, 100, 101, 101, 101, 102 ,102, 102])
timeSlots = pd.Series(["lunch", "morning", "dinner", "morning", 
                       "lunch", "dinner", "lunch" ,"morning", "dinner"])
ratios = pd.Series([0.2,0.3,0.5,0.1,0.2,0.7,0.4,0.3,0.3])
df = pd.DataFrame({"id": ids, "timeSlot": timeSlots, "ratio": ratios}) # create initial data frame

df = df.set_index(['id', 'timeSlot']) # main goal1 : Set 'id' and 'timeSlot' as multi indexes.
df = df.unstack() # main goal2 : Switch 'timeSlot' to column index
df = df['ratio'] # remove one layer of column index
df = df[['morning', 'lunch', 'dinner']] # reorder columns

像这样的多指标运算有没有Spark等效法 set_index() 或者 unstack() ?
如果没有,还有其他方法可以达到同样的目的吗?

kxeu7u2r

kxeu7u2r1#

你可以用 groupby 以及 pivotpyspark ```
import pandas as pd
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

your initial df

df = pd.DataFrame({"id": ids, "timeSlot": timeSlots, "ratio": ratios})

convert pandas df to spark df

spark_df = spark.createDataFrame(df)

combine groupby, pivot and agg methods to get your result

spark_df.groupby("id")
.pivot("timeSlot")
.agg(F.sum("ratio"))\
.show()

+---+------+-----+-------+

| id|dinner|lunch|morning|

+---+------+-----+-------+

|100| 0.5| 0.2| 0.3|

|101| 0.7| 0.2| 0.1|

|102| 0.3| 0.4| 0.3|

+---+------+-----+-------+

请注意,我使用sum作为一个聚合函数,但由于您只有一个row by(id,timeslot)组合,您还可以使用其他函数,如mean,max。。。

相关问题