Pyspark GroupBy Flatten row

mi7gmzs6  于 2023-11-16  发布在  Spark
关注(0)|答案(1)|浏览(121)

我一直在尝试在pyspark中将行展平,因为我的数据集如下所示

|member_id|age|gender|      date|cost|
+---------+---+------+----------+-----------+
|        1| 35|  Male|2023-10-01|        200|
|        1| 35|  Male|2023-10-02|        210|
|        2| 28|Female|2023-10-01|        150|
|        2| 28|Female|2023-10-02|        160|
+---------+---+------+----------+-----------+

字符串
现在我想要的输出是:

[
    [
       [1,35, Male, 2023-10-01, 200],[1, 35, Male, 2023-10-02, 210]
    ],
    [
       [2, 28, Female, 2023-10-01, 150],[2, 28, Female, 2023-10-01, 160]
    ]
]


我试过了,但我做不到。

n53p2ov0

n53p2ov01#

简单的想法:
首先通过F.array函数将所有列收集到列表中

groupBy上做F.collect_list

from pyspark import SQLContext
from pyspark.sql.functions import *
import pyspark.sql.functions as F

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [

[1, 35,"Male","2023-10-01",200],
[1, 35,"Male","2023-10-02",210],
[2, 28,"Female","2023-10-01",150],
[2, 28,"Female","2023-10-02",160],

]

columns =["member_id", "age", "gender", "date", "cost"]

df1 = sqlContext.createDataFrame(data=data1, schema=columns)

df1.show(n=10, truncate=False)
print("Collect columns into list")
df2 = df1.withColumn("colList", F.array(*df1.columns))
df2.show(n=10, truncate=False)
print("collect_list on the groupBy")
df3 = df2.groupBy("member_id").agg(F.collect_list(F.col("colList")))
df3.show(n=10, truncate=False)

字符串
输出量:

+---------+---+------+----------+----+
|member_id|age|gender|date      |cost|
+---------+---+------+----------+----+
|1        |35 |Male  |2023-10-01|200 |
|1        |35 |Male  |2023-10-02|210 |
|2        |28 |Female|2023-10-01|150 |
|2        |28 |Female|2023-10-02|160 |
+---------+---+------+----------+----+

Collect columns into list
+---------+---+------+----------+----+--------------------------------+
|member_id|age|gender|date      |cost|colList                         |
+---------+---+------+----------+----+--------------------------------+
|1        |35 |Male  |2023-10-01|200 |[1, 35, Male, 2023-10-01, 200]  |
|1        |35 |Male  |2023-10-02|210 |[1, 35, Male, 2023-10-02, 210]  |
|2        |28 |Female|2023-10-01|150 |[2, 28, Female, 2023-10-01, 150]|
|2        |28 |Female|2023-10-02|160 |[2, 28, Female, 2023-10-02, 160]|
+---------+---+------+----------+----+--------------------------------+

collect_list on the groupBy
+---------+--------------------------------------------------------------------+
|member_id|collect_list(colList)                                               |
+---------+--------------------------------------------------------------------+
|1        |[[1, 35, Male, 2023-10-01, 200], [1, 35, Male, 2023-10-02, 210]]    |
|2        |[[2, 28, Female, 2023-10-01, 150], [2, 28, Female, 2023-10-02, 160]]|
+---------+--------------------------------------------------------------------+

相关问题