Concatenate strings row-wise in PySpark

ni65a41a  posted on 2023-02-09  in Apache
Follow (0) | Answers (3) | Views (137)

I have a PySpark DataFrame like this:

DOCTOR | PATIENT
JOHN   | SAM
JOHN   | PETER
JOHN   | ROBIN
BEN    | ROSE
BEN    | GRAY

and I need to concatenate the patient names row-wise so that I get output like this:

DOCTOR | PATIENT
JOHN   | SAM, PETER, ROBIN
BEN    | ROSE, GRAY

Can anyone help me build this DataFrame in PySpark?
Thanks in advance.


neskvpey1#

The simplest way I can think of is to use collect_list:

import pyspark.sql.functions as f

# Collect the col2 values of each group into a list, then join them with ", "
df.groupby("col1").agg(f.concat_ws(", ", f.collect_list(df.col2)).alias("col2"))

368yc8dk2#

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

data = [
  ("U_104", "food"),
  ("U_103", "cosmetics"),
  ("U_103", "children"),
  ("U_104", "groceries"),
  ("U_103", "food")
]
schema = StructType([
  StructField("user_id", StringType(), True),
  StructField("category", StringType(), True),
])
spark = SparkSession.builder.appName("groupby").getOrCreate()
df = spark.createDataFrame(data, schema)
# Group by user_id, collect the categories into a list, and join them with ","
group_df = df.groupBy(f.col("user_id")).agg(
  f.concat_ws(",", f.collect_list(f.col("category"))).alias("categories")
)
group_df.show()
+-------+--------------------+
|user_id|          categories|
+-------+--------------------+
|  U_104|      food,groceries|
|  U_103|cosmetics,childre...|
+-------+--------------------+

Here are some useful aggregation examples.


watbbzwu3#

With Spark SQL you can do:

SELECT col1, col2, col3,
       REPLACE(REPLACE(CAST(collect_list(col4) AS STRING), '[', ''), ']', '')
FROM your_table
GROUP BY col1, col2, col3
