spark-union将Dataframe行合并为一行

7cjasjjr  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(540)

我有一个dataframe,它包含具有相同id的行。我需要将具有相同id的所有行合并为一行(一个json)
以下是数据示例:

id  first_name   last_name
1    JAMES         SMITH
2    MARY          BROWN
2    DAVID         WILLIAMS
1    ROBERT        DAVIS

请求的结果是:

{
  id:1,
  entities: [{
    first_name:JAMES,
    last_name:SMITH 
   }, {
    first_name:ROBERT,
    last_name:DAVIS
  }]
}
{
  id:2,
  entities: [{
    first_name:MARY,
    last_name:BROWN 
   }, {
    first_name:DAVID,
    last_name:WILLIAMS
  }]
}

能做到吗?
你好,亚尼夫

ohfgkhjo

ohfgkhjo1#

你可以用 groupBy 以及 collect_list 将相关列“合并”到单个嵌套结构中后:

val input: DataFrame = Seq(
  (1, "JAMES", "SMITH"),
  (2, "MARY", "BROWN"),
  (2, "DAVID", "WILLIAMS"),
  (1, "ROBERT", "DAVIS")
).toDF("id", "first_name", "last_name")

import org.apache.spark.sql.functions._
val result = input
  .withColumn("entity", struct($"first_name", $"last_name"))
  .groupBy("id").agg(collect_list($"entity"))

result.show(false)
// +---+--------------------------------+
// |id |entities                        |
// +---+--------------------------------+
// |1  |[[JAMES,SMITH], [ROBERT,DAVIS]] |
// |2  |[[MARY,BROWN], [DAVID,WILLIAMS]]|
// +---+--------------------------------+

result.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- entities: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- first_name: string (nullable = true)
//  |    |    |-- last_name: string (nullable = true)

相关问题