agg函数将多行转换为不同类型的多列

eqqqjvef 于 2021-05-27 发布在 Spark

关注(0)|答案(2)|浏览(471)

我想将具有相同id的多行的值转换为列，但每列的类型不同。
输入数据：

val dataInput = List(
  Row( "meta001","duration", 2 , null, null),
  Row("meta001","price", 300 , null , null),
  Row("meta001","name", null , null , "name"),
  Row("meta001","exist", null , true , null),
  Row("meta002","price", 400 , null,  null),
  Row("meta002","duration", 3 , null, null)
)

val schemaInput = new StructType()
  .add("id",StringType,true)
  .add("code",StringType,true)
  .add("integer value",IntegerType,true)
  .add("boolean value",BooleanType,true)
  .add("string value",StringType,true)

var dfInput = spark.createDataFrame(
  spark.sparkContext.parallelize(dataInput),
  schemaInput
)

    +-------+--------+-------------+-------------+------------+
    |     id|    code|integer value|boolean value|string value|
    +-------+--------+-------------+-------------+------------+
    |meta001|duration|            2|         null|        null|
    |meta001|   price|          300|         null|        null|
    |meta001|    name|         null|         null|        name|
    |meta001|   exist|         null|         true|        null|
    |meta002|   price|          400|         null|        null|
    |meta002|duration|            3|         null|        null|
    +-------+--------+-------------+-------------+------------+

预期产量：

+-------+--------+-------------+-------------+------------+
    |     id|duration|price         |name        |exist       |
    +-------+--------+-------------+-------------+------------+
    |meta001|       2|          300|         name|        true|
    |meta002|       3|          400|         null|        null|
    +-------+--------+-------------+-------------+------------+

我想我应该使用groupby和pivot函数，但是当我应该使用agg result时，我有点迷失了：

dfInput.groupby("id").pivot("code",Seq("duration","price","name","exist").agg(???)

scala apache-spark

来源：https://stackoverflow.com/questions/63779246/agg-function-to-transform-multiple-rows-to-multiple-columns-with-different-type

2条答案

按热度按时间

jm2pwxwz1#

你不需要在这里支点，只要合并 first 用一个 when :

dfInput
  .groupBy($"id")
  .agg(
    first(when($"code" === "duration", $"integer value"), ignoreNulls = true).as("duration"),
    first(when($"code" === "price", $"integer value"), ignoreNulls = true).as("price"),
    first(when($"code" === "name", $"string value"), ignoreNulls = true).as("name"),
    first(when($"code" === "exist", $"boolean value"), ignoreNulls = true).as("exist")
  )
  .show()

给予

+-------+--------+-----+----+-----+
|     id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001|       2|  300|name| true|
|meta002|       3|  400|null| null|
+-------+--------+-----+----+-----+

赞(0）回复(0）举报 2021-05-27

5cg8jx4n2#

你可以用 coalesce 函数，但所有列都应该是相同的类型，例如“string”。

df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
  .agg(first(coalesce($"integer value".cast("string"), $"boolean value".cast("string"), $"string value".cast("string"))))
  .show()

+-------+--------+-----+----+-----+
|     id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001|       2|  300|name| true|
|meta002|       3|  400|null| null|
+-------+--------+-----+----+-----+

如果要保留数据类型，那么函数应该更复杂，可能需要条件语句。或者只需旋转到所有列并选择所需的列。

df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
  .agg(first($"integer value").as("int"), first($"boolean value").as("bool"), first($"string value").as("string"))
  .select("id", "duration_int", "price_int", "name_string", "exist_bool")
  .show()

+-------+------------+---------+-----------+----------+
|     id|duration_int|price_int|name_string|exist_bool|
+-------+------------+---------+-----------+----------+
|meta001|           2|      300|       name|      true|
|meta002|           3|      400|       null|      null|
+-------+------------+---------+-----------+----------+

root
 |-- id: string (nullable = true)
 |-- duration_int: integer (nullable = true)
 |-- price_int: integer (nullable = true)
 |-- name_string: string (nullable = true)
 |-- exist_bool: boolean (nullable = true)

赞(0）回复(0）举报 2021-05-27

我来回答

agg函数将多行转换为不同类型的多列

2条答案

相关问题

热门标签

最新问答