agg函数将多行转换为不同类型的多列

eqqqjvef  于 2021-05-27  发布在  Spark
关注(0)|答案(2)|浏览(471)

我想将具有相同id的多行的值转换为列,但每列的类型不同。
输入数据:

val dataInput = List(
  Row( "meta001","duration", 2 , null, null),
  Row("meta001","price", 300 , null , null),
  Row("meta001","name", null , null , "name"),
  Row("meta001","exist", null , true , null),
  Row("meta002","price", 400 , null,  null),
  Row("meta002","duration", 3 , null, null)
)

val schemaInput = new StructType()
  .add("id",StringType,true)
  .add("code",StringType,true)
  .add("integer value",IntegerType,true)
  .add("boolean value",BooleanType,true)
  .add("string value",StringType,true)

var dfInput = spark.createDataFrame(
  spark.sparkContext.parallelize(dataInput),
  schemaInput
)

    +-------+--------+-------------+-------------+------------+
    |     id|    code|integer value|boolean value|string value|
    +-------+--------+-------------+-------------+------------+
    |meta001|duration|            2|         null|        null|
    |meta001|   price|          300|         null|        null|
    |meta001|    name|         null|         null|        name|
    |meta001|   exist|         null|         true|        null|
    |meta002|   price|          400|         null|        null|
    |meta002|duration|            3|         null|        null|
    +-------+--------+-------------+-------------+------------+

预期产量:

+-------+--------+-------------+-------------+------------+
    |     id|duration|price         |name        |exist       |
    +-------+--------+-------------+-------------+------------+
    |meta001|       2|          300|         name|        true|
    |meta002|       3|          400|         null|        null|
    +-------+--------+-------------+-------------+------------+

我想我应该使用groupby和pivot函数,但是当我应该使用agg result时,我有点迷失了:

dfInput.groupby("id").pivot("code",Seq("duration","price","name","exist").agg(???)
jm2pwxwz

jm2pwxwz1#

你不需要在这里支点,只要合并 first 用一个 when :

dfInput
  .groupBy($"id")
  .agg(
    first(when($"code" === "duration", $"integer value"), ignoreNulls = true).as("duration"),
    first(when($"code" === "price", $"integer value"), ignoreNulls = true).as("price"),
    first(when($"code" === "name", $"string value"), ignoreNulls = true).as("name"),
    first(when($"code" === "exist", $"boolean value"), ignoreNulls = true).as("exist")
  )
  .show()

给予

+-------+--------+-----+----+-----+
|     id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001|       2|  300|name| true|
|meta002|       3|  400|null| null|
+-------+--------+-----+----+-----+
5cg8jx4n

5cg8jx4n2#

你可以用 coalesce 函数,但所有列都应该是相同的类型,例如“string”。

df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
  .agg(first(coalesce($"integer value".cast("string"), $"boolean value".cast("string"), $"string value".cast("string"))))
  .show()

+-------+--------+-----+----+-----+
|     id|duration|price|name|exist|
+-------+--------+-----+----+-----+
|meta001|       2|  300|name| true|
|meta002|       3|  400|null| null|
+-------+--------+-----+----+-----+

如果要保留数据类型,那么函数应该更复杂,可能需要条件语句。或者只需旋转到所有列并选择所需的列。

df.groupBy("id").pivot("code",Seq("duration","price","name","exist"))
  .agg(first($"integer value").as("int"), first($"boolean value").as("bool"), first($"string value").as("string"))
  .select("id", "duration_int", "price_int", "name_string", "exist_bool")
  .show()

+-------+------------+---------+-----------+----------+
|     id|duration_int|price_int|name_string|exist_bool|
+-------+------------+---------+-----------+----------+
|meta001|           2|      300|       name|      true|
|meta002|           3|      400|       null|      null|
+-------+------------+---------+-----------+----------+

root
 |-- id: string (nullable = true)
 |-- duration_int: integer (nullable = true)
 |-- price_int: integer (nullable = true)
 |-- name_string: string (nullable = true)
 |-- exist_bool: boolean (nullable = true)

相关问题