Scala: add a column's value to each element of an array in another column

Posted by xiozqbni on 2021-07-12 in Spark

Source JSON data

{"ID": "ABC", "Amt": 23077, "col": [{"Seq": 1, "Pct": 1.5, "Sh": 1},{"Seq": 2, "Pct": 1.2, "Sh": 2.5}]}

Schema

ID:string
Amt:long
col:array
    element:struct
        Seq:int
        Pct:double
        Sh:double

I have a DataFrame that shows the following output

+----+-------+-----------------------------+
|ID  |Amt    |col                          |
+----+-------+-----------------------------+
|ABC |23077  |[[1, 1.5, 1], [2, 1.2, 2.5]] |
+----+-------+-----------------------------+
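
To reproduce this DataFrame locally, here is a minimal sketch that reads the JSON line above with an explicit schema. The object name `SampleData`, the `load` helper, and the local `SparkSession` settings are illustrative assumptions, not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object SampleData {
  // Explicit schema matching the one listed in the question. Without it,
  // JSON schema inference would read Seq as long rather than int.
  val schema: StructType = StructType(Seq(
    StructField("ID", StringType),
    StructField("Amt", LongType),
    StructField("col", ArrayType(StructType(Seq(
      StructField("Seq", IntegerType),
      StructField("Pct", DoubleType),
      StructField("Sh", DoubleType)
    ))))
  ))

  // Hypothetical helper: parses the single JSON record from the question.
  def load(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val json = Seq(
      """{"ID": "ABC", "Amt": 23077, "col": [{"Seq": 1, "Pct": 1.5, "Sh": 1},{"Seq": 2, "Pct": 1.2, "Sh": 2.5}]}"""
    )
    spark.read.schema(schema).json(json.toDS())
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder().master("local[1]").appName("sample-data").getOrCreate()
    val df = load(spark)
    df.printSchema()
    df.show(false)
    spark.stop()
  }
}
```

In `spark-shell`, where a `spark` session already exists, `val df = SampleData.load(spark)` should be enough to get a `df` with this shape.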

I need to append the value of the Amt column to the end of each element of the array in col.

+----+-------+-------------------------------------------+
|ID  |Amt    |col1                                       |
+----+-------+-------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 23077], [2, 1.2, 2.5, 23077]] |
+----+-------+-------------------------------------------+

8tntrjer #1

If your Spark version is >= 2.4, you can use transform to add a field to each struct:

val df2 = df.selectExpr(
    "Amt",
    "ID",
    "transform(col, x -> struct(x.Seq as Seq, x.Pct as Pct, x.Sh as Sh, Amt)) as col1"
)

df2.show(false)
+-----+---+--------------------------------------------+
|Amt  |ID |col1                                        |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+-----+---+--------------------------------------------+
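
On newer Spark versions the same idea can be sketched without listing every struct field. This assumes Spark 3.1 or later, where the Scala `functions.transform` (added in 3.0) and `Column.withField` (added in 3.1) are both available. The object name `AppendAmt` and the helpers `addAmt` and `sample` are illustrative, not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, transform}

object AppendAmt {
  // Appends the row's Amt as a new last field of every struct in `col`.
  // Assumes Spark 3.1+ (Column.withField).
  def addAmt(df: DataFrame): DataFrame =
    df.select(
      col("ID"),
      col("Amt"),
      transform(col("col"), x => x.withField("Amt", col("Amt"))).as("col1")
    )

  // Hypothetical sample row mirroring the data in the question.
  def sample(spark: SparkSession): DataFrame =
    spark.sql(
      """SELECT 'ABC' AS ID, 23077L AS Amt,
        |       array(named_struct('Seq', 1, 'Pct', 1.5D, 'Sh', 1.0D),
        |             named_struct('Seq', 2, 'Pct', 1.2D, 'Sh', 2.5D)) AS col""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder().master("local[1]").appName("append-amt").getOrCreate()
    addAmt(sample(spark)).show(false)
    spark.stop()
  }
}
```

The practical difference from the `selectExpr` version is that the struct's existing fields do not have to be spelled out, so the expression keeps working if fields are later added to the elements.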

For older Spark versions, you can explode the array of structs and rebuild it:

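import org.apache.spark.sql.functions.{col, collect_list, struct}

// inline(col) expands each struct into Seq, Pct and Sh columns, one row per element
val df2 = df.selectExpr("Amt","ID","inline(col)")
            .groupBy("ID","Amt")
            .agg(collect_list(struct(col("Seq"),col("Pct"),col("Sh"),col("Amt"))).as("col1"))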

df2.show(false)
+---+-----+--------------------------------------------+
|ID |Amt  |col1                                        |
+---+-----+--------------------------------------------+
|ABC|23077|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+---+-----+--------------------------------------------+
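
One caveat with the explode-and-regroup approach: `collect_list` does not guarantee element order after the shuffle, and rows that share the same ID and Amt get merged into one array. If that matters, a UDF is one possible alternative on older Spark versions. The sketch below is not from the answer above; the names `AppendAmtUdf`, `ElemWithAmt`, `addAmt` and `sample` are hypothetical, and it assumes the element fields have exactly the types listed in the question.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object AppendAmtUdf {
  // Hypothetical output element type: the original three fields plus Amt.
  case class ElemWithAmt(Seq: Int, Pct: Double, Sh: Double, Amt: Long)

  // Rebuilds each struct with Amt appended, keeping the original element order.
  // A null array is passed through as null.
  val addAmt = udf { (elems: Seq[Row], amt: Long) =>
    Option(elems).map(_.map { r =>
      ElemWithAmt(r.getAs[Int]("Seq"), r.getAs[Double]("Pct"), r.getAs[Double]("Sh"), amt)
    }).orNull
  }

  // Hypothetical sample row mirroring the data in the question.
  def sample(spark: SparkSession): DataFrame =
    spark.sql(
      """SELECT 'ABC' AS ID, 23077L AS Amt,
        |       array(named_struct('Seq', 1, 'Pct', 1.5D, 'Sh', 1.0D),
        |             named_struct('Seq', 2, 'Pct', 1.2D, 'Sh', 2.5D)) AS col""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder().master("local[1]").appName("append-amt-udf").getOrCreate()
    val df2 = sample(spark).select(col("ID"), col("Amt"), addAmt(col("col"), col("Amt")).as("col1"))
    df2.show(false)
    spark.stop()
  }
}
```

The trade-off is that a UDF is opaque to the optimizer and has to be updated by hand if the element schema changes, so the built-in `transform` is usually preferable wherever it is available.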
