使用JavaSpark将嵌套数组分解为新列

dly7yett  于 2023-03-09  发布在  Apache
关注(0)|答案(1)|浏览(131)

我有一个嵌套数组,我想在其中获取所有元素,并将它们分别放入一个新列。这是我目前所拥有的。尝试编写2个方法,但都不起作用。我从未注解代码中得到的当前错误是
cannot resolve 'split(response.indicator, ',')' due to data type mismatch: argument 1 requires string type, however, 'response.indicator' is of array<struct<_VALUE:string,_number:bigint>> type.;;

File.withColumn("response.indicator", explode(col("response.ind")))
            .withColumn("response.indicator", split(col("response.indicator"), ","))
            .withColumn("key", col("response.indicator").getItem(1))
            .withColumn("value", col("response.indicator").getItem(0))
            .groupBy("ID")
            .pivot("key")
            .agg(first("value"))
            .show(true);

    /*File.select("response.indicator").collectAsList().forEach(row -> {
        String name = String.valueOf(row.getList(0).get(1));
        String value = String.valueOf(row.getList(0).get(0));
        File.withColumn(name, col(value));
    });*/

下面是架构

|-- ID: integer (nullable = true)
 |-- response: struct (nullable = true)
 |    |-- indicator: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _number: long (nullable = true)

我的数据的外观

+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID                  |response     
+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1                  |[WrappedArray([N,7], [N,8], [N,9], [N,19], [N,20], [N,22], [N,12], [N,1], [N,2], [N,3], [N,4], [N,5], [N,6], [N,10], [N,11], [N,13], [N,14], [N,15], [N,16], [N,17], [N,18], [N,21], [N,25], [N,26])]  |  
| 2                  |[WrappedArray([Y,1], [N,8], [N,9], [N,19], [N,22], [Y,22], [N,20], [Y,7], [Y,23], [N,3], [Y,4], [N,11], [N,6], [Y,27], [N,5], [N,13], [N,14], [N,15], [Y,16], [N,17], [Y,18], [N,21], [N,25], [N,26])]    |
+--------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

我想让它看起来像

+--------------------+-----------------------------+
|ID                  | 1 | 2    | 3 |  etc 
+--------------------+-----------------------------+
| 1                  | N | N    | N | etc 
| 2                  | Y | NULL | N |  etc
+--------------------+-----------------------------+
lp0sw83n

lp0sw83n1#

问题是split(col("response.indicator"), ",")需要一个字符串列,而response.indicator实际上是一个struct。要“展开”一个名为sstruct,可以按如下所示使用s.*

// I use the names provided in the schema, not the ones from your code.
File.withColumn("indicator", explode(col("response.indicator")))
    .select("ID", "indicator.*")
    .groupBy("ID")
    .pivot("_number")
    .agg(first("_value"))
    .show();

相关问题