如何在PySpark中将array(struct(array))扁平化为array(struct)？

brvekthn 于 2024-01-06 发布在 Spark

关注(0)|答案(1)|浏览(198)

DataFrame列包含嵌套数组，嵌套结构位于数组之间（array（struct（array）：

root
 |-- a: integer (nullable = true)
 |-- list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- sub_list: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- c: integer (nullable = true)
 |    |    |    |    |-- foo: string (nullable = true)

字符串
我想把schema简化为一个struct数组：

root
 |-- a: integer (nullable = true)
 |-- list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- c: integer (nullable = true)
 |    |    |-- foo: string (nullable = true)

型
我用的是Spark 3.5.0。
我尝试了flatten，inline，explode，transform等，但没有任何成功。
更新：我发现了以下解决方案，适用于这个简化的示例：

df.select("a", inline("list")) \
.select(expr("*"), inline("sub_list")) \
.drop("sub_list") \
.groupBy("a") \
.agg(collect_list(struct("b", "c", "foo")).alias("list"))

型
但在更复杂的场景中，这会变得更复杂，因为似乎我必须将所有内容自上而下分解（“升级”）到行级别。
我更喜欢下面这样的东西（这会抛出一个错误）：

df.withColumn("list", transform(col("list"), lambda x: x.withField("sub_list", explode(x.getField("sub_list")))))

型
换句话说：我想将内部数组分解为外部数组（自下而上）。

pyspark

来源：https://stackoverflow.com/questions/77627701/how-to-flatten-arraystructarray-to-arraystruct-in-pyspark

1条答案

按热度按时间

e4yzc0pl1#

我想我明白了：

df.withColumn(
    "list",
    flatten(
        transform(
            col("list"),
            lambda x: transform(
                x.getField("sub_list"),
                lambda y: struct(x.getField("b"), y.getField("c"), y.getField("foo")),
            ),
        )
    ),
)

字符串
其逻辑是：
1.使用嵌套转换将列B从外部数组移动到内部数组
1.将转换后的数组展平为单个数组

展开查看全部

赞(0）回复(0）举报 2024-01-06

我来回答

如何在PySpark中将array(struct(array))扁平化为array(struct)？

1条答案

相关问题

热门标签

最新问答