I have a dataset with the following schema:
```
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- subEntities: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- subEntityId: long (nullable = true)
 |    |    |-- subEntityName: string (nullable = true)
```
`dataset.select($"id", $"name", $"subEntities.subEntityId", $"subEntities.subEntityName")` puts `subEntityId` and `subEntityName` into separate arrays. How can I select multiple columns and put them into a single array?
2 answers
68bkxrlz1#
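Extract the values from the array with `getField`. Here is a working example:
```scala
import org.apache.spark.sql.functions.col
dataset
  .withColumn("status", col("subEntities").getField("status"))
  .withColumn("subEntityId", col("subEntities").getField("subEntityId"))
```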
sxpgvts32#
If you are working with Spark >= 2.4, you can use the `transform` function to produce an array containing a subset of the original array's fields:
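A minimal sketch of what this could look like, assuming the schema from the question. The case classes `SubEntity` and `Entity`, the `TransformExample` object, and the sample row are hypothetical and exist only to make the snippet self-contained. It calls the SQL higher-order function `transform` through `expr` and uses `named_struct` so the field names in the result are explicit:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Hypothetical case classes mirroring the schema in the question.
case class SubEntity(status: String, subEntityId: Long, subEntityName: String)
case class Entity(id: String, name: String, subEntities: Seq[SubEntity])

object TransformExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-example")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val dataset = Seq(
      Entity("1", "first", Seq(
        SubEntity("active", 10L, "a"),
        SubEntity("inactive", 11L, "b")))
    ).toDS()

    // For each element x of subEntities, build a struct that keeps only
    // subEntityId and subEntityName; status is dropped.
    val result = dataset.select(
      $"id",
      $"name",
      expr(
        "transform(subEntities, x -> " +
          "named_struct('subEntityId', x.subEntityId, 'subEntityName', x.subEntityName))"
      ).as("subEntities")
    )

    result.printSchema()
    result.show(truncate = false)

    spark.stop()
  }
}
```

With this, `subEntities` stays a single array column whose elements are structs with just the two selected fields, instead of two parallel arrays.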