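I'm struggling to extract values from an array of Rows, nested in a top-level struct, into columns. A short version of the Spark DataFrame schema is below: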
root
|-- data: struct (nullable = true)
| |-- date: string (nullable = true)
| |-- id: string (nullable = true)
| |-- specifications: array (nullable = true) ## lots of specs here, ~20, they must go to columns
| | |-- element: struct (containsNull = true)
| | | |-- speccode: long (nullable = true) # to ignore
| | | |-- specdecryption: string (nullable = true) # to ignore
| | | |-- specname: string (nullable = true) # column name
| | | |-- specvalue: string (nullable = true) # value for that column
| |-- begin: long (nullable = true)
| |-- end: long (nullable = true)
|-- kafka_offset: long (nullable = true) # to ignore
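The 'specifications' field is an array of about 20 structs, each with its own key and value. The keys are the same for every row. Each value in specname must become a column name, and the matching specvalue must become the value in that column.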
'specifications':
[Row(speccode=123, specdecryption=None, specname='Color', specvalue='red'),
Row(speccode=234, specdecryption=None, specname='Power', specvalue='155'),
Row(speccode=134, specdecryption=None, specname='Speed', specvalue='198'),
Row(speccode=229, specdecryption=None, specname='Length',specvalue='4658'),...]
I need to convert it into a DataFrame with these columns:
| date |id | spec_color| spec_power| spec_speed| spec_length| begin | end |
-------------------------------------------------------------------------------------------------
| 2023-08-29| 1 | red | 155 | 198 | 4698 |2023-08-29|2023-08-30|
| 2023-08-29| 2 | blue | 199 | 220 | 4540 |2023-08-29|2023-08-30|
1 Answer
Extract the relevant columns from the data struct, then use inline on the specifications column to explode the array of structs into a table, then pivot the result to reshape it.