PySpark: extract fields from rows in an array into columns

Posted by 2ic8powd on 2023-10-15 in Spark


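I am struggling to extract values from an array of rows into top-level columns. A short version of the Spark DataFrame schema is as follows: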

root
 |-- data: struct (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- specifications: array (nullable = true)  ## lots of specs here, ~20, they must go to columns
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- speccode: long (nullable = true)   # to ignore
 |    |    |    |-- specdecryption: string (nullable = true)  # to ignore
 |    |    |    |-- specname: string (nullable = true)   # column name
 |    |    |    |-- specvalue: string (nullable = true)  # to value in that column
 |    |-- begin: long (nullable = true)
 |    |-- end: long (nullable = true)
 |-- kafka_offset: long (nullable = true)  # to ignore

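The field `specifications` is an array of about 20 structs, each with its own key and value. The keys are the same for all rows. Each value in specname must become a column name, and the matching specvalue must become the value in that column.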

'specifications': 
[Row(speccode=123, specdecryption=None, specname='Color', specvalue='red'),
 Row(speccode=234, specdecryption=None, specname='Power', specvalue='155'),
 Row(speccode=134, specdecryption=None, specname='Speed', specvalue='198'),
 Row(speccode=229, specdecryption=None, specname='Length',specvalue='4658'),...]

I need to transform it into DataFrame columns like this:

|  date     |id | spec_color| spec_power| spec_speed| spec_length|  begin   |   end    |
-------------------------------------------------------------------------------------------------
| 2023-08-29| 1 |   red     |  155      |  198      | 4698       |2023-08-29|2023-08-30|
| 2023-08-29| 2 |  blue     |  199      |  220      | 4540       |2023-08-29|2023-08-30|

pyspark

Source: https://stackoverflow.com/questions/77001530/extract-field-in-rows-in-array-to-columns

1 answer


wd2eg0qa1#

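Extract the relevant columns from the data struct, then use inline on the specifications column to explode the array of structs into a table (one row per spec, one column per struct field), and finally pivot on specname to reshape it into one column per spec.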

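from pyspark.sql import functions as F

cols = ['date', 'id', 'begin', 'end']
result = (
    df
    # pull the needed fields out of the data struct
    .select(*[F.col('data')[c].alias(c) for c in cols + ['specifications']])
    # one row per array element; F.inline requires PySpark 3.4+
    .select(*cols, F.inline('specifications'))
    # build the target column names, e.g. 'Color' -> 'spec_color'
    .withColumn('specname', F.expr("'spec_' || lower(specname)"))
    # one row per (date, id, begin, end), one column per specname
    .groupby(*cols).pivot('specname').agg(F.first('specvalue'))
)
result.show()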

+----------+---+----------+----------+----------+-----------+----------+----------+
|      date| id|     begin|       end|spec_color|spec_length|spec_power|spec_speed|
+----------+---+----------+----------+----------+-----------+----------+----------+
|2022-02-03|  2|2022-06-02|2022-06-01|       red|       4658|       155|       198|
|2022-02-06|  1|2022-06-02|2022-06-01|       red|       4658|       155|       198|
+----------+---+----------+----------+----------+-----------+----------+----------+
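
To make the two reshaping steps easier to follow, here is a minimal plain-Python sketch of the same logic on ordinary dicts, without Spark. It is only an illustration, not the answer's method: the sample records and the helper names `inline_rows` and `pivot_rows` are hypothetical, and the `begin`/`end` values are made-up longs.

```python
# Hypothetical sample records mirroring the schema in the question.
records = [
    {"data": {"date": "2023-08-29", "id": "1", "begin": 100, "end": 200,
              "specifications": [
                  {"speccode": 123, "specdecryption": None,
                   "specname": "Color", "specvalue": "red"},
                  {"speccode": 234, "specdecryption": None,
                   "specname": "Power", "specvalue": "155"},
              ]},
     "kafka_offset": 1},
    {"data": {"date": "2023-08-29", "id": "2", "begin": 100, "end": 200,
              "specifications": [
                  {"speccode": 123, "specdecryption": None,
                   "specname": "Color", "specvalue": "blue"},
                  {"speccode": 234, "specdecryption": None,
                   "specname": "Power", "specvalue": "199"},
              ]},
     "kafka_offset": 2},
]

cols = ["date", "id", "begin", "end"]


def inline_rows(recs):
    """Step 1, analogous to inline: one long row per element of the array."""
    for rec in recs:
        data = rec["data"]
        base = {c: data[c] for c in cols}
        # A missing array yields no rows, as with inline (not inline_outer).
        for spec in data["specifications"] or []:
            yield {**base, **spec}


def pivot_rows(long_rows):
    """Step 2, analogous to groupby + pivot + first: one wide row per key."""
    wide = {}
    for row in long_rows:
        key = tuple(row[c] for c in cols)
        out = wide.setdefault(key, dict(zip(cols, key)))
        # setdefault keeps the first value seen for each spec name.
        out.setdefault("spec_" + row["specname"].lower(), row["specvalue"])
    return list(wide.values())


result = pivot_rows(inline_rows(records))
for row in result:
    print(row)
```

Each record first fans out into one row per spec, and those rows are then folded back into a single row per (date, id, begin, end) with one `spec_*` key per spec name. The fields speccode and specdecryption are simply never copied into the wide row, which matches the pivot keeping only specvalue.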


