pyspark 将数组中行中的字段提取到列

2ic8powd  于 2023-10-15  发布在  Spark
关注(0)|答案(1)|浏览(137)

我正在努力从最高级别的行数组中提取值到列。spark框架的简短版本如下

  1. root
  2. |-- data: struct (nullable = true)
  3. | |-- date: string (nullable = true)
  4. | |-- id: string (nullable = true)
  5. | |-- specifications: array (nullable = true) ## lots of specs here, ~20, they must go to columns
  6. | | |-- element: struct (containsNull = true)
  7. | | | |-- speccode: long (nullable = true) # to ignore
  8. | | | |-- specdecryption: string (nullable = true) # to ignore
  9. | | | |-- specname: string (nullable = true) # column name
  10. | | | |-- specvalue: string (nullable = true) # to value in that column
  11. | |-- begin: long (nullable = true)
  12. | |-- end: long (nullable = true)
  13. |-- kafka_offset: long (nullable = true) # to ignore

字段“规格”是数组,包含约。20、每个都有自己的键和值。所有行的键都相同。specname中的值必须成为列名,specvalue必须进入该列中的值。

  1. 'specifications':
  2. [Row(speccode=123, specdecryption=None, specname='Color', specvalue='red'),
  3. Row(speccode=234, specdecryption=None, specname='Power', specvalue='155'),
  4. Row(speccode=134, specdecryption=None, specname='Speed', specvalue='198'),
  5. Row(speccode=229, specdecryption=None, specname='Length',specvalue='4658'),...]

我需要把它转换成一个框架柱

  1. | date |id | spec_color| spec_power| spec_speed| spec_length| begin | end |
  2. -------------------------------------------------------------------------------------------------
  3. | 2023-08-29| 1 | red | 155 | 198 | 4698 |2023-08-29|2023-08-30|
  4. | 2023-08-29| 2 | blue | 199 | 220 | 4540 |2023-08-29|2023-08-30|
wd2eg0qa

wd2eg0qa1#

data结构体中提取相关列,然后使用inline on specifications column将一个结构体数组分解为一个表,然后透视该结构体以重新塑造它。

  1. cols = ['date', 'id', 'begin', 'end']
  2. (
  3. df
  4. .select(*[F.col('data')[c].alias(c) for c in cols + ['specifications']])
  5. .select(*cols, F.inline('specifications'))
  6. .withColumn('specname', F.expr("'spec_' || lower(specname)"))
  7. .groupby(*cols).pivot('specname').agg(F.first('specvalue'))
  8. )
  1. +----------+---+----------+----------+----------+-----------+----------+----------+
  2. | date| id| begin| end|spec_color|spec_length|spec_power|spec_speed|
  3. +----------+---+----------+----------+----------+-----------+----------+----------+
  4. |2022-02-03| 2|2022-06-02|2022-06-01| red| 4658| 155| 198|
  5. |2022-02-06| 1|2022-06-02|2022-06-01| red| 4658| 155| 198|
  6. +----------+---+----------+----------+----------+-----------+----------+----------+
展开查看全部

相关问题