pyspark: how to explode a list into multiple columns with sequential naming?

but5z9lq · asked on 2021-08-25 in Java

I have the following DataFrame:

+--------+--------------------+--------------------+
|      id|           resoFacts|             heating|
+--------+--------------------+--------------------+
|90179090|[, [No Handicap A...|[Central Heat, Fo...|
+--------+--------------------+--------------------+

created by:

from pyspark.sql.functions import col

(data_filt
     .where(col('id') == '90179090')
     .withColumn('heating', col("resoFacts").getField('heating')))

I want to create a df that expands the list in heating into sequentially named columns, like this:

+--------------+------------+----------+----------+---------+
|  id          |heating_1   |heating_2 | heating_3|heating_4|
+--------------+------------+----------+----------+---------+
|  90179090    |Central Heat|Forced Air|      Gas |Heat Pump|
+--------------+------------+----------+----------+---------+

My best attempt so far produces the following df:

+---+------------+----------+----+---------+
|pos|Central Heat|Forced Air| Gas|Heat Pump|
+---+------------+----------+----+---------+
|  1|        null|Forced Air|null|     null|
|  3|        null|      null| Gas|     null|
|  2|        null|      null|null|Heat Pump|
|  0|Central Heat|      null|null|     null|
+---+------------+----------+----+---------+

using this code:

from pyspark.sql.functions import col, posexplode, first

(data_filt
     .where(col('id') == '90179090')
     .withColumn('heating', col("resoFacts").getField('heating'))
     .select("heating", posexplode("heating"))
     .groupBy('pos').pivot('col').agg(first('col')))

I've probably got the line starting with groupBy wrong. Does anyone have an idea?
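
Edit: it seems closer to reverse the pivot: group by id and pivot on pos, then rename the position columns. A rough, untested sketch:

from pyspark.sql import functions as F

pivoted = (data_filt
     .where(F.col('id') == '90179090')
     .withColumn('heating', F.col("resoFacts").getField('heating'))
     .select('id', F.posexplode('heating'))
     .groupBy('id')
     .pivot('pos')
     .agg(F.first('col')))

# pivot('pos') names the new columns "0", "1", ...; rename them to heating_1, heating_2, ...
pivoted.select(
    'id',
    *(F.col(c).alias(f'heating_{int(c) + 1}') for c in pivoted.columns if c != 'id')
)

Grouping by id keeps a single output row, so each pos value becomes its own column.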


shstlldc1#

If you only have 4 elements in the array, you can simply do:

from pyspark.sql import functions as F

# Pull items 0..3 out of the heating array into columns heating_1..heating_4
data_filt.select(
    "id",
    *(
        F.col("heating").getItem(i).alias(f"heating_{i+1}")
        for i in range(4)
    )
)

Increase the range if you have more elements.
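
If the number of elements varies by row, the range can also be derived from the data rather than hard-coded. A minimal sketch, assuming heating is already an array column on data_filt:

from pyspark.sql import functions as F

# The longest heating array in the data determines how many columns to create
n = data_filt.select(F.max(F.size("heating"))).first()[0]

data_filt.select(
    "id",
    *(
        F.col("heating").getItem(i).alias(f"heating_{i+1}")
        for i in range(n)
    )
)

getItem returns null past the end of a shorter array, so rows with fewer elements are simply padded with nulls.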
