I have the following DataFrame:
+--------+--------------------+--------------------+
|      id|           resoFacts|             heating|
+--------+--------------------+--------------------+
|90179090|[, [No Handicap A...|[Central Heat, Fo...|
+--------+--------------------+--------------------+
It was created with:
from pyspark.sql.functions import col

(data_filt
    .where(col('id') == '90179090')
    .withColumn('heating', col("resoFacts").getField('heating')))
I would like to create a DataFrame that expands the list in heating into sequentially named columns, like this:
+--------------+------------+----------+----------+---------+
| id |heating_1 |heating_2 | heating_3|heating_4|
+--------------+------------+----------+----------+---------+
| 90179090 |Central Heat|Forced Air| Gas |Heat Pump|
+--------------+------------+----------+----------+---------+
The furthest I have gotten produces the following DataFrame:
+---+------------+----------+----+---------+
|pos|Central Heat|Forced Air| Gas|Heat Pump|
+---+------------+----------+----+---------+
| 1| null|Forced Air|null| null|
| 3| null| null| Gas| null|
| 2| null| null|null|Heat Pump|
| 0|Central Heat| null|null| null|
+---+------------+----------+----+---------+
using this code:
from pyspark.sql.functions import col, first, posexplode

(data_filt
    .where(col('id') == '90179090')
    .withColumn('heating', col("resoFacts").getField('heating'))
    .select("heating", posexplode("heating"))
    .groupBy('pos').pivot('col').agg(first('col')))
I am probably getting the line that starts with groupBy wrong. Does anyone have an idea?
1 Answer
If the array has only 4 elements, you can simply do the following:
Increase range if you have more elements.