pysparkDataframe上的pivot字符串列

btxsgosb 于 2021-07-13 发布在 Spark

关注(0)|答案(2)|浏览(272)

我有这样一个简单的Dataframe：

rdd = sc.parallelize(
    [
        (0, "A", 223,"201603", "PORT"), 
        (0, "A", 22,"201602", "PORT"), 
        (0, "A", 422,"201601", "DOCK"), 
        (1,"B", 3213,"201602", "DOCK"), 
        (1,"B", 3213,"201601", "PORT"), 
        (2,"C", 2321,"201601", "DOCK")
    ]
)
df_data = sqlContext.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])

df_data.show()
 +---+----+----+------+----+
| id|type|cost|  date|ship|
+---+----+----+------+----+
|  0|   A| 223|201603|PORT|
|  0|   A|  22|201602|PORT|
|  0|   A| 422|201601|DOCK|
|  1|   B|3213|201602|DOCK|
|  1|   B|3213|201601|PORT|
|  2|   C|2321|201601|DOCK|
+---+----+----+------+----+

我需要按日期来调整它：

df_data.groupby(df_data.id, df_data.type).pivot("date").avg("cost").show()

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|2321.0|  null|  null|
|  0|   A| 422.0|  22.0| 223.0|
|  1|   B|3213.0|3213.0|  null|
+---+----+------+------+------+

一切正常。但现在我需要旋转它并得到一个非数字列：

df_data.groupby(df_data.id, df_data.type).pivot("date").avg("ship").show()

当然，我会得到一个例外：

AnalysisException: u'"ship" is not a numeric column. Aggregation function can only be applied on a numeric column.;'

我想在

+---+----+------+------+------+
| id|type|201601|201602|201603|
+---+----+------+------+------+
|  2|   C|DOCK  |  null|  null|
|  0|   A| DOCK |  PORT| DOCK|
|  1|   B|DOCK  |PORT  |  null|
+---+----+------+------+------+

有可能吗 pivot ?

来源：https://stackoverflow.com/questions/66330271/how-to-transfer-long-data-into-wide-data-with-strings-in-pyspark

2条答案

按热度按时间

9bfwbjaz1#

假设 (id |type | date) 组合是唯一的，您的唯一目标是旋转，而不是可以使用的聚合 first （或不限于数值的任何其他函数）：

from pyspark.sql.functions import first

(df_data
    .groupby(df_data.id, df_data.type)
    .pivot("date")
    .agg(first("ship"))
    .show())

## +---+----+------+------+------+

## | id|type|201601|201602|201603|

## +---+----+------+------+------+

## |  2|   C|  DOCK|  null|  null|

## |  0|   A|  DOCK|  PORT|  PORT|

## |  1|   B|  PORT|  DOCK|  null|

## +---+----+------+------+------+

如果这些假设不正确，你将不得不预先汇总你的数据。例如最常见的 ship 价值：

from pyspark.sql.functions import max, struct

(df_data
    .groupby("id", "type", "date", "ship")
    .count()
    .groupby("id", "type")
    .pivot("date")
    .agg(max(struct("count", "ship")))
    .show())

## +---+----+--------+--------+--------+

## | id|type|  201601|  201602|  201603|

## +---+----+--------+--------+--------+

## |  2|   C|[1,DOCK]|    null|    null|

## |  0|   A|[1,DOCK]|[1,PORT]|[1,PORT]|

## |  1|   B|[1,PORT]|[1,DOCK]|    null|

## +---+----+--------+--------+--------+

赞(0）回复(0）举报 2021-07-13

ckx4rj1h2#

在这种情况下，如果有人正在寻找sql风格的方法。

rdd = spark.sparkContext.parallelize(
    [
        (0, "A", 223,"201603", "PORT"), 
        (0, "A", 22,"201602", "PORT"), 
        (0, "A", 422,"201601", "DOCK"), 
        (1,"B", 3213,"201602", "DOCK"), 
        (1,"B", 3213,"201601", "PORT"), 
        (2,"C", 2321,"201601", "DOCK")
    ]
)
df_data = spark.createDataFrame(rdd, ["id","type", "cost", "date", "ship"])
df_data.createOrReplaceTempView("df")
df_data.show()

dt_vals=spark.sql("select collect_set(date) from df").collect()[0][0]
['201601', '201602', '201603']

dt_vals_colstr=",".join(["'" + c + "'" for c in sorted(dt_vals)])
"'201601','201602','201603'"

第1部分（注意 f 格式说明符）

spark.sql(f"""
select * from 
(select id , type, date, ship from df)
pivot (
first(ship) for date in ({dt_vals_colstr})
)
""").show(100,truncate=False)

+---+----+------+------+------+
|id |type|201601|201602|201603|
+---+----+------+------+------+
|1  |B   |PORT  |DOCK  |null  |
|2  |C   |DOCK  |null  |null  |
|0  |A   |DOCK  |PORT  |PORT  |
+---+----+------+------+------+

第二部分

spark.sql(f"""
select * from 
(select id , type, date, ship from df)
pivot (
case when count(*)=0 then null 
else struct(count(*),first(ship)) end for date in ({dt_vals_colstr})
)
""").show(100,truncate=False)

+---+----+---------+---------+---------+
|id |type|201601   |201602   |201603   |
+---+----+---------+---------+---------+
|1  |B   |[1, PORT]|[1, DOCK]|null     |
|2  |C   |[1, DOCK]|null     |null     |
|0  |A   |[1, DOCK]|[1, PORT]|[1, PORT]|
+---+----+---------+---------+---------+

赞(0）回复(0）举报 2021-07-13

我来回答

pysparkDataframe上的pivot字符串列

2条答案

相关问题

热门标签

最新问答