I have a DataFrame in the following format:
+---------+-------+------------+------------------+-------+-------+
| quarter | month | month_rank | unique_customers | units | sales |
+---------+-------+------------+------------------+-------+-------+
|       1 |     1 |          1 |               15 |    30 |  1000 |
|       1 |     2 |          2 |               20 |    35 |  1200 |
|       1 |     3 |          3 |               18 |    40 |  1500 |
|       2 |     4 |          1 |               10 |    25 |   800 |
|       2 |     5 |          2 |               25 |    50 |  2000 |
|       2 |     6 |          3 |               28 |    45 |  1800 |
+---------+-------+------------+------------------+-------+-------+
...
I am trying to group by quarter and track the monthly sales in a column-wise (wide) layout, like this:
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
| quarter | month_rank1 | rank1_unique_customers | rank1_units | rank1_sales | month_rank2 | rank2_unique_customers | rank2_units | rank2_sales | month_rank3 | rank3_unique_customers | rank3_units | rank3_sales |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
|       1 |           1 |                     15 |          30 |        1000 |           2 |                     20 |          35 |        1200 |           3 |                     18 |          40 |        1500 |
|       2 |           4 |                     10 |          25 |         800 |           5 |                     25 |          50 |        2000 |           6 |                     28 |          45 |        1800 |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
Is this possible with multiple pivots? I have had no luck creating multiple columns from a single pivot. I was thinking I might be able to achieve this with window functions, but if anyone has run into a similar problem, any suggestions would be greatly appreciated. Thank you!
1 Answer
Use `pivot` on the `month_rank` column, then `agg` the other columns. Example:
```python
df = spark.createDataFrame(
    [(1, 1, 1, 15, 30, 1000),
     (1, 2, 2, 20, 35, 1200),
     (1, 3, 3, 18, 40, 1500),
     (2, 4, 1, 10, 25, 800),
     (2, 5, 2, 25, 50, 2000),
     (2, 6, 3, 28, 45, 1800)],
    ["quarter", "month", "month_rank", "unique_customers", "units", "sales"])
df.show()
+-------+-----+----------+----------------+-----+-----+
|quarter|month|month_rank|unique_customers|units|sales|
+-------+-----+----------+----------------+-----+-----+
| 1| 1| 1| 15| 30| 1000|
| 1| 2| 2| 20| 35| 1200|
| 1| 3| 3| 18| 40| 1500|
| 2| 4| 1| 10| 25| 800|
| 2| 5| 2| 25| 50| 2000|
| 2| 6| 3| 28| 45| 1800|
+-------+-----+----------+----------------+-----+-----+
from pyspark.sql.functions import col, first

# Pivot on month_rank so each rank value gets its own set of aggregated columns.
df1 = (df.groupBy("quarter")
         .pivot("month_rank")
         .agg(first(col("month")), first(col("unique_customers")),
              first(col("units")), first(col("sales"))))

# Rename the auto-generated pivot columns to the desired names.
cols = ["quarter",
        "month_rank1", "rank1_unique_customers", "rank1_units", "rank1_sales",
        "month_rank2", "rank2_unique_customers", "rank2_units", "rank2_sales",
        "month_rank3", "rank3_unique_customers", "rank3_units", "rank3_sales"]
df1.toDF(*cols).show()
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
|quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
| 1| 1| 15| 30| 1000| 2| 20| 35| 1200| 3| 18| 40| 1500|
| 2| 4| 10| 25| 800| 5| 25| 50| 2000| 6| 28| 45| 1800|
+-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
```
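As a side note, if the set of ranks is known up front (here, ranks 1 through 3 per quarter), you can pass the values explicitly to `pivot`, which spares Spark an extra pass over the data to discover the distinct `month_rank` values. A minimal sketch, assuming the same `df` and `cols` defined above:

```python
from pyspark.sql.functions import col, first

# Same pivot as above, but with the rank values listed explicitly so Spark
# does not have to scan the data to find the distinct month_rank values.
df2 = (df.groupBy("quarter")
         .pivot("month_rank", [1, 2, 3])
         .agg(first(col("month")), first(col("unique_customers")),
              first(col("units")), first(col("sales")))
         .toDF(*cols))
df2.show()
```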