pyspark: converting month-level records to quarter-level records based on a single column

yyhrrdl8 asked on 2021-05-27 in Spark

I have a DataFrame in the following format:

+---------+-------+------------+------------------+-------+-------+
| quarter | month | month_rank | unique_customers | units | sales |
+---------+-------+------------+------------------+-------+-------+
|    1    |   1   |     1      |        15        |  30   | 1000  |
|    1    |   2   |     2      |        20        |  35   | 1200  |
|    1    |   3   |     3      |        18        |  40   | 1500  |
|    2    |   4   |     1      |        10        |  25   |  800  |
|    2    |   5   |     2      |        25        |  50   | 2000  |
|    2    |   6   |     3      |        28        |  45   | 1800  |
+---------+-------+------------+------------------+-------+-------+
...

I am trying to group by quarter and track the monthly figures as columns, like this:

+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
| quarter | month_rank1 | rank1_unique_customers | rank1_units | rank1_sales | month_rank2 | rank2_unique_customers | rank2_units | rank2_sales | month_rank3 | rank3_unique_customers | rank3_units | rank3_sales |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+
|    1    |      1      |           15           |     30      |    1000     |      2      |           20           |     35      |    1200     |      3      |           18           |     40      |    1500     |
|    2    |      4      |           10           |     25      |     800     |      5      |           25           |     50      |    2000     |      6      |           28           |     45      |    1800     |
+---------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+-------------+------------------------+-------------+-------------+

Is this achievable with multiple pivots? I have had no luck creating multiple columns from a single pivot. I was also thinking I might be able to get this result with window functions, but if anyone has run into a similar problem, any suggestions would be greatly appreciated. Thank you!

n7taea2i  1#

Pivot on the `month_rank` column, then `agg` the other columns. Example:

```
from pyspark.sql.functions import col, first

df = spark.createDataFrame(
    [(1, 1, 1, 15, 30, 1000), (1, 2, 2, 20, 35, 1200), (1, 3, 3, 18, 40, 1500),
     (2, 4, 1, 10, 25, 800), (2, 5, 2, 25, 50, 2000), (2, 6, 3, 28, 45, 1800)],
    ["quarter", "month", "month_rank", "unique_customers", "units", "sales"])

df.show()
# +-------+-----+----------+----------------+-----+-----+
# |quarter|month|month_rank|unique_customers|units|sales|
# +-------+-----+----------+----------------+-----+-----+
# |      1|    1|         1|              15|   30| 1000|
# |      1|    2|         2|              20|   35| 1200|
# |      1|    3|         3|              18|   40| 1500|
# |      2|    4|         1|              10|   25|  800|
# |      2|    5|         2|              25|   50| 2000|
# |      2|    6|         3|              28|   45| 1800|
# +-------+-----+----------+----------------+-----+-----+

# group by quarter, pivot on month_rank, and take the single value of each
# metric for every (quarter, month_rank) cell
df1 = (df
       .groupBy("quarter")
       .pivot("month_rank")
       .agg(first(col("month")),
            first(col("unique_customers")),
            first(col("units")),
            first(col("sales"))))

# rename the auto-generated pivot columns to friendlier names
cols = ["quarter",
        "month_rank1", "rank1_unique_customers", "rank1_units", "rank1_sales",
        "month_rank2", "rank2_unique_customers", "rank2_units", "rank2_sales",
        "month_rank3", "rank3_unique_customers", "rank3_units", "rank3_sales"]

df1.toDF(*cols).show()
# +-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
# |quarter|month_rank1|rank1_unique_customers|rank1_units|rank1_sales|month_rank2|rank2_unique_customers|rank2_units|rank2_sales|month_rank3|rank3_unique_customers|rank3_units|rank3_sales|
# +-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
# |      1|          1|                    15|         30|       1000|          2|                    20|         35|       1200|          3|                    18|         40|       1500|
# |      2|          4|                    10|         25|        800|          5|                    25|         50|       2000|          6|                    28|         45|       1800|
# +-------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+----------------------+-----------+-----------+
```
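If you would rather not rely on the positional order of the pivoted columns, a variation is to give each aggregation an alias; with multiple aggregations, Spark names the pivoted columns `<month_rank value>_<alias>` (e.g. `1_unique_customers`), which can then be renamed with a comprehension. The sketch below is only an illustration of that idea, reusing the `df` defined above; the `df2`/`renamed`/`ranks` names are placeholders, and you may want to check `df2.columns` on your Spark version to confirm the generated names.

```
from pyspark.sql.functions import col, first

# aliasing each aggregation makes Spark name the pivoted columns
# "<month_rank value>_<alias>", e.g. "1_unique_customers"
df2 = (df
       .groupBy("quarter")
       .pivot("month_rank")
       .agg(first("month").alias("month"),
            first("unique_customers").alias("unique_customers"),
            first("units").alias("units"),
            first("sales").alias("sales")))

# collect the rank values actually present instead of hard-coding 1..3
ranks = [r["month_rank"] for r in
         df.select("month_rank").distinct().orderBy("month_rank").collect()]

# build the target column names from the rank values and metric names
renamed = df2.select(
    "quarter",
    *[col(f"{rk}_{m}").alias(f"month_rank{rk}" if m == "month" else f"rank{rk}_{m}")
      for rk in ranks
      for m in ["month", "unique_customers", "units", "sales"]])

renamed.show()
```

The rename here is driven by the rank values and metric names rather than a hand-typed 13-element list, so it adapts if more metrics or ranks are added.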
