如何使用pyspark.sql.function.lag将列移动整整一周,而不是只移动一行?

hi3rlvi2  于 2021-07-12  发布在  Spark
关注(0)|答案(1)|浏览(335)

sales\u shift-wrong是我当前代码的输出,sales\u shift right是所需的输出:
电流(不正确)输出和期望输出的数据和示例
数据:
dateyearmonthweekitemdepartmentstorestates销售修正01/01/2020200111111592tx$149674.59null02/01/20202001111592tx$101260.73$149674.59null03/01/2020200111111592tx$119931.46$101260.73null04/01/20202001111592tx$209863.86$119931.46null05/01/2020200111111592tx$426471.36$209863.86null06/01/202020012111592tx$377,860.85$426471.36$149674.5907/01/202020012111592tx$127632.41$377860.85$101260.7308/01/202020012111592tx$80207.47$127632.41$119931.4609/01/202020012111592tx$62149.50$80207.47$209863.8610/01/202020012111592tx$67399.59$62149.50$426471.3611/01/202020012111592tx$90867.58$67399.5912/01/202020012111592tx$113,211.99$90867.5816/01/202020013111592tx$67096.92$113211.99$377860.8517/01/202020013111592tx$68440.03$67096.92$127632.4118/01/202020013111592tx$91116.59$68440.03$80207.4719/01/202020013111592tx$119986.25$91116.59$62149.5013/01/202020013111592tx$72911.67$119986.25$67399.5914/01/202020013111592tx$87,993.66美元72911.67美元90867.5815/01/202020013111592TX美元69015.89美元87993.66美元113211.99
我认为正确的解决方法( sales_shift_right )是通过使用一个窗口函数,但是,我还没有找到参数的组合来获得我想要的结果。

partitions = ['store', 'state_prov_cd', 'department', 'item', 'year', 'month', 'week']
date_col = ['week']

w = (
    Window
    .partitionBy(partitions)
    .orderBy(date_col)
  )

df_sales_shifted = (
    data
    .withColumn('sales_shifted', f.lag('sales', 1).over(w))
    .sort(partitions)
  )

有人能提出更好的方法或发现错误吗?

2guxujil

2guxujil1#

您可以再添加一列 row_number 为了便于分区:

from pyspark.sql import functions as F, Window

partitions = ['store', 'state', 'department', 'item', 'year', 'month', 'week']
partitions2 = ['store', 'state', 'department', 'item', 'rn']
date_col = ['week']

w = Window.partitionBy(partitions).orderBy(date_col)
w2 = Window.partitionBy(partitions2).orderBy(date_col)

df2 = df.withColumn(
    'rn',
    F.row_number().over(w)
).withColumn(
    'sales_shifted', 
    F.lag('sales').over(w2)
).drop('rn').orderBy('date')

df2.show()
+----------+----+-----+----+----+----------+-----+-----+------------+-------------+
|      date|year|month|week|item|department|store|state|       sales|sales_shifted|
+----------+----+-----+----+----+----------+-----+-----+------------+-------------+
|01/01/2020|2020|    1|   1|   1|         1| 1592|   TX|$ 149,674.59|         null|
|02/01/2020|2020|    1|   1|   1|         1| 1592|   TX|$ 101,260.73|         null|
|03/01/2020|2020|    1|   1|   1|         1| 1592|   TX|$ 119,931.46|         null|
|04/01/2020|2020|    1|   1|   1|         1| 1592|   TX|$ 209,863.86|         null|
|05/01/2020|2020|    1|   1|   1|         1| 1592|   TX|$ 426,471.36|         null|
|06/01/2020|2020|    1|   2|   1|         1| 1592|   TX|$ 377,860.85| $ 149,674.59|
|07/01/2020|2020|    1|   2|   1|         1| 1592|   TX|$ 127,632.41| $ 101,260.73|
|08/01/2020|2020|    1|   2|   1|         1| 1592|   TX| $ 80,207.47| $ 119,931.46|
|09/01/2020|2020|    1|   2|   1|         1| 1592|   TX| $ 62,149.50| $ 209,863.86|
|10/01/2020|2020|    1|   2|   1|         1| 1592|   TX| $ 67,399.59| $ 426,471.36|
|11/01/2020|2020|    1|   2|   1|         1| 1592|   TX| $ 90,867.58|         null|
|12/01/2020|2020|    1|   2|   1|         1| 1592|   TX|$ 113,211.99|         null|
|13/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 72,911.67|  $ 67,399.59|
|14/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 87,993.66|  $ 90,867.58|
|15/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 69,015.89| $ 113,211.99|
|16/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 67,096.92| $ 377,860.85|
|17/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 68,440.03| $ 127,632.41|
|18/01/2020|2020|    1|   3|   1|         1| 1592|   TX| $ 91,116.59|  $ 80,207.47|
|19/01/2020|2020|    1|   3|   1|         1| 1592|   TX|$ 119,986.25|  $ 62,149.50|
+----------+----+-----+----+----+----------+-----+-----+------------+-------------+

相关问题