Pyspark将 Dataframe 转换为时间序列数据,延迟2天

jaql4c8m  于 2023-02-03  发布在  Spark
关注(0)|答案(1)|浏览(99)

我有一个pyspark Dataframe 如下:

+-----+-------+-----------+------+
|page |group  |utc_date   |t     |
+-----+-------+-----------+------+
|    A|12     |2023-01-02 |0.55  |
|    A|12     |2023-01-03 |0.6   |
|    A|12     |2023-01-04 |1.97  |
|    A|12     |2023-01-05 |1.31  |
|    B|36     |2023-01-02 |0.09  |
|    B|36     |2023-01-03 |0.09  |
|    B|36     |2023-01-04 |0.09  |
|    B|36     |2023-01-05 |0.02  |
|    C|36     |2023-01-02 |0.09  |
|    C|36     |2023-01-03 |0.09  |
|    C|36     |2023-01-04 |0.09  |
|    C|36     |2023-01-05 |0.08  |
+-----+-------+-----------+------+

我想将 Dataframe 转换为滞后2天的时间序列数据集(按页和组分组):

+-----+-------+-----------+------+------+------+
|page |group  |utc_date   |t     |t-1   |t-2   |
+-----+-------+-----------+------+------+------+
|    A|12     |2023-01-02 |0.55  |null  |null  |
|    A|12     |2023-01-03 |0.6   |0.55  |null  |
|    A|12     |2023-01-04 |1.97  |0.6   |0.55  |
|    A|12     |2023-01-05 |1.31  |1.97  |0.6   |
|    B|36     |2023-01-02 |0.09  |null  |null  |
|    B|36     |2023-01-03 |0.09  |0.09  |null  |
|    B|36     |2023-01-04 |0.09  |0.09  |0.09  |
|    B|36     |2023-01-05 |0.02  |0.09  |0.09  |
|    C|36     |2023-01-02 |0.09  |null  |null  |
|    C|36     |2023-01-03 |0.09  |0.09  |null  |
|    C|36     |2023-01-04 |0.09  |0.09  |0.09  |
|    C|36     |2023-01-05 |0.08  |0.09  |0.09  |
+-----+-------+-----------+------+------+------+

在pyspark我该怎么做?

2vuwiymt

2vuwiymt1#

你应该在窗口上使用lag()函数,按页面和组排序,并按日期描述排序,例如:

window = Window.partitionBy("page", "group").orderBy("utc_date")
new_df = (
  df
  .withColumn("t-1", lag("t", 1).over(window))
  .withColumn("t-2", lag("t", 2).over(window))
)

相关问题