I have a PySpark DataFrame like this:
+----+-----+----------+----+
|page|group|  utc_date|   t|
+----+-----+----------+----+
|   A|   12|2023-01-02|0.55|
|   A|   12|2023-01-03| 0.6|
|   A|   12|2023-01-04|1.97|
|   A|   12|2023-01-05|1.31|
|   B|   36|2023-01-02|0.09|
|   B|   36|2023-01-03|0.09|
|   B|   36|2023-01-04|0.09|
|   B|   36|2023-01-05|0.02|
|   C|   36|2023-01-02|0.09|
|   C|   36|2023-01-03|0.09|
|   C|   36|2023-01-04|0.09|
|   C|   36|2023-01-05|0.08|
+----+-----+----------+----+
I want to turn this DataFrame into a time-series dataset with lags of up to 2 days (grouped by page and group):
+----+-----+----------+----+----+----+
|page|group|  utc_date|   t| t-1| t-2|
+----+-----+----------+----+----+----+
|   A|   12|2023-01-02|0.55|null|null|
|   A|   12|2023-01-03| 0.6|0.55|null|
|   A|   12|2023-01-04|1.97| 0.6|0.55|
|   A|   12|2023-01-05|1.31|1.97| 0.6|
|   B|   36|2023-01-02|0.09|null|null|
|   B|   36|2023-01-03|0.09|0.09|null|
|   B|   36|2023-01-04|0.09|0.09|0.09|
|   B|   36|2023-01-05|0.02|0.09|0.09|
|   C|   36|2023-01-02|0.09|null|null|
|   C|   36|2023-01-03|0.09|0.09|null|
|   C|   36|2023-01-04|0.09|0.09|0.09|
|   C|   36|2023-01-05|0.08|0.09|0.09|
+----+-----+----------+----+----+----+
How can I do this in PySpark?
1 Answer
You should use the lag() function over a window partitioned by page and group and ordered by date, for example:
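A minimal sketch of that approach, assuming the DataFrame shown in the question is bound to a variable named df (the variable name is an assumption; the column names are taken from the question):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumption: the DataFrame from the question is available as `df`.
# One window per (page, group) series, ordered by date ascending so that
# lag() looks back at the previous days within each series.
w = Window.partitionBy("page", "group").orderBy("utc_date")

result = (
    df.withColumn("t-1", F.lag("t", 1).over(w))
      .withColumn("t-2", F.lag("t", 2).over(w))
)

result.orderBy("page", "group", "utc_date").show()
```

Within each (page, group) partition, lag("t", 1) returns null on the first row and lag("t", 2) on the first two rows, which matches the nulls in the expected output above.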