I want to perform the operation from the following query on a DataFrame, but I don't know how to do it. Please tell me how.

Sample table:
| date | name |
| ----------------------- | --------------- |
| 2021-01-01T00:00:00Z | row1 |
| 2021-01-02T00:00:00Z | row1 |
| 2021-01-03T00:00:00Z | row1 |
| 2021-01-03T00:00:00Z | row1 |
| 2021-01-04T00:00:00Z | row1 |
| 2021-01-11T00:00:00Z | row2 |
| 2021-01-12T00:00:00Z | row2 |
```python
spark.sql('SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY date DESC) as number FROM test_tables) AS test1 WHERE number <= 3')
```
Execution result:
| date                 | name | number |
| -------------------- | ---- | ------ |
| 2021-01-03T00:00:00Z | row1 | 1      |
| 2021-01-03T00:00:00Z | row1 | 2      |
| 2021-01-04T00:00:00Z | row1 | 3      |
| 2021-01-11T00:00:00Z | row2 | 1      |
| 2021-01-12T00:00:00Z | row2 | 2      |
How can I do this with the DataFrame API?

My DataFrame attempt so far:

```python
df.withColumn("name", f.row_number().over(Window.partitionBy("name").orderBy(f.col("date").desc())))
```
1 Answer
dxpyg8gm:
You can simply use `withColumn` together with `filter`:
```python
from pyspark.sql import Window
import pyspark.sql.functions as f

window = Window.partitionBy("name").orderBy(f.col("date").desc())

df.withColumn("number", f.row_number().over(window)) \
  .filter(f.col("number") <= 3) \
  .show(truncate=False)
```

```
+--------------------+----+------+
|date                |name|number|
+--------------------+----+------+
|2021-01-12T00:00:00Z|row2|1     |
|2021-01-11T00:00:00Z|row2|2     |
|2021-01-04T00:00:00Z|row1|1     |
|2021-01-03T00:00:00Z|row1|2     |
|2021-01-03T00:00:00Z|row1|3     |
+--------------------+----+------+
```
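For intuition, the top-N-per-group logic that `ROW_NUMBER() OVER (PARTITION BY name ORDER BY date DESC)` plus the `number <= 3` filter expresses can be sketched in plain Python, with no Spark runtime, using the data from the sample table above (the grouping/sorting code here is an illustrative sketch, not Spark's actual implementation):

```python
from collections import defaultdict

# Data hard-coded from the sample table in the question.
rows = [
    ("2021-01-01T00:00:00Z", "row1"),
    ("2021-01-02T00:00:00Z", "row1"),
    ("2021-01-03T00:00:00Z", "row1"),
    ("2021-01-03T00:00:00Z", "row1"),
    ("2021-01-04T00:00:00Z", "row1"),
    ("2021-01-11T00:00:00Z", "row2"),
    ("2021-01-12T00:00:00Z", "row2"),
]

# PARTITION BY name: bucket the dates per name.
groups = defaultdict(list)
for date, name in rows:
    groups[name].append(date)

# ORDER BY date DESC within each partition, assign row numbers
# starting at 1, then keep only number <= 3.
result = []
for name, dates in groups.items():
    for number, date in enumerate(sorted(dates, reverse=True), start=1):
        if number <= 3:
            result.append((date, name, number))
```

This yields the same five rows as the Spark output: the three most recent dates for `row1` and both dates for `row2`, each tagged with its per-group rank.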