How do I express a Spark SQL subquery with DataFrame operations?

vaqhlq81 · posted 2021-07-09 in Spark

I want to express the query below using DataFrame operations, but I don't know how to do it, so please tell me.
Sample table:

| date                    | name            |
| ----------------------- | --------------- |
| 2021-01-01T00:00:00Z    | row1            |
| 2021-01-02T00:00:00Z    | row1            |
| 2021-01-03T00:00:00Z    | row1            |
| 2021-01-03T00:00:00Z    | row1            |
| 2021-01-04T00:00:00Z    | row1            |
| 2021-01-11T00:00:00Z    | row2            |
| 2021-01-12T00:00:00Z    | row2            |
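
For context, a minimal sketch of how this sample data could be built and registered as the `test_tables` temp view that the query below reads (the string column types are an assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the sample table shown above and expose it to Spark SQL
data = [
    ("2021-01-01T00:00:00Z", "row1"),
    ("2021-01-02T00:00:00Z", "row1"),
    ("2021-01-03T00:00:00Z", "row1"),
    ("2021-01-03T00:00:00Z", "row1"),
    ("2021-01-04T00:00:00Z", "row1"),
    ("2021-01-11T00:00:00Z", "row2"),
    ("2021-01-12T00:00:00Z", "row2"),
]
df = spark.createDataFrame(data, ["date", "name"])
df.createOrReplaceTempView("test_tables")
```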
```python
spark.sql("""
    SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY name ORDER BY date DESC) AS number
        FROM test_tables
    ) AS test1
    WHERE number <= 3
""")
```

Execution result:

| date                 | name | number |
| -------------------- | ---- | ------ |
| 2021-01-03T00:00:00Z | row1 | 1      |
| 2021-01-03T00:00:00Z | row1 | 2      |
| 2021-01-04T00:00:00Z | row1 | 3      |
| 2021-01-11T00:00:00Z | row2 | 1      |
| 2021-01-12T00:00:00Z | row2 | 2      |
How can I do this with DataFrame operations?
Here is what I have tried so far:

df.withColumn("name",f.row_number().over(Window.partitionBy("name").orderBy(f.col("date").desc()))

dxpyg8gm · answer 1

You can simply use `withColumn` together with `filter`:

```python
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Number the rows within each name partition, newest date first
window = Window.partitionBy("name").orderBy(f.col("date").desc())

# Keep only the top 3 rows per name
(df.withColumn("number", f.row_number().over(window))
    .filter(f.col("number") <= 3)
    .show(truncate=False))
```

Result:

```
+--------------------+----+------+
|date                |name|number|
+--------------------+----+------+
|2021-01-12T00:00:00Z|row2|1     |
|2021-01-11T00:00:00Z|row2|2     |
|2021-01-04T00:00:00Z|row1|1     |
|2021-01-03T00:00:00Z|row1|2     |
|2021-01-03T00:00:00Z|row1|3     |
+--------------------+----+------+
```
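
A side note that is not part of the original answer: since `row1` has two rows with the same date (2021-01-03), `row_number` breaks that tie in an arbitrary order. If tied dates should share a position instead, `dense_rank` (or `rank`) could be swapped in. A sketch, reusing `df`, `f`, and `window` from the snippet above:

```python
# Assumption: tied dates should share a rank rather than be ordered arbitrarily
(df.withColumn("number", f.dense_rank().over(window))
    .filter(f.col("number") <= 3)
    .show(truncate=False))
```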
