Implementing a window function?

sirbozc5  posted on 2021-06-28 in Hive

I am trying to implement the following solution using a window function.
I have the following data:

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|            14470.0000|2015-07-14 09:54:12|                                 
|        1086|             1570.0000|2015-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
+------------+----------------------+-------------------+

I am trying to run the window function like this:

WindowSpec window = Window.partitionBy(df.col("id")).orderBy(df.col("eventdate").desc());
df.select(df.col("*"),rank().over(window).alias("rank")) //error for this line
         .filter("rank <= 2")
         .show();

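What I want is the last two entries for each user, meaning the two with the most recent dates. Because the window is ordered by date descending, these are the first two rows of each partition: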

+------------+----------------------+-------------------+                                 
|increment_id|base_subtotal_incl_tax|          eventdate|                                 
+------------+----------------------+-------------------+                                 
|        1086|            14470.0000|2016-06-14 09:54:12|                                 
|        1086|            14470.0000|2016-06-14 09:54:12|   
|        5555|            14470.0000|2014-07-14 09:54:12|                                 
|        5555|            14470.0000|2014-07-14 09:54:12|                                     
+------------+----------------------+-------------------+

But this is what I get:

+------------+----------------------+-------------------+----+
|increment_id|base_subtotal_incl_tax|          eventdate|rank|                            
+------------+----------------------+-------------------+----+                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        5555|            14470.0000|2014-07-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
|        1086|            14470.0000|2016-06-14 09:54:12|   1|                            
+------------+----------------------+-------------------+----+

What am I missing?

mkh04yzy1#

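All the values in the ordering column are equal within a partition, so all of those rows get the same rank. Try row_number, which gives every row in the partition its own sequential number even when the ordering values tie: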

df.select(df.col("*"),row_number().over(window).alias("rank"))
     .filter("rank <= 2")
     .show();
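
To see why the two functions behave differently on the question's data, here is a minimal plain-Java sketch of their numbering rules. It does not use Spark: the class `RankVsRowNumber` and its methods are hypothetical helpers written only for illustration, and they assume the input list is one partition that has already been sorted (here, eventdate descending).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Spark.
final class RankVsRowNumber {

    // SQL-style RANK(): tied values share a rank, and the next distinct
    // value gets its 1-based position, so ranks can skip numbers.
    static List<Integer> rank(List<String> sortedValues) {
        List<Integer> ranks = new ArrayList<>();
        for (int i = 0; i < sortedValues.size(); i++) {
            boolean tiedWithPrevious =
                i > 0 && sortedValues.get(i).equals(sortedValues.get(i - 1));
            ranks.add(tiedWithPrevious ? ranks.get(i - 1) : i + 1);
        }
        return ranks;
    }

    // SQL-style ROW_NUMBER(): every row gets its 1-based position,
    // regardless of ties.
    static List<Integer> rowNumber(List<String> sortedValues) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 0; i < sortedValues.size(); i++) {
            numbers.add(i + 1);
        }
        return numbers;
    }

    public static void main(String[] args) {
        // eventdate values for increment_id 5555 from the question, newest first.
        List<String> dates = List.of(
            "2014-07-14 09:54:12", "2014-07-14 09:54:12",
            "2014-07-14 09:54:12", "2014-07-14 09:54:12");

        System.out.println(rank(dates));      // [1, 1, 1, 1]
        System.out.println(rowNumber(dates)); // [1, 2, 3, 4]
    }
}
```

With rank, all four rows for 5555 pass the `rank <= 2` filter, which matches the output in the question. With row_number only two rows pass. Note that when the ordering values tie, which of the tied rows receive numbers 1 and 2 is arbitrary; that is harmless here because the tied rows are identical.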
