I have data like this:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2013|
| 1| b| 2014| 2015|
| 1| a| 2016| 2017|
| 1| a| 2018| 2019|
+---+----+----------+--------+
I want to merge consecutive periods that have the same hash value, so the result should look like this:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2013|
| 1| b| 2014| 2015|
| 1| a| 2016| 2019|
+---+----+----------+--------+
(the last two rows are merged into a single period)
I tried a query like this:
%sql
select distinct
    id,
    hash,
    min(start_date) over (partition by hash) as start_date,
    max(end_date) over (partition by hash) as end_date
from (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
    union
    select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
    union
    select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
    union
    select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
) t
The result is:
+---+----+----------+--------+
| id|hash|start_date|end_date|
+---+----+----------+--------+
| 1| a| 2012| 2019|
| 1| b| 2014| 2015|
+---+----+----------+--------+
This is wrong, because the 2012-2013 and 2016-2019 periods should stay separate.
How can I get the correct result using Spark SQL?
2 Answers
izj3ouym1#
This is a gaps-and-islands problem. The simplest method is the difference of row numbers. If there are no gaps, this will work:
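A minimal sketch of that idea in Spark SQL, inlining the sample data from the question as a CTE (the name periods is my own):

%sql
with periods as (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
    union all
    select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
    union all
    select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
    union all
    select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
)
select id, hash,
       min(start_date) as start_date,
       max(end_date) as end_date
from (
    -- the difference of the two row numbers stays constant within each
    -- consecutive run of rows that share the same hash, so it identifies the island
    select p.*,
           row_number() over (partition by id order by start_date)
             - row_number() over (partition by id, hash order by start_date) as grp
    from periods p
) g
group by id, hash, grp
order by id, start_date

For the sample data this yields the three rows shown in the question: (1, a, 2012, 2013), (1, b, 2014, 2015) and (1, a, 2016, 2019).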
h79rfbju2#
This is a gaps-and-islands problem. Here is one approach that uses lag() and a window sum to define the groups. The advantage of this approach is that it assigns period ids across different time series at the same time. Consider:
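A sketch of that approach, again with the question's sample data inlined as a hypothetical periods CTE: lag() flags every row whose hash differs from the previous row's, and a running sum over those flags turns the flags into a group id per island.

%sql
with periods as (
    select 1 as id, 'a' as hash, 2012 as start_date, 2013 as end_date
    union all
    select 1 as id, 'b' as hash, 2014 as start_date, 2015 as end_date
    union all
    select 1 as id, 'a' as hash, 2016 as start_date, 2017 as end_date
    union all
    select 1 as id, 'a' as hash, 2018 as start_date, 2019 as end_date
)
select id, hash,
       min(start_date) as start_date,
       max(end_date) as end_date
from (
    -- running count of hash changes = island id within each id
    select l.*,
           sum(case when prev_hash is null or prev_hash <> hash then 1 else 0 end)
             over (partition by id order by start_date) as grp
    from (
        -- hash of the immediately preceding period for the same id
        select p.*,
               lag(hash) over (partition by id order by start_date) as prev_hash
        from periods p
    ) l
) g
group by id, hash, grp
order by id, start_date

Because both window functions partition by id, islands are numbered independently for every id, which is what lets this handle several time series in one pass.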