sql,用于删除重复(连续)的记录,但在开始日期中存储最小日期,在结束日期中存储最大日期

w6mmgewl  于 2021-07-24  发布在  Java
关注(0)|答案(2)|浏览(326)

我在示例表中输入了以下数据:

S_ID  C_ID  E_ID   ST_DT        ED_DT
100   A     11AS   01/01/2020   05/01/2020
100   A     11AS   06/01/2020   10/01/2020
100   A     11AS   11/01/2020   15/01/2020
100   A     11BT   16/01/2020   20/01/2020
100   A     11AS   21/01/2020   27/01/2020
100   A     11AS   28/01/2020   30/01/2020

下表中的预期产出:

S_ID  C_ID  E_ID   ST_DT        ED_DT
100   A     11AS   01/01/2020   15/01/2020
100   A     11BT   16/01/2020   20/01/2020
100   A     11AS   21/01/2020   30/01/2020

数据库:netezza注意:这些是来自数据的示例记录。表中还有其他e\u id。
谢谢

vql8enpb

vql8enpb1#

这是一个缺口和孤岛问题。假设没有间隙,一个简单的方法是行号的差异:

select s_id, c_id, e_id, min(st_dt), max(ed_dt)
from (select t.*,
             row_number() over (partition by s_id, c_id order by st_dt) as seqnum,
             row_number() over (partition by s_id, c_id, e_id order by st_dt) as seqnum_2
      from t
     ) t
group by s_id, c_id, e_id, (seqnum - seqnum_2);
pgpifvop

pgpifvop2#

这实际上是一个缺口和孤岛问题。这些岛屿,如@gordon linoff所说,在clickstream analysis和iot data analysis中也被称为会话。
我将得到一个会话标识符,并在最后按它分组。
嵌套完整选择,每个包含不同的olap函数,以获得的会话id的组结尾,应该可以做到以下几点:

WITH
-- your input ...
input(s_id,c_id,e_id,st_dt,ed_dt) AS (
          SELECT 100  ,'A' , '11AS',DATE '2020-01-01', DATE '2020-01-05'
UNION ALL SELECT 100  ,'A' , '11AS',DATE '2020-01-06', DATE '2020-01-10'
UNION ALL SELECT 100  ,'A' , '11AS',DATE '2020-01-11', DATE '2020-01-15'
UNION ALL SELECT 100  ,'A' , '11BT',DATE '2020-01-16', DATE '2020-01-20'
UNION ALL SELECT 100  ,'A' , '11AS',DATE '2020-01-21', DATE '2020-01-27'
UNION ALL SELECT 100  ,'A' , '11AS',DATE '2020-01-28', DATE '2020-01-30'
)

--当 e_id 不改变,如果改变了。。。

,
with_chg AS (
  SELECT
    *
  , CASE WHEN NVL(LAG(e_id) OVER(ORDER BY st_dt),'') <> e_id THEN 1 ELSE 0 END AS chg
  from input
)
-- SELECT * FROM with_chg; -- check query ....
-- out  s_id | c_id | e_id |   st_dt    |   ed_dt    | chg 
-- out ------+------+------+------------+------------+-----
-- out   100 | A    | 11AS | 2020-01-01 | 2020-01-05 |   1
-- out   100 | A    | 11AS | 2020-01-06 | 2020-01-10 |   0
-- out   100 | A    | 11AS | 2020-01-11 | 2020-01-15 |   0
-- out   100 | A    | 11BT | 2020-01-16 | 2020-01-20 |   1
-- out   100 | A    | 11AS | 2020-01-21 | 2020-01-27 |   1
-- out   100 | A    | 11AS | 2020-01-28 | 2020-01-30 |   0

--获取刚刚获得的列的运行和 chg ,并且您有一个会话标识符。。。

,
with_session AS (
  SELECT
    s_id
  , c_id
  , e_id
  , st_dt
  , ed_dt
  , SUM(chg) OVER(ORDER BY st_dt) AS session
  FROM with_chg
)
-- SELECT * FROM with_session; -- test query ...
-- out  s_id | c_id | e_id |   st_dt    |   ed_dt    | session 
-- out ------+------+------+------------+------------+---------
-- out   100 | A    | 11AS | 2020-01-01 | 2020-01-05 |       1
-- out   100 | A    | 11AS | 2020-01-06 | 2020-01-10 |       1
-- out   100 | A    | 11AS | 2020-01-11 | 2020-01-15 |       1
-- out   100 | A    | 11BT | 2020-01-16 | 2020-01-20 |       2
-- out   100 | A    | 11AS | 2020-01-21 | 2020-01-27 |       3
-- out   100 | A    | 11AS | 2020-01-28 | 2020-01-30 |       3

--现在,最后,按suid、cuid、eèid和会话分组,得到min(stèdt)和max(stèdt)。。。

SELECT
  s_id
, c_id
, e_id
, MIN(st_dt) AS st_dt
, MAX(ed_dt) AS ed_dt
FROM with_session
GROUP BY
  s_id
, c_id
, e_id
, session
ORDER BY 4
;
-- out  s_id | c_id | e_id |   st_dt    |   ed_dt    
-- out ------+------+------+------------+------------
-- out   100 | A    | 11AS | 2020-01-01 | 2020-01-15
-- out   100 | A    | 11BT | 2020-01-16 | 2020-01-20
-- out   100 | A    | 11AS | 2020-01-21 | 2020-01-30

相关问题