mysql 如何以连续的方式划分活动数据-间隙和孤岛问题

iecba09b  于 2023-03-11  发布在  Mysql
关注(0)|答案(1)|浏览(116)

I have an activity table called myTable, the data looks like this:

CREATE TABLE myTable
(
    userid      text,
    webid         text,
    "ts" timestamp
);

INSERT INTO myTable
    ("userid", "webid", "ts")
VALUES ('1', 'A', '34', '2023-01-31 16:34:49.000'),
       ('2', 'A', '73', '2023-01-31 16:34:50.000'),
       ('3', 'A', '97', '2023-01-31 16:34:58.000'),
       ('4', 'A', '17', '2023-01-31 17:35:02.000'),
       ('5', 'A', '17', '2023-01-31 17:35:07.000'),
       ('6', 'A', '17', '2023-01-31 17:35:18.000'),
       ('7', 'A', '17', '2023-01-31 17:35:30.000'),
       ('8', 'A', '1', '2023-01-31 17:35:37.000'),
       ('9', 'A', '1', '2023-01-31 17:35:38.000'),
       ('10', 'A', '77', '2023-01-31 17:35:41.000'),
       ('11', 'A', '77', '2023-01-31 17:35:42.000'),
       ('12', 'A', '15', '2023-01-31 17:35:42.000'),
       ('13', 'A', '44', '2023-01-31 17:35:42.000'),
       ('14', 'A', '1', '2023-01-31 17:35:42.000'),
       ('15', 'A', '77', '2023-01-31 17:35:45.000'),
       ('16', 'A', '44', '2023-01-31 17:35:45.000'),
       ('17', 'A', '1', '2023-01-31 17:37:10.000'),
       ('18', 'A', '1', '2023-01-31 17:37:12.000'),
       ('19', 'A', '77', '2023-01-31 17:37:14.000'),
       ('20', 'A', '77', '2023-01-31 17:52:14.000'),
       ('21', 'A', '77', '2023-01-31 18:12:14.000'),
       ('22', 'A', '77', '2023-01-31 18:45:14.000'),
       ('23', 'A', '77', '2023-01-31 18:55:15.000'),
       ('1', 'B', '33', '2023-01-31 06:37:15.000'),
       ('2', 'B', '56', '2023-01-31 06:40:15.000')
  ;
rowiduseridwebidts
1A342023-01-31 16:34:49
2A732023-01-31 16:34:50
3A972023-01-31 16:34:58
4A172023-01-31 17:35:02
5A172023-01-31 17:35:07
6A172023-01-31 17:35:18
7A172023-01-31 17:35:30
8A12023-01-31 17:35:37
9A12023-01-31 17:35:38
10A772023-01-31 17:35:41
11A772023-01-31 17:35:42
12A152023-01-31 17:35:42
13A442023-01-31 17:35:42
14A12023-01-31 17:35:42
15A772023-01-31 17:35:45
16A442023-01-31 17:35:45
17A12023-01-31 17:37:10
18A12023-01-31 17:37:12
19A772023-01-31 17:37:14
20A772023-01-31 17:52:14
21A772023-01-31 18:12:14
22A772023-01-31 18:45:14
23A772023-01-31 18:55:15
1B332023-01-31 06:37:15
2B562023-01-31 06:40:15

The output I want to return as following:
| rowid | userid | webid | session_id | ts | first_ts |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| 1 | A | 34 | 1 | 2023-01-31 16:34:49 | 2023-01-31 16:34:49 |
| 2 | A | 73 | 2 | 2023-01-31 16:34:50 | 2023-01-31 16:34:50 |
| 3 | A | 97 | 3 | 2023-01-31 16:34:58 | 2023-01-31 16:34:58 |
| 4 | A | 17 | 4 | 2023-01-31 17:35:02 | 2023-01-31 17:35:02 |
| 5 | A | 17 | 4 | 2023-01-31 17:35:07 | 2023-01-31 17:35:02 |
| 6 | A | 17 | 4 | 2023-01-31 17:35:18 | 2023-01-31 17:35:02 |
| 7 | A | 17 | 4 | 2023-01-31 17:35:30 | 2023-01-31 17:35:02 |
| 8 | A | 1 | 5 | 2023-01-31 17:35:37 | 2023-01-31 17:35:37 |
| 9 | A | 1 | 5 | 2023-01-31 17:35:38 | 2023-01-31 17:35:37 |
| 10 | A | 77 | 6 | 2023-01-31 17:35:41 | 2023-01-31 17:35:41 |
| 11 | A | 77 | 6 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 12 | A | 15 | 7 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 13 | A | 44 | 8 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 14 | A | 1 | 9 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 15 | A | 77 | 10 | 2023-01-31 17:35:45 | 2023-01-31 17:35:45 |
| 16 | A | 44 | 11 | 2023-01-31 17:35:45 | 2023-01-31 17:35:45 |
| 17 | A | 1 | 12 | 2023-01-31 17:37:10 | 2023-01-31 17:37:10 |
| 18 | A | 1 | 12 | 2023-01-31 17:37:12 | 2023-01-31 17:37:10 |
| 19 | A | 77 | 13 | 2023-01-31 17:37:14 | 2023-01-31 17:37:14 |
| 20 | A | 77 | 13 | 2023-01-31 17:52:14 | 2023-01-31 17:37:14 |
| 21 | A | 77 | 13 | 2023-01-31 18:12:14 | 2023-01-31 17:37:14 |
| 22 | A | 77 | 14 | 2023-01-31 18:45:14 | 2023-01-31 18:45:14 |
| 23 | A | 77 | 14 | 2023-01-31 18:55:15 | 2023-01-31 18:45:14 |
| 1 | B | 33 | 1 | 2023-01-31 06:37:15 | 2023-01-31 06:37:15 |
| 2 | B | 56 | 2 | 2023-01-31 06:40:15 | 2023-01-31 06:40:15 |
first_ts means first timestamp. The logic for first_ts is If a userid visits same webid at consecutive timestamp and each timestamp's interval falls within 30 minutes, these will be recorded as a session. For example, row 4,5,6,7. Each time interval are less than 30 minutes, their first_ts is the timestamp of the first event, which is 2023-01-31 17:35:02.
If a userid visits a webid and jump to another webid and return back to the first webid, the first_ts will be refreshed to current timestamp. For example, row 8,9,10,11,12,13,14,15,16,17,18. userid A visits webid=1 at first, but jump to webid=77,15,44, after return back to webid=1, then row 12, row 13 and row 14 should be recorded as a new session with current first_ts.
If a userid visits a webid at consecutive timestamps, and one timestamp interval is greater than 30 minutes, it will also break the session and the first_ts will be refreshed as well. For example, row 19,20,21,22,23. Between row 21 and row 22, their time interval are over 30 minutes, so for row 22 and row 23, their first_ts refreshed to current timestamp and they are to be a new session.
I share my script here, the script includes rowid for view easily and it seems like it return correct first_ts, but session_id is wrong.
SQL version:

create temp table lags as
select *, lag(webid,1,webid) over(partition by userid order by ts) lag_webid,
lag(ts,1,ts) over(partition by userid order by ts) lag_ts 
from mytable;

create temp table is_sessions as 
select *,
case when webid!=lag_webid or abs(datediff(minutes,ts,lag_ts)) > 30 then 1
     else 0
end as is_session
from lags;

create temp table sessions as 
select *,
sum(is_session) over(partition by userid order by ts)+1 as session_id
from is_sessions;

create temp table sessions_ts as 
select *,
min(ts) over(partition by userid, session_id order by ts) as first_ts
from sessions;

Equivalent PostgreSQL version:

select rowid, userid, webid, session_id, ts, min(ts) over(partition by userid, session_id order by ts) first_ts
from (
    select t.*,
        1 + count(*) 
            filter(where webid != lag_webid or ts > lag_ts + interval '30' minute) 
            over(partition by userid order by ts) session_id
    from (
        select t.*, 
            lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
            lag(ts, 1, ts)       over(partition by userid order by ts) lag_ts
        from mytable t
    ) t
) t

Anyone has a better solution to do the task? Thanks!
script reference: https://dbfiddle.uk/-c3ZNO65

qfe3c7zg

qfe3c7zg1#

这在MySQL中运行良好。您只需要将filter替换为条件计数。您的错误是由于分区中的多个行具有相同的时间戳,并且没有指定任何内容来打破死锁。将rowid添加到框架定义中的ORDER BY子句可以在排序时区分行。

SELECT rowid, userid, webid, session_id, ts,
    MIN(ts) OVER (PARTITION BY userid, session_id ORDER BY ts, rowid) first_ts
FROM (
    SELECT *,
        1 + COUNT(CASE WHEN webid <> prev_webid OR ts > prev_ts + INTERVAL '30' MINUTE THEN 1 END)
                OVER (PARTITION BY userid ORDER BY ts, rowid) session_id
    FROM (
        SELECT *,
            LAG(webid) OVER (PARTITION BY userid ORDER BY ts, rowid) prev_webid,
            LAG(ts)    OVER (PARTITION BY userid ORDER BY ts, rowid) prev_ts
        FROM myTable
    ) t
) t

这是您更新的db<>fiddle

相关问题