I have an activity table called myTable, the data looks like this:
CREATE TABLE myTable
(
userid text,
webid text,
"ts" timestamp
);
INSERT INTO myTable
("userid", "webid", "ts")
VALUES ('1', 'A', '34', '2023-01-31 16:34:49.000'),
('2', 'A', '73', '2023-01-31 16:34:50.000'),
('3', 'A', '97', '2023-01-31 16:34:58.000'),
('4', 'A', '17', '2023-01-31 17:35:02.000'),
('5', 'A', '17', '2023-01-31 17:35:07.000'),
('6', 'A', '17', '2023-01-31 17:35:18.000'),
('7', 'A', '17', '2023-01-31 17:35:30.000'),
('8', 'A', '1', '2023-01-31 17:35:37.000'),
('9', 'A', '1', '2023-01-31 17:35:38.000'),
('10', 'A', '77', '2023-01-31 17:35:41.000'),
('11', 'A', '77', '2023-01-31 17:35:42.000'),
('12', 'A', '15', '2023-01-31 17:35:42.000'),
('13', 'A', '44', '2023-01-31 17:35:42.000'),
('14', 'A', '1', '2023-01-31 17:35:42.000'),
('15', 'A', '77', '2023-01-31 17:35:45.000'),
('16', 'A', '44', '2023-01-31 17:35:45.000'),
('17', 'A', '1', '2023-01-31 17:37:10.000'),
('18', 'A', '1', '2023-01-31 17:37:12.000'),
('19', 'A', '77', '2023-01-31 17:37:14.000'),
('20', 'A', '77', '2023-01-31 17:52:14.000'),
('21', 'A', '77', '2023-01-31 18:12:14.000'),
('22', 'A', '77', '2023-01-31 18:45:14.000'),
('23', 'A', '77', '2023-01-31 18:55:15.000'),
('1', 'B', '33', '2023-01-31 06:37:15.000'),
('2', 'B', '56', '2023-01-31 06:40:15.000')
;
rowid | userid | webid | ts |
---|---|---|---|
1 | A | 34 | 2023-01-31 16:34:49 |
2 | A | 73 | 2023-01-31 16:34:50 |
3 | A | 97 | 2023-01-31 16:34:58 |
4 | A | 17 | 2023-01-31 17:35:02 |
5 | A | 17 | 2023-01-31 17:35:07 |
6 | A | 17 | 2023-01-31 17:35:18 |
7 | A | 17 | 2023-01-31 17:35:30 |
8 | A | 1 | 2023-01-31 17:35:37 |
9 | A | 1 | 2023-01-31 17:35:38 |
10 | A | 77 | 2023-01-31 17:35:41 |
11 | A | 77 | 2023-01-31 17:35:42 |
12 | A | 15 | 2023-01-31 17:35:42 |
13 | A | 44 | 2023-01-31 17:35:42 |
14 | A | 1 | 2023-01-31 17:35:42 |
15 | A | 77 | 2023-01-31 17:35:45 |
16 | A | 44 | 2023-01-31 17:35:45 |
17 | A | 1 | 2023-01-31 17:37:10 |
18 | A | 1 | 2023-01-31 17:37:12 |
19 | A | 77 | 2023-01-31 17:37:14 |
20 | A | 77 | 2023-01-31 17:52:14 |
21 | A | 77 | 2023-01-31 18:12:14 |
22 | A | 77 | 2023-01-31 18:45:14 |
23 | A | 77 | 2023-01-31 18:55:15 |
1 | B | 33 | 2023-01-31 06:37:15 |
2 | B | 56 | 2023-01-31 06:40:15 |
The output I want to return as following:
| rowid | userid | webid | session_id | ts | first_ts |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |
| 1 | A | 34 | 1 | 2023-01-31 16:34:49 | 2023-01-31 16:34:49 |
| 2 | A | 73 | 2 | 2023-01-31 16:34:50 | 2023-01-31 16:34:50 |
| 3 | A | 97 | 3 | 2023-01-31 16:34:58 | 2023-01-31 16:34:58 |
| 4 | A | 17 | 4 | 2023-01-31 17:35:02 | 2023-01-31 17:35:02 |
| 5 | A | 17 | 4 | 2023-01-31 17:35:07 | 2023-01-31 17:35:02 |
| 6 | A | 17 | 4 | 2023-01-31 17:35:18 | 2023-01-31 17:35:02 |
| 7 | A | 17 | 4 | 2023-01-31 17:35:30 | 2023-01-31 17:35:02 |
| 8 | A | 1 | 5 | 2023-01-31 17:35:37 | 2023-01-31 17:35:37 |
| 9 | A | 1 | 5 | 2023-01-31 17:35:38 | 2023-01-31 17:35:37 |
| 10 | A | 77 | 6 | 2023-01-31 17:35:41 | 2023-01-31 17:35:41 |
| 11 | A | 77 | 6 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 12 | A | 15 | 7 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 13 | A | 44 | 8 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 14 | A | 1 | 9 | 2023-01-31 17:35:42 | 2023-01-31 17:35:42 |
| 15 | A | 77 | 10 | 2023-01-31 17:35:45 | 2023-01-31 17:35:45 |
| 16 | A | 44 | 11 | 2023-01-31 17:35:45 | 2023-01-31 17:35:45 |
| 17 | A | 1 | 12 | 2023-01-31 17:37:10 | 2023-01-31 17:37:10 |
| 18 | A | 1 | 12 | 2023-01-31 17:37:12 | 2023-01-31 17:37:10 |
| 19 | A | 77 | 13 | 2023-01-31 17:37:14 | 2023-01-31 17:37:14 |
| 20 | A | 77 | 13 | 2023-01-31 17:52:14 | 2023-01-31 17:37:14 |
| 21 | A | 77 | 13 | 2023-01-31 18:12:14 | 2023-01-31 17:37:14 |
| 22 | A | 77 | 14 | 2023-01-31 18:45:14 | 2023-01-31 18:45:14 |
| 23 | A | 77 | 14 | 2023-01-31 18:55:15 | 2023-01-31 18:45:14 |
| 1 | B | 33 | 1 | 2023-01-31 06:37:15 | 2023-01-31 06:37:15 |
| 2 | B | 56 | 2 | 2023-01-31 06:40:15 | 2023-01-31 06:40:15 |
first_ts means first timestamp. The logic for first_ts is If a userid visits same webid at consecutive timestamp and each timestamp's interval falls within 30 minutes, these will be recorded as a session. For example, row 4,5,6,7. Each time interval are less than 30 minutes, their first_ts is the timestamp of the first event, which is 2023-01-31 17:35:02.
If a userid visits a webid and jump to another webid and return back to the first webid, the first_ts will be refreshed to current timestamp. For example, row 8,9,10,11,12,13,14,15,16,17,18. userid A visits webid=1 at first, but jump to webid=77,15,44, after return back to webid=1, then row 12, row 13 and row 14 should be recorded as a new session with current first_ts.
If a userid visits a webid at consecutive timestamps, and one timestamp interval is greater than 30 minutes, it will also break the session and the first_ts will be refreshed as well. For example, row 19,20,21,22,23. Between row 21 and row 22, their time interval are over 30 minutes, so for row 22 and row 23, their first_ts refreshed to current timestamp and they are to be a new session.
I share my script here, the script includes rowid for view easily and it seems like it return correct first_ts, but session_id is wrong.
SQL version:
create temp table lags as
select *, lag(webid,1,webid) over(partition by userid order by ts) lag_webid,
lag(ts,1,ts) over(partition by userid order by ts) lag_ts
from mytable;
create temp table is_sessions as
select *,
case when webid!=lag_webid or abs(datediff(minutes,ts,lag_ts)) > 30 then 1
else 0
end as is_session
from lags;
create temp table sessions as
select *,
sum(is_session) over(partition by userid order by ts)+1 as session_id
from is_sessions;
create temp table sessions_ts as
select *,
min(ts) over(partition by userid, session_id order by ts) as first_ts
from sessions;
Equivalent PostgreSQL version:
select rowid, userid, webid, session_id, ts, min(ts) over(partition by userid, session_id order by ts) first_ts
from (
select t.*,
1 + count(*)
filter(where webid != lag_webid or ts > lag_ts + interval '30' minute)
over(partition by userid order by ts) session_id
from (
select t.*,
lag(webid, 1, webid) over(partition by userid order by ts) lag_webid,
lag(ts, 1, ts) over(partition by userid order by ts) lag_ts
from mytable t
) t
) t
Anyone has a better solution to do the task? Thanks!
script reference: https://dbfiddle.uk/-c3ZNO65
1条答案
按热度按时间qfe3c7zg1#
这在MySQL中运行良好。您只需要将
filter
替换为条件计数。您的错误是由于分区中的多个行具有相同的时间戳,并且没有指定任何内容来打破死锁。将rowid
添加到框架定义中的ORDER BY
子句可以在排序时区分行。这是您更新的db<>fiddle