如何根据子组中的其他值对子组中的记录进行有效分类?

fae0ux8s  于 2021-07-26  发布在  Java
关注(0)|答案(2)|浏览(313)

我有一张table看起来像下面。记录按用户id和事件时间排序。

Row    User_ID    Event_Time    Event_Type    
1      1          2020-01-01    View
2      1          2020-01-02    Click
3      1          2020-01-03    Purchase
4      2          2020-02-01    View
5      2          2020-02-02    Click
6      2          2020-02-03    View
7      2          2020-02-04    Purchase
8      2          2020-02-11    View
9      2          2020-02-12    Purchase
10     2          2020-02-21    View
11     2          2020-02-22    Click
12     2          2020-02-23    Purchase
13     2          2020-02-27    View
14     2          2020-02-28    Click
15     3          2020-03-01    View
16     3          2020-03-02    Purchase
...

我想添加一个名为path的新列来对非购买事件进行分类。一个用户的每个非购买事件都“属于”同一个用户紧接着发生的即时购买事件,这意味着它们可以被视为一个子组。在每个分组中:
第一个非购买事件是介绍人(第1、4、10行)
最后一个非购买事件更接近(第2、6、11行)
介绍人和关系密切者之间的所有非购买事件都是影响者(第5行)
如果一个采购事件只有一个与之分组的非采购事件,则该非采购事件仅为(第8、15行)
填写采购事件为空(第3、7、9、12、16行)
如果非采购事件不属于任何采购事件(第13、14行),则填充null
因此,在添加列之后,表应该如下所示:

Row    User_ID    Event_Time    Event_Type    Path
1      1          2020-01-01    View          Introducer
2      1          2020-01-02    Click         Closer
3      1          2020-01-03    Purchase      NULL
4      2          2020-02-01    View          Introducer
5      2          2020-02-02    Click         Influencer
6      2          2020-02-03    View          Closer
7      2          2020-02-04    Purchase      NULL
8      2          2020-02-11    View          Only
9      2          2020-02-12    Purchase      NULL
10     2          2020-02-21    View          Introducer
11     2          2020-02-22    Click         Closer
12     2          2020-02-23    Purchase      NULL
13     2          2020-02-27    View          NULL
14     2          2020-02-28    Click         NULL
15     3          2020-03-01    View          Only
16     3          2020-03-02    Purchase      NULL
...

如果我自己加入并添加一个新的列来帮助识别每个事件用户最后一次购买的时间,那么解决方案就很简单了。但是,我有超过1亿条记录,而且self-join不够有效。行刑最终会超时。所以我的问题是,有没有更有效的方法来添加这个新专栏?我在考虑使用相关查询,但似乎无法将我的头围绕它。

vzgqcmou

vzgqcmou1#

这与尼克的做法相似,但我认为逻辑更简单:

WITH e AS (
      SELECT e.*,
             SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
                 (PARTITION BY User_ID ORDER BY Event_Time DESC) AS grp
      FROM events e
     ),
     en as (
      SELECT e.*,
             COUNT(*) OVER (PARTITION BY user_id, grp) as cnt,
             ROW_NUMBER() OVER (PARTITION BY user_id, grp ORDER BY Event_Time) as seqnum
      FROM e
     )
SELECT en.*,
       (CASE WHEN grp = 0                   -- no purchase event
             THEN NULL 
             WHEN Event_Type = 'Purchase'   -- the event itself
             THEN NULL
             WHEN seqnum = 1 AND cnt = 2    -- the special case of "ONLY" 
             THEN 'Only'
             WHEN seqnum = 1                -- The first event
             THEN 'Introducer'
             WHEN seqnum = cnt - 1          -- The penultimate event
             THEN 'Closer'
             ELSE 'Influencer'
        END) as Path
FROM en
ORDER BY User_ID, Event_Time;

特别是,外部查询中的子查询是不必要的。这个 grp = 0 查找可能没有购买的最后一组事件。我还认为用事件总数和顺序计数器来编写逻辑更容易。
这是一把小提琴。

z4iuyo4d

z4iuyo4d2#

如果您使用的是支持窗口函数的dbms,那么可以使用几个cte来首先将行拆分为不同的购买,然后找到与每个购买相关的行号,最后计算 Path 根据您给出的条件:

WITH purchases AS (
  SELECT "Row", User_ID, Event_Time, Event_Type,
         COALESCE(SUM(CASE WHEN Event_Type = 'Purchase' THEN 1 ELSE 0 END) OVER
           (PARTITION BY User_ID ORDER BY Event_Time ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pnum
  FROM events
),
prows AS (
  SELECT "Row", User_ID, Event_Time, Event_Type, pnum,
         ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time) AS rn,
         ROW_NUMBER() OVER (PARTITION BY User_ID, pnum ORDER BY Event_Time DESC) AS drn
  FROM purchases
)
SELECT "Row", User_ID, Event_Time, Event_Type,
       CASE WHEN Event_Type = 'Purchase' OR
                 NOT EXISTS (SELECT * 
                             FROM prows r2 
                             WHERE r2.User_ID = r1.User_ID
                               AND r2.pnum = r1.pnum
                               AND r2.Event_Type = 'Purchase') THEN NULL
            WHEN rn = 1 AND drn = 2 THEN 'Only'
            WHEN rn = 1 THEN 'Introducer'
            WHEN drn = 2 THEN 'Closer'
            ELSE 'Influencer'
       END AS Path
FROM prows r1
ORDER BY User_ID, Event_Time

输出:

Row     User_ID     Event_Time  Event_Type  Path
1       1           2020-01-01  View        Introducer
2       1           2020-01-02  Click       Closer
3       1           2020-01-03  Purchase    (null)
4       2           2020-02-01  View        Introducer
5       2           2020-02-02  Click       Influencer
6       2           2020-02-03  View        Closer
7       2           2020-02-04  Purchase    (null)
8       2           2020-02-11  View        Only
9       2           2020-02-12  Purchase    (null)
10      2           2020-02-21  View        Introducer
11      2           2020-02-22  Click       Closer
12      2           2020-02-23  Purchase    (null)
13      2           2020-02-27  View        (null)
14      2           2020-02-28  Click       (null)
15      3           2020-03-01  View        Only
16      3           2020-03-02  Purchase    (null)

sqlfiddle上的sqlserver演示。同样的查询也将在postgresql和oracle上运行。

相关问题