执行最接近匹配时间的联接

uurity8g  于 2021-07-26  发布在  Java
关注(0)|答案(2)|浏览(400)

作为一个简化的场景,假设我有以下表格:
交易视图:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-12 15:00:00 | ...
1       | 2020-07-12 14:00:00 | ...
1       | 2020-07-12 13:00:00 | ...
1       | 2020-07-03 09:00:00 | ...
2       | 2020-07-12 15:00:00 | ...
2       | 2020-07-12 14:00:00 | ...
2       | 2020-07-12 13:00:00 | ...
3       | 2020-07-12 15:00:00 | ...

购买:

user_id | purchased_at        | more attributes ...
------------------------------
1       | 2020-07-03 09:02:10 | ...
1       | 2020-07-12 14:04:53 | ...
2       | 2020-07-12 15:20:12 | ...

(并非所有浏览交易的用户都会购买;一些用户在购买前会多次查看交易;一些用户会进行多次购买。每个购买的用户必须至少查看过一次交易;可能就在购买之前!)
我要的是精选的 deal_views (包括一些附加属性;为简单起见,上面省略了)为每个用户在购买前最近发生的。这里的动机是可见的交易随着时间的推移而改变,所以我们想知道“最近在购买之前,交易的状态如何?”。
因此,对于上述数据,这将是:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-03 09:00:00 | ...
1       | 2020-07-11 14:00:00 | ...
2       | 2020-07-11 15:00:00 | ...

sql性能会很好,但这不是一个大问题(只要它是可运行的!)因为这只是我希望执行的一次性查询。结果表中应该有大约200000行。

epfja78i

epfja78i1#

雪花支持横向连接,因此可以执行以下操作:

select p.*, d.*
from purchases p left join lateral
     (select d.*
      from deals d
      where d.user_id = p.user_id and
            d.viewed_at < p.purchased_at
      order by d.viewed_at desc
      limit 1
     ) d
     on 1=1;

你也可以 row_number() :

select pd.*
from (select p.*, d.*,   -- select the columns explicitly to avoid duplicate column names
            row_number() over (partition by p.user_id, p.purchased_at order by d.viewed_at desc) as seqnum
      from purchases p left join
           deals d
           on d.user_id = p.user_id and d.viewed_at < p.purchased_at 
     ) pd
where seqnum = 1;
up9lanfz

up9lanfz2#

(免责声明:我没有雪花,所以我没有尝试或测试。我在mssql上进行了测试。)

/*
DECLARE @deal_views AS TABLE (
    id int PRIMARY KEY IDENTITY(1,1), 
    [user_id] INT NOT NULL,
    viewed_at DATETIME NOT NULL);

DECLARE @purchases AS TABLE (
    id int PRIMARY KEY IDENTITY(1,1), 
    [user_id] INT NOT NULL,
    purchased_at DATETIME NOT NULL);

INSERT INTO @deal_views VALUES
    (1, '2020-07-12 15:00:00'), (1, '2020-07-12 14:00:00'), (1, '2020-07-12 13:00:00'),
    (1, '2020-07-03 09:00:00'),
    (2, '2020-07-12 15:00:00'), (2, '2020-07-12 14:00:00'), (2, '2020-07-12 13:00:00'),
    (3, '2020-07-12 15:00:00')

INSERT INTO @purchases VALUES
    (1, '2020-07-03 09:02:10'),
    (1, '2020-07-12 14:04:53'),
    (2, '2020-07-12 15:20:12')*/

SELECT p.[user_id], MAX(d.viewed_at) AS viewed_at
FROM purchases p
JOIN deal_views d
    ON  p.[user_id] = d.[user_id]
    AND p.purchased_at > d.viewed_at
GROUP BY p.[user_id], p.purchased_at

输出:

db小提琴演示
如果您需要来自deal\u视图的其他属性,请将其与deal视图连接起来。

SELECT d.id,
       d.user_id,
       d.viewed_at,
       d.attributes2, d.attributes3, d.attributes4
FROM (
    SELECT p.[user_id], MAX(d.viewed_at) AS viewed_at
    FROM purchases p
    JOIN deal_views d
        ON  p.[user_id] = d.[user_id]
        AND p.purchased_at > d.viewed_at
    GROUP BY p.[user_id], p.purchased_at) mv
JOIN deal_views d
    ON  mv.[user_id] = d.[user_id]
    AND mv.viewed_at > d.viewed_at

相关问题