我有一个数据集,其中将有一个多个记录 id
在其他列上分组的列字段。对于这个数据集,我只想为每个组的最新记录派生一个新列。我用的是 case statement
导出新列和 union
获取最新记录的值。我想避免使用 UNION
因为这是一个昂贵的操作 spark-sql
.
输入:
person_id order_id order_ts order_amt
1 1 2020-01-01 10:10:10 10
1 2 2020-01-01 10:15:15 15
2 3 2020-01-01 10:10:10 0
2 4 2020-01-01 10:15:15 15
根据上述输入,person_id 1有两个订单(1,2),person_id 2有两个订单(3,4)。我只想为给定人员的最新订单导出一列。
预期产量:
person_id order_id order_ts order_amt valid_order
1 1 2020-01-01 10:10:10 10 N
1 2 2020-01-01 10:15:15 15 Y
2 3 2020-01-01 10:10:10 0 N
2 4 2020-01-01 10:15:15 15 Y
我尝试在下面的查询中使用 UNION
在查询中:
select person_id, order_id, order_ts, order_amt, valid_order
from
(
select *, row_number() over(partition by order_id order by derive_order) as rnk
from
(
select person_id, order_id, order_ts, order_amt, 'N' as valid_order, 'before' as derive_order
from test_table
UNION
select person_id, order_id, order_ts, order_amt,
case when order_amt is not null and order_amt >0 then 'Y' else 'N' end as valid_order,
'after' as derive_order
from
(
select *, row_number() over(partition by person_id order by order_ts desc) as rnk
from test_table
) where rnk = 1
) final
) where rnk = 1 order by person_id, order_id;
我还使用了 left outer join
以及 inner join
.
联接查询:
select final.person_id, final.order_id, final.order_ts, final.order_amt,
case when final.valid_order is null then 'N' else final.valid_order end as valid_order
from
(
select c.person_id, c.order_id, c.order_ts, c.order_amt, d.valid_order from test_table c
left outer join
(
select a.*, case when a.order_amt is not null and a.order_amt >0 then 'Y' else 'N' end as valid_order
from test_table a
inner join
(
select person_id, max(order_id) as order_id from test_table group by 1
) b on a.person_id = b.person_id and a.order_id = b.order_id
) d on c.order_id = d.order_id
) final order by person_id, order_id;
我们的输入数据集将有大约2000万条记录。除了上述查询之外,还有没有更好的优化方法来获得相同的输出。
任何帮助都将不胜感激。
2条答案
按热度按时间ou6hu8tu1#
它可以在没有连接或并集的情况下完成。还有这种情况
a.order_amt is not null and a.order_amt >0
是多余的,因为如果amount>0,则它已经不是null。7dl7o3gd2#
检查是否有用-
使用spark dsl
使用spark sql