How can I aggregate multiple column-based conditions for better performance?

Posted by 2jcobegt on 2021-06-26 in Hive

I have a table with a billion rows, with data in the following format:

id  col1 col2
1   100  21
1   110  22
1   120  21
1   20   35
2   230  22
2   2    22
3   456  31
3   30   21
3   2    31
4   200  33
5   45   34

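I need to find the minimum and maximum of col1 under various conditions on col2 and return them in a result table. Currently I left join the table to itself, but this is inefficient: it takes more than 70 minutes.
The sample query I am running now is: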

select distinct t.id, t1.m1 colA,t2.m2 colB,t3.m3 colC
from table1 t
left join (select id,min(col1) over (partition by id) m1  from table1 where col2=21) t1 on (t.id=t1.id)  
left join (select id,min(col1) over (partition by id) m2 from table1 where col2 in (22,23,34) ) t2 on (t.id=t2.id) 
left join (select id,max(col1) over (partition by id) m3 from table1 id where col2 in (21,33,22,35) )t3 on (t.id=t3.id)

Is there a better, more efficient way to get the same result in Hive 1.2?
The query above returns:

id  colA    colB   colC 
1   100     110    120
2   NULL    2      230
3   30      NULL   30
4   NULL    NULL   200
5   NULL    45     NULL

Note: col1 is actually a timestamp.

sql Hive

Source: https://stackoverflow.com/questions/52415870/how-to-aggregate-column-based-multiple-conditions-for-better-performance

1 Answer

e4yzc0pl1#

I would suggest using "conditional aggregation", which basically means placing a case expression inside the aggregate function:

select
      t.id
    -- a case with no else returns NULL for non-matching rows, which min/max ignore
    , min(case when t.col2 = 21 then t.col1 end)                 colA
    , min(case when t.col2 in (22,23,34) then t.col1 end)        colB
    , max(case when t.col2 in (21,33,22,35) then t.col1 end)     colC
from table1 t
group by t.id

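This reduces the number of passes over the source table that the multiple left joins require.
Also note that while "select distinct" may have produced the result you wanted, it is an "expensive" option. GROUP BY also produces unique rows, but at the same time provides the ability to aggregate.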
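
As a quick sanity check, the sketch below runs a conditional-aggregation query over the sample rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in for Hive (an assumption made so the check runs locally; it says nothing about Hive performance), and the function name run_conditional_aggregation is hypothetical. colA uses min here, as in the question's own query and expected output.

```python
import sqlite3

# Sample rows copied from the question: (id, col1, col2)
ROWS = [
    (1, 100, 21), (1, 110, 22), (1, 120, 21), (1, 20, 35),
    (2, 230, 22), (2, 2, 22),
    (3, 456, 31), (3, 30, 21), (3, 2, 31),
    (4, 200, 33),
    (5, 45, 34),
]

# Conditional aggregation: a CASE without ELSE yields NULL for non-matching
# rows, and MIN/MAX skip NULLs, so each output column only sees its own subset.
QUERY = """
select
      t.id
    , min(case when t.col2 = 21 then t.col1 end)                 colA
    , min(case when t.col2 in (22, 23, 34) then t.col1 end)      colB
    , max(case when t.col2 in (21, 33, 22, 35) then t.col1 end)  colC
from table1 t
group by t.id
order by t.id
"""


def run_conditional_aggregation(rows):
    """Load rows into an in-memory SQLite table and run QUERY once over it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table table1 (id integer, col1 integer, col2 integer)"
        )
        conn.executemany("insert into table1 values (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_conditional_aggregation(ROWS):
        print(row)
```

SQL NULLs come back as None in Python, so the rows can be compared directly against the result table in the question. The table is read once and grouped once, which is the point of the rewrite over three self-joins.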

Answered 2021-06-26
