配置单元集合集()

lbsnaicq  于 2021-06-26  发布在  Hive
关注(0)|答案(4)|浏览(248)

假设我有两张table: timeperiod1 以及 timeperiod2 . timeperiod1 有这样一个模式:

cluster  characteristic
A        1
A        2
A        3
B        2
B        3
``` `timeperiod2` 有这样一个模式:

cluster characteristic
A 1
A 2
B 2
B 3
B 4

我想按集群计算两个时间段(即表)之间的集合差。我的计划(请让我知道任何更好的方法)是1)收集集合(我知道如何做),然后2)比较集合的差异(我不知道如何做)。
1) 我知道:

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod2
GROUP BY cluster;

得到 `collect_char_wk1` :

cluster characteristic
A [1,2,3]
B [2,3]

为了得到 `collect_char_wk2` :

cluster characteristic
A [1,2]
B [2,3,4]

2) 有没有一个Hive函数可以用来计算集差?我对java不够熟悉,无法编写自己的set\u diff()配置单元udf/udaf。
结果应该是一张表, `set_diff_wk1_to_wk2` :

cluster set_diff
A 1
B 0

上面是一个玩具的例子,我的实际数据是在数百亿行和几个列的规模上,因此需要一个计算效率高的解决方案。我的数据存储在hdfs中,我使用的是hiveql+python。
ipakzgxi

ipakzgxi1#

如果您试图获得period1中每个集群中不在period2中的特征数,那么只需使用 left join 以及 group by .

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic
group by t1.cluster
qvtsj1bj

qvtsj1bj2#

select      cluster

           ,count(*)                                          as count_total_characteristic 
           ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
           ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
           ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
           ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from       (select      cluster
                       ,characteristic
                       ,max(case when tab = 1 then 1 else 0 end) as in_1
                       ,max(case when tab = 2 then 1 else 0 end) as in_2

            from        (           select 1 as tab,cluster,characteristic from timeperiod1
                        union all   select 2 as tab,cluster,characteristic from timeperiod2
                        ) t

            group by    cluster
                       ,characteristic
            ) t

group by    cluster

order by    cluster
;
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
s6fujrry

s6fujrry3#

您可以使用brickhouse udf,它有许多函数来执行您描述的操作。更具体地说,您可以使用wiki中解释的set_diff。自述文件将指导您如何创建jar文件。
您可以在查询中包含jar文件:

ADD jar /PATH/TO/JARFILE/brickhouse-<VERSIONS>-SNAPSHOT.jar

然后使用本指南访问以下功能:https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql
希望这有帮助。

vnjpjtjt

vnjpjtjt4#

SELECT 
 t1.cluster t1_cluster, t2.cluster t2_cluster,
 COLLECT_SET(t1.characteristic) as t1_set, 
 COLLECT_SET(t2.characteristic) as t2_set,
 (SIZE(COLLECT_SET(t1.characteristic)) - 
  SIZE(COLLECT_SET(t2.characteristic))) 
 as set_diff
FROM timeperiod1 t1
INNER JOIN timeperiod2 t2 ON (t1.cluster=t2.cluster)
GROUP BY t1.cluster, t2.cluster;

这会给出集合中的差异,但是,您需要一个python函数来返回集合中实际缺少的值。希望这有帮助

相关问题