配置单元集合集()

lbsnaicq 于 2021-06-26 发布在 Hive

关注(0)|答案(4)|浏览(248)

假设我有两张table： timeperiod1 以及 timeperiod2 . timeperiod1 有这样一个模式：

cluster  characteristic
A        1
A        2
A        3
B        2
B        3
``` `timeperiod2` 有这样一个模式：

cluster characteristic
A 1
A 2
B 2
B 3
B 4

我想按集群计算两个时间段（即表）之间的集合差。我的计划（请让我知道任何更好的方法）是1）收集集合（我知道如何做），然后2）比较集合的差异（我不知道如何做）。
1） 我知道：

CREATE TABLE collect_char_wk1 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod1
GROUP BY cluster;

CREATE TABLE collect_char_wk2 STORED AS ORC AS
SELECT cluster, COLLECT_SET(characteristic)
FROM timeperiod2
GROUP BY cluster;

得到 `collect_char_wk1` :

cluster characteristic
A [1,2,3]
B [2,3]

为了得到 `collect_char_wk2` :

cluster characteristic
A [1,2]
B [2,3,4]

2） 有没有一个Hive函数可以用来计算集差？我对java不够熟悉，无法编写自己的set\u diff（）配置单元udf/udaf。
结果应该是一张表， `set_diff_wk1_to_wk2` :

cluster set_diff
A 1
B 0

上面是一个玩具的例子，我的实际数据是在数百亿行和几个列的规模上，因此需要一个计算效率高的解决方案。我的数据存储在hdfs中，我使用的是hiveql+python。

sql Hive

来源：https://stackoverflow.com/questions/43054742/hive-collect-set

4条答案

按热度按时间

ipakzgxi1#

如果您试图获得period1中每个集群中不在period2中的特征数，那么只需使用 left join 以及 group by .

select t1.cluster, count(case when t2.characteristic is null then 1 end) as set_diff
from timeperiod1 t1
left join timeperiod2 t2 on t1.cluster=t2.cluster and t1.characteristic=t2.characteristic
group by t1.cluster

赞(0）回复(0）举报 2021-06-26

qvtsj1bj2#

select      cluster

           ,count(*)                                          as count_total_characteristic 
           ,count(case when in_1 = 1 and in_2 = 1 then 1 end) as count_both_1_and_2
           ,count(case when in_1 = 1 and in_2 = 0 then 1 end) as count_only_in_1
           ,count(case when in_1 = 0 and in_2 = 1 then 1 end) as count_only_in_2

           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 1 then characteristic end)) as both_1_and_2
           ,sort_array(collect_list(case when in_1 = 1 and in_2 = 0 then characteristic end)) as only_in_1
           ,sort_array(collect_list(case when in_1 = 0 and in_2 = 1 then characteristic end)) as only_in_2

from       (select      cluster
                       ,characteristic
                       ,max(case when tab = 1 then 1 else 0 end) as in_1
                       ,max(case when tab = 2 then 1 else 0 end) as in_2

            from        (           select 1 as tab,cluster,characteristic from timeperiod1
                        union all   select 2 as tab,cluster,characteristic from timeperiod2
                        ) t

            group by    cluster
                       ,characteristic
            ) t

group by    cluster

order by    cluster
;

+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| cluster | count_total_characteristic | count_both_1_and_2 | count_only_in_1 | count_only_in_2 | both_1_and_2 | only_in_1 | only_in_2 |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+
| A       |                          3 |                  2 |               1 |               0 | [1,2]        | [3]       | []        |
| B       |                          3 |                  2 |               0 |               1 | [2,3]        | []        | [4]       |
+---------+----------------------------+--------------------+-----------------+-----------------+--------------+-----------+-----------+

赞(0）回复(0）举报 2021-06-26

s6fujrry3#

您可以使用brickhouse udf，它有许多函数来执行您描述的操作。更具体地说，您可以使用wiki中解释的set_diff。自述文件将指导您如何创建jar文件。
您可以在查询中包含jar文件：

ADD jar /PATH/TO/JARFILE/brickhouse-<VERSIONS>-SNAPSHOT.jar

然后使用本指南访问以下功能：https://github.com/klout/brickhouse/blob/master/src/main/resources/brickhouse.hql
希望这有帮助。

赞(0）回复(0）举报 2021-06-26

vnjpjtjt4#

SELECT 
 t1.cluster t1_cluster, t2.cluster t2_cluster,
 COLLECT_SET(t1.characteristic) as t1_set, 
 COLLECT_SET(t2.characteristic) as t2_set,
 (SIZE(COLLECT_SET(t1.characteristic)) - 
  SIZE(COLLECT_SET(t2.characteristic))) 
 as set_diff
FROM timeperiod1 t1
INNER JOIN timeperiod2 t2 ON (t1.cluster=t2.cluster)
GROUP BY t1.cluster, t2.cluster;

这会给出集合中的差异，但是，您需要一个python函数来返回集合中实际缺少的值。希望这有帮助

赞(0）回复(0）举报 2021-06-26

我来回答

配置单元集合集()

4条答案

相关问题

热门标签

最新问答