How can I query 3 large tables for intersecting values using Hive?

Posted by rwqw0loc on 2021-05-29 in Hadoop


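I have 3 very large tables of IP addresses and am trying to count the number of IPs common to all 3 tables. I have considered using joins and subqueries to find the intersection of IPs between these 3 tables. How can I find the intersection of all 3 tables with a single query?
This is not valid syntax, but it illustrates what I am trying to accomplish: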

SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip

I have seen other answers about how to join 3 tables, but none for Hive and none at this scale.

Notes:

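Table a: 7 billion rows
Table b: 1.8 billion rows
Table c: 168 million rows
The "tables" are actually Hive metastore tables backed by S3.
There are many duplicate IPs in each table.
Performance suggestions are welcome.
If using Spark SQL instead of Hive is a better idea, I can run Spark SQL queries as well.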

Tags: sql hadoop hive apache-spark

Source: https://stackoverflow.com/questions/45380093/how-can-i-query-3-large-tables-for-intersecting-values-using-hive

2 Answers


5lwkijsr1#

The correct syntax is:

SELECT COUNT(DISTINCT a.ip)
FROM a JOIN
     b
     ON a.ip = b.ip JOIN
     c
     ON a.ip  = c.ip;

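This probably won't finish in our lifetimes: because each table holds many duplicate IPs, the join multiplies the matching rows for every IP before COUNT(DISTINCT ...) collapses them again. A better approach is: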

select ip
from (select distinct a.ip, 1 as which from a union all
      select distinct b.ip, 2 as which from b union all
      select distinct c.ip, 3 as which from c
     ) abc
group by ip
having sum(which) = 6;

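Admittedly, sum(which) = 6 is just a way of saying that all three are present (1 + 2 + 3 = 6, and no other combination of the markers sums to 6). Because of the select distinct in each subquery, an ip contributes at most one row per table, so you can simply use: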

having count(*) = 3
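The question asks for a count rather than the list of IPs, so the query above can be wrapped in an outer count(*). The following is a minimal sketch for checking the logic on toy data, not a Hive benchmark: Python's built-in sqlite3 stands in for Hive (an assumption made so the example runs anywhere), and the sample IPs and the helper names load_sample and count_common are invented for illustration.

```python
import sqlite3

# Invented sample data; each list plays the role of one of the large tables.
SAMPLE = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.2", "10.0.0.5"],
}

def load_sample():
    """Create in-memory tables a, b, c (SQLite here, standing in for Hive)."""
    conn = sqlite3.connect(":memory:")
    for name, ips in SAMPLE.items():
        conn.execute(f"create table {name} (ip text)")
        conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])
    return conn

# The answer's query, wrapped in an outer count(*) to return a single number.
COUNT_COMMON = """
select count(*)
from (select ip
      from (select distinct ip, 1 as which from a union all
            select distinct ip, 2 as which from b union all
            select distinct ip, 3 as which from c
           ) abc
      group by ip
      having {condition}
     ) common
"""

def count_common(conn, condition):
    """Count the IPs whose group satisfies the given HAVING condition."""
    return conn.execute(COUNT_COMMON.format(condition=condition)).fetchone()[0]

conn = load_sample()
by_sum = count_common(conn, "sum(which) = 6")
by_count = count_common(conn, "count(*) = 3")
print(by_sum, by_count)
```

Both HAVING conditions select the same IPs here because the inner select distinct guarantees at most one row per table for each ip.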


2021-05-29

vyu0f0g12#

A simple solution:

select      count(*)
from       (select      1
            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t
            group by    ip
            having      count(case when tab = 'a' then 1 end) > 0
                    and count(case when tab = 'b' then 1 end) > 0
                    and count(case when tab = 'c' then 1 end) > 0
            ) t

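The following query gives you information not only about the intersection of the 3 tables (in_a=1, in_b=1, in_c=1) but also about every other combination. Note that a flag is NULL, not 0, when the ip is absent from that table, because the case expression has no else branch: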

select      in_a
           ,in_b
           ,in_c
           ,count(*)    as ips
from       (select      max(case when tab = 'a' then 1 end)  as in_a
                       ,max(case when tab = 'b' then 1 end)  as in_b
                       ,max(case when tab = 'c' then 1 end)  as in_c
            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t
            group by    ip
            ) t
group by    in_a
           ,in_b
           ,in_c
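To see what the combination breakdown looks like, here is a small sketch that runs the same query on toy data. It assumes Python's built-in sqlite3 as a stand-in for Hive, and the sample IPs and the helper name load_sample are invented for illustration; absent flags come back as None (SQL NULL).

```python
import sqlite3

# Invented sample data; each list plays the role of one of the large tables.
SAMPLE = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.2", "10.0.0.5"],
}

def load_sample():
    """Create in-memory tables a, b, c (SQLite here, standing in for Hive)."""
    conn = sqlite3.connect(":memory:")
    for name, ips in SAMPLE.items():
        conn.execute(f"create table {name} (ip text)")
        conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])
    return conn

# Same shape as the query above: one flag per table, then count IPs per flag combination.
COMBINATIONS = """
select in_a, in_b, in_c, count(*) as ips
from (select max(case when tab = 'a' then 1 end) as in_a,
             max(case when tab = 'b' then 1 end) as in_b,
             max(case when tab = 'c' then 1 end) as in_c
      from (          select 'a' as tab, ip from a
            union all select 'b' as tab, ip from b
            union all select 'c' as tab, ip from c
           ) t
      group by ip
     ) t
group by in_a, in_b, in_c
"""

conn = load_sample()
# Map each (in_a, in_b, in_c) combination to its number of distinct IPs.
combos = {(in_a, in_b, in_c): ips
          for in_a, in_b, in_c, ips in conn.execute(COMBINATIONS)}
for flags, ips in combos.items():
    print(flags, ips)
```

The row with flags (1, 1, 1) is the three-way intersection; the other rows show how many IPs appear in only some of the tables.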

... and even more information:

select      sign(cnt_a)                 as in_a
           ,sign(cnt_b)                 as in_b
           ,sign(cnt_c)                 as in_c
           ,count(*)                    as unique_ips
           ,sum(cnt_total)              as total_ips
           ,sum(cnt_a)                  as total_ips_in_a
           ,sum(cnt_b)                  as total_ips_in_b
           ,sum(cnt_c)                  as total_ips_in_c
from       (select      count(*)                                as cnt_total
                       ,count(case when tab = 'a' then 1 end)   as cnt_a
                       ,count(case when tab = 'b' then 1 end)   as cnt_b
                       ,count(case when tab = 'c' then 1 end)   as cnt_c
            from        (
                                    select 'a' as tab,ip from a
                        union all   select 'b' as tab,ip from b
                        union all   select 'c' as tab,ip from c
                        ) t
            group by    ip
            ) t
group by    sign(cnt_a)
           ,sign(cnt_b)
           ,sign(cnt_c)
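As a sanity check of the sign-based variant, the sketch below runs it on toy data. It assumes Python's built-in sqlite3 in place of Hive, and because not every SQLite build ships a sign() function, the sketch registers its own; the sample IPs and the helper name load_sample are likewise invented for illustration.

```python
import sqlite3

# Invented sample data; each list plays the role of one of the large tables.
SAMPLE = {
    "a": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "b": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.4"],
    "c": ["10.0.0.1", "10.0.0.2", "10.0.0.5"],
}

def load_sample():
    """Create in-memory tables a, b, c (SQLite here, standing in for Hive)."""
    conn = sqlite3.connect(":memory:")
    # Register sign() ourselves so the query does not depend on the SQLite build.
    conn.create_function("sign", 1, lambda x: (x > 0) - (x < 0))
    for name, ips in SAMPLE.items():
        conn.execute(f"create table {name} (ip text)")
        conn.executemany(f"insert into {name} values (?)", [(ip,) for ip in ips])
    return conn

# Per-ip row counts in the inner query, aggregated per presence combination outside.
DETAILS = """
select sign(cnt_a) as in_a, sign(cnt_b) as in_b, sign(cnt_c) as in_c,
       count(*)       as unique_ips,
       sum(cnt_total) as total_ips,
       sum(cnt_a)     as total_ips_in_a,
       sum(cnt_b)     as total_ips_in_b,
       sum(cnt_c)     as total_ips_in_c
from (select count(*)                              as cnt_total,
             count(case when tab = 'a' then 1 end) as cnt_a,
             count(case when tab = 'b' then 1 end) as cnt_b,
             count(case when tab = 'c' then 1 end) as cnt_c
      from (          select 'a' as tab, ip from a
            union all select 'b' as tab, ip from b
            union all select 'c' as tab, ip from c
           ) t
      group by ip
     ) t
group by sign(cnt_a), sign(cnt_b), sign(cnt_c)
"""

conn = load_sample()
# Map (in_a, in_b, in_c) to (unique_ips, total_ips, total_in_a, total_in_b, total_in_c).
details = {row[:3]: row[3:] for row in conn.execute(DETAILS)}
for flags, stats in details.items():
    print(flags, stats)
```

Unlike the max(case ...) version, the flags here are 0 rather than NULL for an absent table, since count() returns 0 when no row matches.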


2021-05-29
