如何使用配置单元查询3个大表中的相交值?

rwqw0loc  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(392)

我有3个非常大的表*的ip地址,并试图计数的数量共同的ip之间的3个表。我考虑过使用联接和子查询来查找这3个表之间IP的交集。如何用一个查询找到所有3个表的交集?
这是不正确的语法,但说明了我要实现的目标:

  1. SELECT COUNT(DISTINCT(a.ip)) FROM a, b, c WHERE a.ip = b.ip = c.ip

我已经看到了关于如何连接3个表的其他答案,但是对于hive和这个比例都没有。

  • 注意事项:

表a:70亿行
表b:18亿行
表c:1.68亿行
“表”实际上是由s3支持的配置单元元存储。
每个表中都有许多重复的IP
欢迎提出绩效建议。
如果使用sparksql而不是hive是一个更好的主意,那么也可以运行sparksql查询。

5lwkijsr

5lwkijsr1#

正确的语法是:

  1. SELECT COUNT(DISTINCT a.ip)
  2. FROM a JOIN
  3. b
  4. ON a.ip = b.ip JOIN
  5. c
  6. ON a.ip = c.ip;

这可能在我们有生之年不会结束。更好的方法是:

  1. select ip
  2. from (select distinct a.ip, 1 as which from a union all
  3. select distinct b.ip, 2 as which from b union all
  4. select distinct c.ip, 3 as which from c
  5. ) abc
  6. group by ip
  7. having sum(which) = 6;

我承认 sum(which) = 6 只是说三者都存在。因为 select distinct 在子查询中,您只需执行以下操作:

  1. having count(*) = 3
展开查看全部
vyu0f0g1

vyu0f0g12#

简单的解决方案:

  1. select count(*)
  2. from (select 1
  3. from (
  4. select 'a' as tab,ip from a
  5. union all select 'b' as tab,ip from b
  6. union all select 'c' as tab,ip from c
  7. ) t
  8. group by ip
  9. having count(case when tab = 'a' then 1 end) > 0
  10. and count(case when tab = 'b' then 1 end) > 0
  11. and count(case when tab = 'c' then 1 end) > 0
  12. ) t

这将不仅为您提供有关3个表交集(in_a=1、in_b=1、in_c=1)的信息,而且还提供有关所有其他组合的信息:

  1. select in_a
  2. ,in_b
  3. ,in_c
  4. ,count(*) as ips
  5. from (select max(case when tab = 'a' then 1 end) as in_a
  6. ,max(case when tab = 'b' then 1 end) as in_b
  7. ,max(case when tab = 'c' then 1 end) as in_c
  8. from (
  9. select 'a' as tab,ip from a
  10. union all select 'b' as tab,ip from b
  11. union all select 'c' as tab,ip from c
  12. ) t
  13. group by ip
  14. ) t
  15. group by in_a
  16. ,in_b
  17. ,in_c

... 还有更多信息:

  1. select sign(cnt_a) as in_a
  2. ,sign(cnt_b) as in_b
  3. ,sign(cnt_c) as in_c
  4. ,count(*) as unique_ips
  5. ,sum(cnt_total) as total_ips
  6. ,sum(cnt_a) as total_ips_in_a
  7. ,sum(cnt_b) as total_ips_in_b
  8. ,sum(cnt_c) as total_ips_in_c
  9. from (select count(*) as cnt_total
  10. ,count(case when tab = 'a' then 1 end) as cnt_a
  11. ,count(case when tab = 'b' then 1 end) as cnt_b
  12. ,count(case when tab = 'c' then 1 end) as cnt_c
  13. from (
  14. select 'a' as tab,ip from a
  15. union all select 'b' as tab,ip from b
  16. union all select 'c' as tab,ip from c
  17. ) t
  18. group by ip
  19. ) t
  20. group by sign(cnt_a)
  21. ,sign(cnt_b)
  22. ,sign(cnt_c)
展开查看全部

相关问题