hive查询优化

ccgok5k5  于 2021-06-27  发布在  Hive
关注(0)|答案(5)|浏览(410)

我的要求是获得学生的id和姓名有超过1个电子邮件id和类型=1。
我使用的查询

  1. select distinct b.id, b.name, b.email, b.type,a.cnt
  2. from (
  3. select id, count(email) as cnt
  4. from (
  5. select distinct id, email
  6. from table1
  7. ) c
  8. group by id
  9. ) a
  10. join table1 b on a.id = b.id
  11. where b.type=1
  12. order by b.id

请让我知道这是罚款或任何简单的版本可用。

  1. Sample data is like:
  2. id name email type
  3. 123 AAA abc@xyz.com 1
  4. 123 AAA acd@xyz.com 1
  5. 123 AAA ayx@xyz.com 3
  6. 345 BBB nch@xyz.com 1
  7. 345 BBB nch@xyz.com 1
  8. 678 CCC iuy@xyz.com 1
  9. Expected Output:
  10. 123 AAA abc@xyz.com 1 2
  11. 123 AAA acd@xyz.com 1 2
  12. 345 BBB nch@xyz.com 1 1
  13. 678 CCC iuy@xyz.com 1 1
tktrz96b

tktrz96b1#

这与您的查询非常相似,但在这里,我在初始步骤(在内部查询中)过滤数据,以便连接不会在较少的数据上发生

  1. select distinct b.id,b.name,b.email,b.type,intr_table.cnt from table1 orig_table join
  2. (
  3. select a.id,a.type,count(a.email) as cnt from table1 as a where a.type=1 group by a
  4. ) intr_table on inter_table.id=orig_table.id,inter_table.type=orig_table.type
km0tfn4u

km0tfn4u2#

还有一种方法 collect_set 取返回数组的大小计算不同的邮件。
演示:

  1. --your data example
  2. with table1 as ( --use your table instead of this
  3. select stack(6,
  4. 123, 'AAA', 'abc@xyz.com', 1,
  5. 123, 'AAA', 'acd@xyz.com', 1,
  6. 123, 'AAA', 'ayx@xyz.com', 3,
  7. 345, 'BBB', 'nch@xyz.com', 1,
  8. 345, 'BBB', 'nch@xyz.com', 1,
  9. 678, 'CCC', 'iuy@xyz.com', 1
  10. ) as (id, name, email, type )
  11. )
  12. --query
  13. select distinct id, name, email, type,
  14. size(collect_set(email) over(partition by id)) cnt
  15. from table1
  16. where type=1

结果:

  1. id name email type cnt
  2. 123 AAA abc@xyz.com 1 2
  3. 123 AAA acd@xyz.com 1 2
  4. 345 BBB nch@xyz.com 1 1
  5. 678 CCC iuy@xyz.com 1 1

我们在这里仍然需要distinct,因为解析函数不会像case那样删除重复项 345 BBB nch@xyz.com .

展开查看全部
omhiaaxx

omhiaaxx3#

你可以用 group by -> having count() 对于这个要求。

  1. select distinct b.id
  2. , b.name,
  3. , b.email
  4. , b.type
  5. from table1 b
  6. where id in
  7. (select distinct id from table1 group by email, id having count(email) > 1)
  8. and b.type=1
  9. order by b.id
xmd2e60i

xmd2e60i4#

您可以尝试使用count()函数的分析方式:

  1. SELECT sub.ID, sub.NAME
  2. FROM (SELECT ID, NAME, COUNT (*) OVER (PARTITION BY ID, EMAIL) cnt
  3. FROM raw.crddacia_raw) sub
  4. WHERE sub.cnt > 1 AND sub.TYPE = 1
wmomyfyw

wmomyfyw5#

我强烈建议使用窗口函数。但是,hive不支持 count(distinct) 作为一个窗口函数。有不同的方法来解决这个问题。一个是 dense_rank() 学生:

  1. select id, name, email, type, cnt
  2. from (select t1.*,
  3. (dense_rank() over (partition by id order by email) +
  4. dense_rank() over (partition by id order by email desc)
  5. ) as cnt
  6. from table1 t1
  7. ) t
  8. where type = 1;

我希望这比你的版本有更好的性能。但是,值得测试不同的版本,看看哪个版本的性能更好(并且可以随时回来让其他人知道哪个更好)。

相关问题