hiveql:如何在array< string>

des4xlb0  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(618)

我正在hiveql中创建一个表和一列 duplicate_set 应该是一个数组,其中包含来自另一列的列表中的重复元素集 list . 例如给定一个表

  1. +-----------+-------------------------+----------------------+
  2. | id | list | duplicate_set |
  3. +-----------+-------------------------+----------------------+
  4. | 1 | ["1","2","2","3","3"] | ["2","3"] |
  5. +-----------+-------------------------+----------------------+
  6. | 2 | ["2","2","5","6"] | ["2"] |
  7. +-----------+-------------------------+----------------------+
  8. | 3 | ["2","4","5","6"] | [] |
  9. ...

提取重复元素并将它们放入一个集合的最佳方法是什么?是否有任何现有的自定义项?谢谢。

a1o7rhls

a1o7rhls1#

你可以分解数组,计算 row_number ,然后将重复的元素(行数大于1)聚合到集合中:

  1. with initial_data as (
  2. select 1 id ,array("1","2","2","3","3") list union all
  3. select 2 ,array("2","2","5","6") list union all
  4. select 3 ,array("2","4","5","6")
  5. )
  6. select s.id, s.list, collect_set(case when s.rn>1 then x end) duplicate_set
  7. from(
  8. select s.id, s.list, l.x, row_number() over(partition by id, l.x) as rn
  9. from initial_data s
  10. lateral view explode(list) l as x --array element x
  11. ) s
  12. group by s.id, s.list;

结果:

  1. id list duplicate_set
  2. 1 ["1","2","2","3","3"] ["2","3"]
  3. 2 ["2","2","5","6"] ["2"]
  4. 3 ["2","4","5","6"] []
展开查看全部

相关问题