hive 获取每组的完整行集

idfiyjo8  于 2023-08-04  发布在  Hive
关注(0)|答案(2)|浏览(165)

编辑将扩大此范围以包括R标记

你好
我有一个按id和月份排序的表,与下面的表类似,除了它有数百万个id's和数百个var列,month列可以是2023年的任何一个月。我需要填写每个ID缺失的月份,并添加一列,指示该ID在特定月份是否处于活动状态。如果id在表中没有特定月份的记录,则假定它们处于非活动状态。所需表中的总行数将是12* 个不同的id。我还为示例输入添加了所需的输出。对于R解决方案,我更喜欢tidyverse解决方案,但data.table选项也可以工作,因为它会更快,我只是可能不擅长修改它。任何帮助都是感激的。
以下是R的dputs

  1. input =structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
  2. 2L, 2L, 2L, 2L, 2L, 3L, 3L), month = c(202301L, 202304L, 202305L,
  3. 202301L, 202302L, 202303L, 202304L, 202305L, 202306L, 202307L,
  4. 202308L, 202309L, 202310L, 202311L, 202312L, 202307L, 202308L
  5. ), var1 = c(1L, 2L, 3L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
  6. 14L, 15L, 16L, 17L, 18L), var2 = 1:17), class = "data.frame", row.names = c(NA,
  7. -17L))
  8. output = structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
  9. 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
  10. 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), month = c(202301L, 202302L,
  11. 202303L, 202304L, 202305L, 202306L, 202307L, 202308L, 202309L,
  12. 202310L, 202311L, 202312L, 202301L, 202302L, 202303L, 202304L,
  13. 202305L, 202306L, 202307L, 202308L, 202309L, 202310L, 202311L,
  14. 202312L, 202301L, 202302L, 202303L, 202304L, 202305L, 202306L,
  15. 202307L, 202308L, 202309L, 202310L, 202311L, 202312L), var1 = c(1L,
  16. NA, NA, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, 5L, 6L, 7L, 8L, 9L,
  17. 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA, NA, NA, 17L,
  18. 18L, NA, NA, NA, NA), var2 = c(1L, NA, NA, 2L, 3L, NA, NA, NA,
  19. NA, NA, NA, NA, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
  20. 15L, NA, NA, NA, NA, NA, NA, 16L, 17L, NA, NA, NA, NA), active = c(1L,
  21. 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L,
  22. 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L,
  23. 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -36L))

字符串
样品输入:

  1. +----+--------+------+------+
  2. | id | month | var1 | var2 |
  3. +----+--------+------+------+
  4. | 1 | 202301 | 1 | 1 |
  5. | 1 | 202304 | 2 | 2 |
  6. | 1 | 202305 | 3 | 3 |
  7. | 2 | 202301 | 5 | 4 |
  8. | 2 | 202302 | 6 | 5 |
  9. | 2 | 202303 | 7 | 6 |
  10. | 2 | 202304 | 8 | 7 |
  11. | 2 | 202305 | 9 | 8 |
  12. | 2 | 202306 | 10 | 9 |
  13. | 2 | 202307 | 11 | 10 |
  14. | 2 | 202308 | 12 | 11 |
  15. | 2 | 202309 | 13 | 12 |
  16. | 2 | 202310 | 14 | 13 |
  17. | 2 | 202311 | 15 | 14 |
  18. | 2 | 202312 | 16 | 15 |
  19. | 3 | 202307 | 17 | 16 |
  20. | 3 | 202308 | 18 | 17 |
  21. +----+--------+------+------+


预期输出:

  1. +----+--------+------+------+--------+
  2. | id | month | var1 | var2 | active |
  3. +----+--------+------+------+--------+
  4. | 1 | 202301 | 1 | 1 | 1 |
  5. | 1 | 202302 | | | 0 |
  6. | 1 | 202303 | | | 0 |
  7. | 1 | 202304 | 2 | 2 | 1 |
  8. | 1 | 202305 | 3 | 3 | 1 |
  9. | 1 | 202306 | | | 0 |
  10. | 1 | 202307 | | | 0 |
  11. | 1 | 202308 | | | 0 |
  12. | 1 | 202309 | | | 0 |
  13. | 1 | 202310 | | | 0 |
  14. | 1 | 202311 | | | 0 |
  15. | 1 | 202312 | | | 0 |
  16. | 2 | 202301 | 5 | 4 | 1 |
  17. | 2 | 202302 | 6 | 5 | 1 |
  18. | 2 | 202303 | 7 | 6 | 1 |
  19. | 2 | 202304 | 8 | 7 | 1 |
  20. | 2 | 202305 | 9 | 8 | 1 |
  21. | 2 | 202306 | 10 | 9 | 1 |
  22. | 2 | 202307 | 11 | 10 | 1 |
  23. | 2 | 202308 | 12 | 11 | 1 |
  24. | 2 | 202309 | 13 | 12 | 1 |
  25. | 2 | 202310 | 14 | 13 | 1 |
  26. | 2 | 202311 | 15 | 14 | 1 |
  27. | 2 | 202312 | 16 | 15 | 1 |
  28. | 3 | 202301 | | | 0 |
  29. | 3 | 202302 | | | 0 |
  30. | 3 | 202303 | | | 0 |
  31. | 3 | 202304 | | | 0 |
  32. | 3 | 202305 | | | 0 |
  33. | 3 | 202306 | | | 0 |
  34. | 3 | 202307 | 17 | 16 | 1 |
  35. | 3 | 202308 | 18 | 17 | 1 |
  36. | 3 | 202309 | | | 0 |
  37. | 3 | 202310 | | | 0 |
  38. | 3 | 202311 | | | 0 |
  39. | 3 | 202312 | | | 0 |
  40. +----+--------+------+------+--------+

zz2j4svz

zz2j4svz1#

  1. with ids as (
  2. select distinct id from T
  3. ), months as (
  4. -- assuming there's at least one active id every month
  5. select distinct month from T
  6. -- where month between '202301' and '202312'
  7. )
  8. select i.id, m.month, t.var1, t.var2,
  9. case when t.id is null then 1 else 0 end as active
  10. from ids i cross join months m left outer join T t
  11. on t.id = i.id and t.month = m.month;

字符串
老实说,我真的不知道什么特别是关于Hive,但这是一般的想法,以“致密化”您的结果。缺失行的输出将为空。

rdlzhqv9

rdlzhqv92#

若要将缺少的月份添加到表中,可以将INSERT语句与生成缺少的月份的SELECT语句沿着使用。下面是一个将缺少的月份插入DOCTOR_CHECKUP表的示例:

  1. INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE)
  2. SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH
  3. FROM DUAL
  4. CONNECT BY LEVEL <= 24
  5. MINUS
  6. SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY')
  7. FROM DOCTOR_CHECKUP;

字符串
此查询将缺失的月份插入DOCTOR_CHECKUP表的CHECKED_DATE列。您可能需要调整列名和日期格式以匹配表结构。
可以使用子查询从同一表或不同表的另一列中选择数据,并将其包含在INSERT语句中。下面是一个例子:

  1. INSERT INTO DOCTOR_CHECKUP (CHECKED_DATE, ANOTHER_COLUMN)
  2. SELECT TO_CHAR(ADD_MONTHS(DATE '2016-01-01', LEVEL - 1), 'MM-YYYY') MTH, ANOTHER_COLUMN
  3. FROM ANOTHER_TABLE
  4. CONNECT BY LEVEL <= 24
  5. MINUS
  6. SELECT TO_CHAR(TO_DATE(CHECKED_DATE, 'DD-MON-YY'), 'MM-YYYY'), ANOTHER_COLUMN
  7. FROM DOCTOR_CHECKUP;


在此示例中,INSERT语句中包含ANOTHER_TABLE表中ANOTHER_COLUMN列的数据。可以将ANOTHER_TABLE替换为包含要包括的列的表的名称,将ANOTHER_COLUMN替换为要包括的列的名称。

展开查看全部

相关问题