hadoop 提高这种窗口密集型查询的性能

3df52oht  于 2023-10-15  发布在  Hadoop
关注(0)|答案(1)|浏览(142)

下面是HQL中的代码。它从一个超过25亿行、大约334列的表中提取数据。它需要一段时间来运行,所以我正在寻找任何可能的性能提升。这是查询的主要块,还有后续的CTE执行一些进一步的处理,但下面是最昂贵的查询。
我知道窗口函数在这种情况下可能很重,但是它们是必需的,因为需要高度特定的时间戳计算。
任何帮助都非常感谢!

  1. WITH t1 AS (
  2. SELECT
  3. *,
  4. CASE
  5. WHEN (LEAD(hour) OVER (PARTITION BY id ORDER BY `time`) - hour) > 1
  6. THEN NULL
  7. WHEN (LEAD(hour) OVER (PARTITION BY id ORDER BY `time`) - hour) = 1
  8. THEN UNIX_TIMESTAMP(CONCAT(`date`, " ", LPAD((hour + 1), 2, 0), ":00:00"), 'dd/MM/yyyy HH:mm:ss') - UNIX_TIMESTAMP(CONCAT(`date`, " ", REGEXP_REPLACE(`time`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss')
  9. WHEN (LAG(s_id) OVER (PARTITION BY id ORDER BY `time`) = s_id) = TRUE
  10. AND (hour - LAG(hour) OVER (PARTITION BY id ORDER BY `time`)) = 1
  11. AND (LEAD(hour) OVER (PARTITION BY id ORDER BY `time`) - hour) = 0
  12. THEN (LEAD(UNIX_TIMESTAMP(CONCAT(`date`, " ", REGEXP_REPLACE(`time`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss'), 1) OVER (PARTITION BY id ORDER BY `time`) -
  13. UNIX_TIMESTAMP(CONCAT(`date`, " ", REGEXP_REPLACE(`time`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss') + LAG(hour_overlap_add) OVER (PARTITION BY id ORDER BY `time`))
  14. ELSE LEAD(unix_timestamp(CONCAT(`date`, " ", regexp_replace(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss'), 1) OVER (PARTITION BY id, hour ORDER BY `time`)
  15. - unix_timestamp(CONCAT(`date`, " ", regexp_replace(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss')
  16. END AS time_to_next_trans
  17. FROM(
  18. SELECT
  19. *,
  20. (overlap_time_to_next_trans - sec_between_current_trans_and_next_hr) as hour_overlap_add
  21. FROM(
  22. SELECT
  23. c_id,
  24. s_id,
  25. id,
  26. rat,
  27. dt,
  28. `date`,
  29. `time`,
  30. hour,
  31. (LEAD(s_id) OVER (PARTITION BY id ORDER BY `time`) = s_id) AS s_id_change,
  32. LEAD(hour) OVER (PARTITION BY id ORDER BY `time`) - hour AS difference_hour,
  33. UNIX_TIMESTAMP(CONCAT(`date`, " ", LPAD((hour + 1), 2, 0), ":00:00"), 'dd/MM/yyyy HH:mm:ss')
  34. - UNIX_TIMESTAMP(CONCAT(`date`, " ", REGEXP_REPLACE(`time`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss') AS sec_between_current_trans_and_next_hr,
  35. LEAD(UNIX_TIMESTAMP(CONCAT(`DATE`, " ", REGEXP_REPLACE(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss'), 1) OVER (PARTITION BY id ORDER BY `time`) -
  36. UNIX_TIMESTAMP(CONCAT(`DATE`, " ", REGEXP_REPLACE(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss') AS overlap_time_to_next_trans
  37. FROM database.tablename
  38. WHERE dt = "${rundate}” AND ID <> 0
  39. )j
  40. )o
  41. )
7dl7o3gd

7dl7o3gd1#

在步骤1中完成大部分推导。然后重用它们而不再次派生。

  1. with j as
  2. (
  3. SELECT
  4. c_id
  5. ,s_id
  6. ,id
  7. ,rat
  8. ,dt
  9. ,`date`
  10. ,`time`
  11. ,hour
  12. ,(LEAD(s_id) OVER (PARTITION BY id ORDER BY `time`) = s_id) AS s_id_change_lead
  13. ,(LAG(s_id) OVER (PARTITION BY id ORDER BY `time`) = s_id) AS s_id_change_lag
  14. ,LEAD(hour) OVER (PARTITION BY id ORDER BY `time`) - hour AS difference_hour
  15. ,hour - LAG(hour) OVER (PARTITION BY id ORDER BY `time`) as new_hour_diff
  16. ,UNIX_TIMESTAMP(CONCAT(`date`, " ", LPAD((hour + 1), 2, 0), ":00:00"), 'dd/MM/yyyy HH:mm:ss') - UNIX_TIMESTAMP(CONCAT(`date`, " ", REGEXP_REPLACE(`time`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss') AS sec_between_current_trans_and_next_hr,
  17. ,LEAD(UNIX_TIMESTAMP(CONCAT(`DATE`, " ", REGEXP_REPLACE(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss'), 1) OVER (PARTITION BY id ORDER BY `time`) - UNIX_TIMESTAMP(CONCAT(`DATE`, " ", REGEXP_REPLACE(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss') AS overlap_time_to_next_trans
  18. FROM database.tablename
  19. WHERE dt = "${rundate}" AND ID <> 0
  20. ),
  21. o as
  22. (
  23. SELECT
  24. *,
  25. overlap_time_to_next_trans - sec_between_current_trans_and_next_hr as hour_overlap_add
  26. FROM j
  27. ),
  28. t1 as
  29. (
  30. SELECT
  31. *
  32. ,CASE
  33. WHEN difference_hour > 1
  34. THEN NULL
  35. WHEN difference_hour = 1
  36. THEN sec_between_current_trans_and_next_hr
  37. WHEN s_id_change_lag = TRUE AND new_hour_diff=1 AND difference_hour = 0
  38. THEN overlap_time_to_next_trans + LAG(hour_overlap_add) OVER (PARTITION BY id ORDER BY `time`)
  39. ELSE LEAD(unix_timestamp(CONCAT(`date`, " ", regexp_replace(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss'), 1) OVER (PARTITION BY id, hour ORDER BY `time`) - unix_timestamp(CONCAT(`date`, " ", regexp_replace(`TIME`, '\\.\\d+', '')), 'dd/MM/yyyy HH:mm:ss')
  40. END AS time_to_next_trans
  41. FROM o
  42. )
  43. select * from t1;

希望这对你有帮助。

展开查看全部

相关问题