hive SQL:查询具有年/月/日分区的时间段的数据

6ovsh4lw  于 2023-10-18  发布在  Hive
关注(0)|答案(3)|浏览(206)

我有一个AWS S3数据湖的parquet文件结构如下:

  1. s3://bucket/device/table_x/year=2000/month=01/day=02/xyz.parquet

我的目标是使用AWS Athena来查询数据,以便在Grafana Jmeter 板中显示。我面临的挑战是,为了创建任何时间段的动态面板,同时也利用我的分区,我需要找到一种方法来将我的数据限制在我的WHERE部分中的相关时间段内-但要以一种跨年,月和日的方式来实现-而不必根据查询构建SQL语句。
我现在最好的建议是在查询下面,这应该可以工作-但它很复杂。是否有关于这种声明的建议最佳做法?

  1. SELECT
  2. Count(a1) as AVG_a1
  3. FROM
  4. tbl_11111111_a
  5. WHERE
  6. (
  7. -- Same year, same month
  8. (year = 'START_YEAR' AND month = 'START_MONTH' AND day BETWEEN 'START_DAY' AND 'END_DAY')
  9. OR
  10. -- Same year, different months
  11. (year = 'START_YEAR' AND month = 'START_MONTH' AND day >= 'START_DAY')
  12. OR
  13. (year = 'START_YEAR' AND month > 'START_MONTH' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
  14. OR
  15. (year = 'START_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
  16. OR
  17. -- Different years
  18. (year > 'START_YEAR' AND year < 'END_YEAR')
  19. OR
  20. (year = 'END_YEAR' AND month < 'END_MONTH' AND day BETWEEN '01' AND '31')
  21. OR
  22. (year = 'END_YEAR' AND month = 'END_MONTH' AND day <= 'END_DAY')
  23. )
  24. AND
  25. t BETWEEN TIMESTAMP 'START_YEAR-START_MONTH-START_DAY 00:00:00' AND TIMESTAMP 'END_YEAR-END_MONTH-END_DAY 00:00:00'
mv1qrgav

mv1qrgav1#

我最终使用了this AWS examplethis blog article中描述的方法。
具体来说,我按照最初列出的那样设置我的S3结构,除了我删除了hive符号(这可以保留,但我相信在这种情况下投影日期格式会变得更加混乱):

  1. s3://bucket/device/table_x/2000/01/02/xyz.parquet

有了这个,我设置我的表属性使用分区投影与以下设置:

  1. "projection.enabled" = "true",
  2. "projection.date_created.type" = "date",
  3. "projection.date_created.format" = "yyyy/MM/dd",
  4. "projection.date_created.range" = "2000/01/01,NOW",
  5. "projection.date_created.interval" = "1",
  6. "projection.date_created.interval.unit" = "DAYS",
  7. "storage.location.template" = "s3://bucket/device/table_x/${date_created}/"

在我的Table Schema中,我将JSON更新为以下内容:

  1. [
  2. {
  3. "Name": "t",
  4. "Type": "timestamp",
  5. "Comment": ""
  6. },
  7. {
  8. "Name": "a1",
  9. "Type": "double",
  10. "Comment": ""
  11. },
  12. {
  13. "Name": "a2",
  14. "Type": "double",
  15. "Comment": ""
  16. },
  17. {
  18. "Name": "date_created",
  19. "Type": "string",
  20. "Comment": "",
  21. "PartitionKey": "Partition (0)"
  22. }
  23. ]

在进行了这些更新之后,我现在可以使用pilcrow提出的更简单的分区结构来查询数据:

  1. SELECT
  2. Count(a1) as AVG_a1
  3. FROM
  4. tbl_11111111_a
  5. WHERE
  6. date_created BETWEEN '2000/01/01' AND '2000/01/02'

这里的重点是,我能够保留一个子文件夹嵌套的日期结构(我更喜欢用于非Athena数据库的目的-但仍然保留简单的SQL查询WHERE语句用于我的Athena查询,同时使用有效的分区投影。
还请注意,我相信pilcrow在分析我原来的年/月/日SQL查询效率低下方面是正确的,因为我发现当切换到与单日5分钟间隔相关的查询的新结构时,速度提高了66%。

展开查看全部
iszxjhcz

iszxjhcz2#

我有两个选择。
首先,理想情况下,repartition使用单个“YYYYMMDD”属性分区,而不是嵌套。对于更自然的查询,这是一个很好的实践,例如,“... WHERE ymd_partition BETWEEN 'START_YMD' AND 'END_YMD'“,没有所有那些布尔扭曲。
否则,继续使用当前的方法,它有一个可行的想法,但为了效率而收紧它。
我们希望根据输入范围应用三种情景(同一年/月、同一年、不同年)中的一种且仅一种;然而,正如所写,我们根据所有记录来评估每个场景的每个条件,而不管输入范围如何。(例如,[2023-01-31, 2023-02-1]的输入范围将检查 * 从1月起的每个parquet文件 *,因为倒数第二个OR条件,因此意味着我们依赖t BETWEEN...条件作为昂贵的支持。
解决方案不会是漂亮的。

  1. WHERE
  2. -- apply only when input range has same year, same month
  3. --
  4. (start_yr = end_yr AND start_mo = end_mo
  5. AND "year" = start_yr
  6. AND "month" = start_mo
  7. AND "day" BETWEEN start_day AND end_day)
  8. OR
  9. -- apply only when input range same year, different months
  10. --
  11. (start_yr = end_yr AND start_mo != end_mo
  12. AND "year" = start_yr
  13. AND ( ("month" = start_mo AND "day" >= start_day)
  14. OR
  15. -- "day BETWEEN 1 AND 31" is superfluous here
  16. ("month" > start_mo AND "month" < end_mo)
  17. OR
  18. ("month" = end_mo AND "day" <= end_day) ) )
  19. OR
  20. -- apply only when input range has different years
  21. --
  22. (start_yr != end_yr
  23. AND
  24. ( ("year" = start_yr AND ( ("month" = start_mo AND "day" >= start_day)
  25. OR
  26. ("month" > start_mo) ) )
  27. OR
  28. ("year" > start_yr AND "year" < end_yr)
  29. OR
  30. ("year" = end_yr AND ( ("month" < end_mo )
  31. OR
  32. ("month" = end_mo AND "day" <= end_day) ) ) ) )

我可能漏掉了一个括号。
如果需要,可以重新添加t BETWEEN逻辑。

展开查看全部
1mrurvl1

1mrurvl13#

我现在最好的建议是在查询下面,这应该可以工作-但它很复杂。是否有关于这种声明的建议最佳做法?
我认为您还可以使用CASE表达式将数据过滤到START_YEARSTART_MONTHSTART_DAYEND_YEAREND_MONTHEND_DAY参数指定的时间段。

  1. SELECT
  2. Count(a1) as AVG_a1
  3. FROM
  4. tbl_11111111_a
  5. WHERE
  6. CASE
  7. WHEN year = START_YEAR AND month = START_MONTH AND day BETWEEN START_DAY AND END_DAY THEN 1
  8. WHEN year = START_YEAR AND month = START_MONTH AND day >= START_DAY THEN 1
  9. WHEN year = START_YEAR AND month > START_MONTH AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
  10. WHEN year = START_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
  11. WHEN year > START_YEAR AND year < END_YEAR THEN 1
  12. WHEN year = END_YEAR AND month < END_MONTH AND day BETWEEN '01' AND '31' THEN 1
  13. WHEN year = END_YEAR AND month = END_MONTH AND day <= END_DAY THEN 1
  14. ELSE 0
  15. END = 1
  16. AND
  17. t BETWEEN TIMESTAMP CONCAT(START_YEAR, '-', START_MONTH, '-', START_DAY, ' 00:00:00') AND TIMESTAMP CONCAT(END_YEAR, '-', END_MONTH, '-', END_DAY, ' 00:00:00')
展开查看全部

相关问题