hive表,带有orc文件,并对性能进行了额外的优化

l0oc07j2  于 2021-05-27  发布在  Hadoop
关注(0)|答案(0)|浏览(239)

我们在hive2中有一个表,hadoop2.7用zlib存储为orc。
一个字段只有25个值,因此我们尝试使用storageindexes、clusterization和bloomfilters来提高性能,但失败了。
您能否帮助我们理解,为什么查询非结构化和未排序的表比按聚类字段筛选查询聚类表更快。
举个例子:
基表

  1. create table sandbox_mr.profile2099_hdfs ( userid bigint, field_id int, field_boolean_value boolean, field_string_value string, field_int_value int, valid_from date )
  2. STORED AS ORC
  3. LOCATION '/user/dt=2099-12-31'

两类改进:

  1. CREATE TABLE `mr_metrics.ProfileFieldsShort2099_bloom_full_64buc`(
  2. `userid` bigint,
  3. `field_id` int,
  4. `field_boolean_value` boolean,
  5. `field_string_value` string,
  6. `field_int_value` int,
  7. `valid_from` date)
  8. CLUSTERED BY (
  9. field_id)
  10. SORTED BY (
  11. userid ASC)
  12. INTO 64 BUCKETS
  13. ROW FORMAT SERDE
  14. 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  15. STORED AS INPUTFORMAT
  16. 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  17. OUTPUTFORMAT
  18. 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  19. LOCATION
  20. 'hdfs://datalab-hadoop-dwh-stable/user/hive/warehouse/mr_metrics.db/profilefieldsshort2099_bloom_full_64buc'
  21. TBLPROPERTIES (
  22. 'orc.bloom.filter.columns'='field_id',
  23. 'orc.bloom.filter.fpp'='0.05',
  24. 'orc.compress'='ZLIB',
  25. 'orc.create.index'='true',
  26. 'orc.row.index.stride'='10000',
  27. 'orc.stripe.size'='268435456',
  28. 'transient_lastDdlTime'='1583947386')
  29. CREATE TABLE `mr_metrics.ProfileFieldsShort2099_cluster`(
  30. `userid` bigint,
  31. `field_id` int,
  32. `field_boolean_value` boolean,
  33. `field_string_value` string,
  34. `field_int_value` int,
  35. `valid_from` date)
  36. CLUSTERED BY (
  37. field_id)
  38. SORTED BY (
  39. userid ASC)
  40. INTO 64 BUCKETS
  41. ROW FORMAT SERDE
  42. 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  43. STORED AS INPUTFORMAT
  44. 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  45. OUTPUTFORMAT
  46. 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
  47. LOCATION
  48. 'hdfs://datalab-hadoop-dwh-stable/user/hive/warehouse/mr_metrics.db/profilefieldsshort2099_cluster'
  49. TBLPROPERTIES (
  50. 'orc.compress'='ZLIB',
  51. 'orc.create.index'='true',
  52. 'orc.row.index.stride'='10000',
  53. 'orc.stripe.size'='268435456',
  54. 'transient_lastDdlTime'='1583946796')

测试查询:

  1. select count(*) from ... where field_id = 50

结果配置文件short2099\u bloom\u full\u 64buc 209.29 s配置文件short2099\u cluster 183.24 s配置文件2099\u hdfs 31.00 s
执行引擎=tez
或此查询:

  1. select count(*), count(distinct userid) from ... where field_id = 50

结果:profilefieldsshort2099\u bloom\u full\u 64buc 4485.022 s profilefieldsshort2099\u cluster 3084.809 s profile2099\u hdfs 1960.737 s

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题