我们在hive2中有一个表,hadoop2.7用zlib存储为orc。
一个字段只有25个值,因此我们尝试使用storageindexes、clusterization和bloomfilters来提高性能,但失败了。
您能否帮助我们理解,为什么查询非结构化和未排序的表比按聚类字段筛选查询聚类表更快。
举个例子:
基表
create table sandbox_mr.profile2099_hdfs ( userid bigint, field_id int, field_boolean_value boolean, field_string_value string, field_int_value int, valid_from date )
STORED AS ORC
LOCATION '/user/dt=2099-12-31'
两类改进:
CREATE TABLE `mr_metrics.ProfileFieldsShort2099_bloom_full_64buc`(
`userid` bigint,
`field_id` int,
`field_boolean_value` boolean,
`field_string_value` string,
`field_int_value` int,
`valid_from` date)
CLUSTERED BY (
field_id)
SORTED BY (
userid ASC)
INTO 64 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://datalab-hadoop-dwh-stable/user/hive/warehouse/mr_metrics.db/profilefieldsshort2099_bloom_full_64buc'
TBLPROPERTIES (
'orc.bloom.filter.columns'='field_id',
'orc.bloom.filter.fpp'='0.05',
'orc.compress'='ZLIB',
'orc.create.index'='true',
'orc.row.index.stride'='10000',
'orc.stripe.size'='268435456',
'transient_lastDdlTime'='1583947386')
CREATE TABLE `mr_metrics.ProfileFieldsShort2099_cluster`(
`userid` bigint,
`field_id` int,
`field_boolean_value` boolean,
`field_string_value` string,
`field_int_value` int,
`valid_from` date)
CLUSTERED BY (
field_id)
SORTED BY (
userid ASC)
INTO 64 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://datalab-hadoop-dwh-stable/user/hive/warehouse/mr_metrics.db/profilefieldsshort2099_cluster'
TBLPROPERTIES (
'orc.compress'='ZLIB',
'orc.create.index'='true',
'orc.row.index.stride'='10000',
'orc.stripe.size'='268435456',
'transient_lastDdlTime'='1583946796')
测试查询:
select count(*) from ... where field_id = 50
结果配置文件short2099\u bloom\u full\u 64buc 209.29 s配置文件short2099\u cluster 183.24 s配置文件2099\u hdfs 31.00 s
执行引擎=tez
或此查询:
select count(*), count(distinct userid) from ... where field_id = 50
结果:profilefieldsshort2099\u bloom\u full\u 64buc 4485.022 s profilefieldsshort2099\u cluster 3084.809 s profile2099\u hdfs 1960.737 s
暂无答案!
目前还没有任何答案,快来回答吧!