I'm trying to run a Hive query that joins 5 other tables. One of those tables is very large (15 billion records), but because of the join clause I actually only need about 8 million records from it.
I keep seeing this in Cloudera's log console...
INFO : 2015-09-10 09:51:43,209 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 437512.26 sec
I've read about table skew, where one reducer ends up being a bottleneck, but I don't know how to check whether a table is skewed, if that's what's happening here. Is that the problem?
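One way to check for skew is to look at the row count per join key in the big table. This is a sketch: the alias c, the join column event_type_id, and the partition filter are taken from the explain plan below; adjust them to the real table name.

```sql
-- Count rows per join key in the large table; if a handful of keys
-- hold most of the ~15B rows, the join is skewed on that column and
-- one reducer receives almost all the work.
SELECT event_type_id, COUNT(*) AS cnt
FROM c
WHERE loaddate > 20150401   -- same filter the query already applies
GROUP BY event_type_id
ORDER BY cnt DESC
LIMIT 20;
```

A heavily lopsided distribution in this output (one key with billions of rows, the rest small) is the classic signature of the stuck-at-99% reducer.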
Edit:
Here is the explain plan for the query; table c is the large table...
STAGE DEPENDENCIES:
  Stage-8 is a root stage , consists of Stage-1
  Stage-1
  Stage-9 depends on stages: Stage-1
  Stage-6 depends on stages: Stage-9
  Stage-0 depends on stages: Stage-6

STAGE PLANS:
  Stage: Stage-8
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 8939332 Data size: 53635992 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: event_type_id (type: int)
              sort order: +
              Map-reduce partition columns: event_type_id (type: int)
              Statistics: Num rows: 8939332 Data size: 53635992 Basic stats: COMPLETE Column stats: NONE
              value expressions: marketing_contact_id (type: int), ent_cust_id (type: string), campaign_master_id (type: string), event_id (type: string), event_timestamp (type: string)
          TableScan
            alias: b
            Statistics: Num rows: 6 Data size: 2750 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: event_type_id (type: int)
              sort order: +
              Map-reduce partition columns: event_type_id (type: int)
              Statistics: Num rows: 6 Data size: 2750 Basic stats: COMPLETE Column stats: NONE
              value expressions: event_type (type: string), event_type_category (type: string), event_type_subcategory (type: string), description (type: string)
          TableScan
            alias: c
            filterExpr: (loaddate > 20150401) (type: boolean)
            Statistics: Num rows: 4415906479 Data size: 185130335983 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: event_type_id (type: int)
              sort order: +
              Map-reduce partition columns: event_type_id (type: int)
              Statistics: Num rows: 4415906479 Data size: 185130335983 Basic stats: COMPLETE Column stats: NONE
              value expressions: display_event_id (type: string), category (type: string), time (type: string), user_id (type: string), ip (type: string), buy_id (type: double), ad_id (type: string), creative_id (type: string), creative_version (type: double), creative_size_id (type: string), site_id (type: string), page_id (type: string), keyword (type: string), country_id (type: double), state_province (type: string), browser_id (type: double), browser_version (type: double), os_id (type: double), local_user_id (type: string), sv1 (type: string), browser_type (type: string), country (type: string), os_type (type: string), state_province_name (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
               Inner Join 0 to 2
          keys:
            0 event_type_id (type: int)
            1 event_type_id (type: int)
            2 event_type_id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col11, _col12, _col13, _col14, _col22, _col23, _col24, _col25, _col26, _col27, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col41, _col42, _col43, _col48, _col58, _col59, _col60, _col61, _col62
          Statistics: Num rows: 9714994464 Data size: 407286747990 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-9
    Map Reduce Local Work
      Alias -> Map Local Tables:
        e
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        e
          TableScan
            alias: e
            Statistics: Num rows: 7080 Data size: 3695175 Basic stats: PARTIAL Column stats: NONE
            HashTable Sink Operator
              keys:
                0 _col30 (type: string)
                1 ad_id (type: string)

  Stage: Stage-6
    Map Reduce
      Map Operator Tree:
          TableScan
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 _col30 (type: string)
                1 ad_id (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col11, _col12, _col13, _col14, _col22, _col23, _col24, _col25, _col26, _col27, _col29, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col41, _col42, _col43, _col48, _col58, _col59, _col60, _col61, _col62, _col67, _col68, _col69, _col70, _col72, _col73, _col74
              Statistics: Num rows: 10686494142 Data size: 448015432499 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: int), _col11 (type: string), _col12 (type: string), _col13 (type: string), _col14 (type: string), _col41 (type: double), _col59 (type: string), _col42 (type: double), _col29 (type: double), _col23 (type: string), _col60 (type: string), _col37 (type: double), _col31 (type: string), _col33 (type: string), _col32 (type: double), _col22 (type: string), _col24 (type: int), _col27 (type: string), _col36 (type: string), _col48 (type: string), _col43 (type: double), _col61 (type: string), _col35 (type: string), _col34 (type: string), _col38 (type: string), _col62 (type: string), _col58 (type: string), _col25 (type: string), _col26 (type: string), _col67 (type: int), _col68 (type: int), _col69 (type: string), _col70 (type: string), _col72 (type: string), _col73 (type: string), _col74 (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37, _col38, _col39, _col40
                Statistics: Num rows: 10686494142 Data size: 448015432499 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 10686494142 Data size: 448015432499 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
1 Answer
If you have a skewed table, try the following property: hive.optimize.skewjoin=true
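A minimal sketch of enabling it for the session; the threshold value is illustrative (it is Hive's documented default), not part of the original answer:

```sql
-- Enable Hive's runtime skew-join optimization: rows for skewed keys
-- are written aside and joined in a follow-up map join instead of
-- all landing on one reducer.
SET hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed
-- (default 100000); tune to your data.
SET hive.skewjoin.key=100000;
```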
Also, could you paste the job logs? That would make the problem clearer.