The problem is a join on a partition key where one partition has more than 2^31 rows.
(The output in this post is from MapR's distribution, but this has also been reproduced on Apache Hadoop/Hive.)
Versions: hadoop-0.20.2, hive-0.10.0
When a partition has 2147483648 or more rows (even 2147483649), the output of the join is one row.
When the partition has fewer than 2147483648 rows (even 2147483647), the output is correct.
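The boundary at 2147483648 (= 2^31) suggests a 32-bit signed counter wrapping somewhere in the join path. That is an assumption on my part, not something confirmed from the Hive source, but the wraparound itself is easy to demonstrate in plain Java (Hive runs on the JVM):

```java
// Demonstrates 32-bit signed wraparound at the 2^31 boundary seen in the bug.
// This only illustrates the suspected cause; it is not Hive's actual code.
public class IntOverflowDemo {
    public static void main(String[] args) {
        long rows = 2147483649L;          // 2^31 + 1, the failing partition size
        int narrowed = (int) rows;        // narrowing keeps only the low 32 bits
        System.out.println(narrowed);     // prints -2147483647

        int counter = Integer.MAX_VALUE;  // 2147483647, the last size that works
        counter++;                        // wraps around
        System.out.println(counter);      // prints -2147483648
    }
}
```

Any int row counter or index in the join operator would go negative at exactly this row count, which matches the observed cutoff between 2147483647 (correct output) and 2147483649 (one row).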
Test case:
Create a table with 2147483649 rows in a partition whose key value is "1".
Join this table to another table that has one row, one column, with the value "1" for the partition key.
Then delete 2 rows and run the same join.
First run: only one row is produced.
Second run: 2147483647 rows.
create table max_sint_rows (s1 string)
partitioned by (p1 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n';
create table small_table (p1 string)
ROW FORMAT DELIMITED
LINES TERMINATED BY '\n';
alter table max_sint_rows add partition (p1="1");
Write 2147483649 random rows into max_sint_rows.
Write the value "1" into small_table.
create table output_rows_over as
select a.s1
from max_sint_rows a join small_table b
on (a.p1=b.p1);
In the reducer's syslog we get the following output:
INFO ExecReducer: ExecReducer: processing 2147000000 rows: used memory = 715266312
INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarding 1 rows
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_task_tmp.-ext-10001/_tmp.000004_1
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
INFO ExecReducer: ExecReducer: processed 2147483650 rows: used memory = 828336712
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 1 rows
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 1 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2359_r_000004_1 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2359_r_000004_1' done.
INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-
Note TABLE_ID_1_ROWCOUNT:1, and indeed the output table contains only one random row.
Now delete 2 rows from max_sint_rows and rerun:
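As a sanity check on the log numbers (plain arithmetic, not taken from the report): the reducer reports processing 2147483650 rows, which is the 2147483649 rows of the big partition plus the single small_table row, and the rerun's 2147483648 is the same sum after deleting 2 rows:

```java
// Checks the reducer's "processed N rows" figures against the input sizes.
public class RowCountCheck {
    public static void main(String[] args) {
        long bigPartition = 2147483649L;      // rows in max_sint_rows, p1="1"
        long smallTable = 1L;                 // the one row in small_table
        System.out.println(bigPartition + smallTable);        // prints 2147483650

        long afterDelete = bigPartition - 2;  // 2147483647 after deleting 2 rows
        System.out.println(afterDelete + smallTable);         // prints 2147483648
    }
}
```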
create table output_rows_under as
select a.s1
from max_sint_rows a join small_table b
on (a.p1=b.p1);
We get 2147483647 rows in output_rows_under, and the reducer's syslog shows:
INFO ExecReducer: ExecReducer: processed 2147483648 rows: used memory = 243494552
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 2147483647 rows
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 2147483647 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647
INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
INFO org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2360_r_000004_0 is done. And is in the process of commiting
INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2360_r_000004_0' done.
INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1