Hive bucketing is generating more files than expected. Why?

Posted by 3lxsmp7m on 2021-05-29 in Hadoop


I have a partitioned and clustered Hive table (using Hive 1.2):

hive> describe formatted myClusteredTable;

# col_name              data_type

utc_timestamp           timestamp
...
clusteredId             bigint

# Partition Information

# col_name              data_type

datePartition           string

# Detailed Table Information

Num Buckets:            100
Bucket Columns:         [clusteredId]
Sort Columns:           [Order(col:clusteredId, order:1), Order(col:utc_timestamp, order:1)]
Storage Desc Params:
    serialization.format    1

I insert data into it like this:

SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
INSERT OVERWRITE TABLE myClusteredTable  PARTITION (datePartition)
SELECT   ...
 utcTimestamp,
 clusteredId,
 datePartition
FROM (
  ... subquery ...
  ) subquery
SORT BY datePartition, clusteredId, utcTimestamp;

I expected it to generate 100 files per partition. Instead, it generates many more:

$ find /path/to/partition/dt=2017-01-01 -type f | wc -l
1425
$ find /path/to/partition/dt=2017-01-02 -type f | wc -l
1419
$ find /path/to/partition/dt=2017-01-03 -type f | wc -l
1427
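
To narrow this down, it may help to see how rows are spread across those files. The query below is a minimal sketch, assuming Hive's `INPUT__FILE__NAME` virtual column and the built-in `hash` and `pmod` functions are available in this version. The expression `pmod(hash(clusteredId), 100)` is only an approximation of the bucket Hive assigns, not necessarily its exact formula, and the partition value `'2017-01-01'` is assumed from the directory listing above.

```sql
-- Per-file row counts for one partition, plus the number of distinct
-- approximate bucket ids found in each file.
-- Assumption: pmod(hash(clusteredId), 100) approximates the bucket id.
SELECT
  INPUT__FILE__NAME                            AS file_name,
  COUNT(*)                                     AS row_count,
  COUNT(DISTINCT pmod(hash(clusteredId), 100)) AS approx_buckets_in_file
FROM myClusteredTable
WHERE datePartition = '2017-01-01'
GROUP BY INPUT__FILE__NAME;
```

If every file turns out to hold rows from a single bucket id, that would suggest several files are being written per bucket rather than rows landing in the wrong bucket.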

Please help me understand why this happens and how I can avoid it.

hadoop Hive hiveql

Source: https://stackoverflow.com/questions/45556490/hive-bucketing-is-generating-more-files-than-expected-why