Mapping an HDFS directory with .tsv files to Hive

Posted by 9q78igpj on 2021-06-02 in Hadoop

I have data in .tsv format loaded into HDFS, and I need to load it into a Hive table. I need some help.
The data in HDFS looks like this:

/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03

Note: data is loaded into the HDFS directory /ad_data/raw/reg_logs every day and every hour.
There are 3 .tsv files in this HDFS directory:

funel1.tsv
funel2.tsv
funel3.tsv

Each .tsv file has 3 tab-separated columns, with data like this:

2344    -39 223
2344    -23 443
2394    -43 98
2377    -12 33
...
...

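I want to create a Hive schema with 3 columns, id int, region_code int, and count int, exactly matching what is in HDFS. If possible, I would like to drop the negative sign in the Hive table, but that is not a big deal.
I created a Hive table with this schema (please correct me if I am wrong):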

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING, utc_hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/raw/reg_logs';

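I just want to get the data from HDFS into Hive. I do not want to use "LOAD DATA INPATH '..' INTO TABLE reg_logs", because I do not want to load the data manually every day. I just want to point the Hive table at the HDFS directory so that it picks up each day's data automatically.
How can I do that? Please correct my Hive table schema if needed, and suggest a way to get the data in.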

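Second part:
I want to create another table, reg_logs_org, that will be populated from reg_logs. I need to bring everything from reg_logs into reg_logs_org except the hour column.
The schema I created is: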

CREATE EXTERNAL TABLE IF NOT EXISTS reg_logs_org (
id int,
region_code int,
count int
)
PARTITIONED BY (utc_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/ad_data/reg_logs_org';

Inserting data into reg_logs_org from reg_logs:

insert overwrite table reg_logs_org
select id, region_code, sum(count), utc_date
from 
reg_logs
group by 
utc_date, id, region_code

Error message:

FAILED: SemanticException 1:23 Need to specify partition columns because the destination table is partitioned. Error encountered near token 'reg_logs_org'

==

Thank you,
Rio

hadoop Hive hdfs

Source: https://stackoverflow.com/questions/24193575/mapping-hdfs-directory-with-tsv-files-to-hive

1 answer

o4hqfura1#

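You are very close. The last step is to add the partition information to Hive's metastore. Hive stores the location of each partition individually, and it does not discover new partitions automatically. There are two ways to add partitions:
Run an add partition statement every hour: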

alter table reg_logs add partition(utc_date='2014-06-11', utc_hour='03')
location '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03';

Run a table repair every hour (or less often). This scans the root table location for partitions that have not been added yet.

msck repair table reg_logs;

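The first approach is a bit more painful, but it is more efficient. The second approach is easy, but it does a full scan of all partitions every time.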
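As a minimal sketch of the first approach, assuming some hourly job issues the statement: IF NOT EXISTS makes the statement safe to re-run, and several partitions can be registered in one go. The utc_hour='04' partition and its path are hypothetical and do not come from the question.

```sql
-- Register two hourly partitions in one statement; IF NOT EXISTS makes re-runs harmless.
-- The utc_hour='04' partition and its path are hypothetical, for illustration only.
ALTER TABLE reg_logs ADD IF NOT EXISTS
  PARTITION (utc_date='2014-06-11', utc_hour='03')
    LOCATION '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=03'
  PARTITION (utc_date='2014-06-11', utc_hour='04')
    LOCATION '/ad_data/raw/reg_logs/utc_date=2014-06-11/utc_hour=04';

-- Check which partitions the metastore now knows about.
SHOW PARTITIONS reg_logs;

-- Query a single partition to confirm the files are readable.
SELECT id, region_code, `count`
FROM reg_logs
WHERE utc_date = '2014-06-11' AND utc_hour = '03'
LIMIT 10;
```

SHOW PARTITIONS is also a quick way to confirm that msck repair table picked up the directories you expected.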
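Edit: for the second half of the question:
You just need to add a bit of syntax so that the insert into the table uses dynamic partitioning. In general, it is: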

insert overwrite table [table] partition([partition column])
select ...

Or, in your case:

insert overwrite table reg_logs_org partition(utc_date)
select id, region_code, sum(count), utc_date
from 
reg_logs
group by 
utc_date, id, region_code
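Depending on the Hive version and its configuration, an insert in which every partition column is dynamic may be rejected until dynamic partitioning is enabled and strict mode is relaxed. A sketch of the full session, assuming neither setting is already in place on your cluster:

```sql
-- Allow dynamic partitions, and allow all partition columns to be dynamic.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column (utc_date) must come last in the SELECT list.
-- Backticks around count avoid any clash with the COUNT function name.
INSERT OVERWRITE TABLE reg_logs_org PARTITION (utc_date)
SELECT id, region_code, SUM(`count`) AS `count`, utc_date
FROM reg_logs
GROUP BY utc_date, id, region_code;
```

With dynamic partitions, INSERT OVERWRITE replaces only the utc_date partitions that the query actually produces. To drop the negative sign mentioned in the question, one option is to use ABS(region_code) in both the SELECT list and the GROUP BY clause.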
