I have files in S3 that contain many rows of JSON (delimited by newlines). I want to convert these files to a columnar format for use with AWS Athena.

To do this I followed the Converting to Columnar Formats guide, but when converting to ORC the partition convention in S3 is lost.

In this example, how do you keep the `dt` partition in the S3 folder structure when converting to Parquet? When I run the example it only outputs `s3://myBucket/pq/000000_0` and not `s3://myBucket/pq/dt=2009-04-14-04-05/000000_0`.

Here is the HQL that sets up the interface for bringing the JSON into a Hive table:
```
CREATE EXTERNAL TABLE impressions (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string,
    number string,
    processId string,
    browserCookie string,
    requestEndTime string,
    timers struct<modelLookup:string, requestTime:string>,
    threadId string,
    hostname string,
    sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

msck repair table impressions;
```
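After the repair, you can optionally confirm that the `dt` partitions were discovered from the S3 layout, for example:

```
-- sanity check: list the partitions that MSCK REPAIR registered
SHOW PARTITIONS impressions;
```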
And here is the HQL that converts to Parquet:
```
CREATE EXTERNAL TABLE parquet_hive (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string)
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';

INSERT OVERWRITE TABLE parquet_hive
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions WHERE dt='2009-04-14-04-05';
```
2 Answers

Answer 1 (jv4diomz):
First, add `PARTITIONED BY (dt string)` to the `parquet_hive` definition. Second, if you want to insert the data partition by partition, you must declare the partition you are inserting into. Note the `PARTITION (dt='2009-04-14-04-05')` clause.
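For reference, a minimal sketch of the revised table definition (assuming the same columns and location as in the question):

```
CREATE EXTERNAL TABLE parquet_hive (
    requestBeginTime string,
    adId string,
    impressionId string,
    referrer string,
    userAgent string,
    userCookie string,
    ip string)
PARTITIONED BY (dt string)   -- the partition column goes here, not in the column list
STORED AS PARQUET
LOCATION 's3://mybucket/pq/';
```

Inserting into a declared (static) partition then looks like this: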
```
INSERT OVERWRITE TABLE parquet_hive PARTITION (dt='2009-04-14-04-05')
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip
FROM impressions WHERE dt='2009-04-14-04-05';
```

Alternatively, you can let Hive derive the partition value with dynamic partitioning; the partition column is then selected as the last column:

```
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE parquet_hive PARTITION (dt)
SELECT requestbegintime, adid, impressionid, referrer, useragent, usercookie, ip, dt
FROM impressions WHERE dt='2009-04-14-04-05';
```
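Side note (an assumption beyond the original answer): the INSERT above registers the new partition in the Hive metastore, but if the same table is also defined in Athena's own catalog, the new `dt=` directories may need to be registered there as well, for example:

```
-- run in Athena if its catalog does not yet see the new partitions
MSCK REPAIR TABLE parquet_hive;
```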
Answer 2 (lqfhib0f):
You can simply include the same `PARTITIONED BY (dt string)` parameter, and it will create the same directory structure. In this case, the `dt` field (presumably a date) is actually stored in the directory name; a separate directory is created for each value.
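For illustration, the resulting S3 layout would look something like this (the second partition value is hypothetical):

```
s3://mybucket/pq/dt=2009-04-14-04-05/000000_0
s3://mybucket/pq/dt=2009-04-14-04-06/000000_0
```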