Writing into subdirectories in a Hive partition

kuuvgm7e  posted 2021-06-26 in Hive

Problem statement
I have files like the following, with the schema event_time, ad_id:

  1. file_20170102 - may contain records with event_time of 20170101, 20170102, 20170103
  2. file_20170103 - may contain records with event_time of 20170102, 20170103, 20170104

Here event_time is when the event actually occurred, while the timestamp in the filename is when the events were collected, so the timestamp in the filename and the event times inside the file are out of sync.
When I write this data into Hive, I definitely need to partition it by event_time, because users are interested in querying by event time.
So my output looks like this:

  1. /path/to/output/event_time=20170102/....parquet
  2. /path/to/output/event_time=20170103/....parquet

However, I also need to be able to track the files' timestamps, because a file is sometimes republished and we want to delete the already-processed data based on the file's timestamp.
Is there a way to write this as /path/to/output/event_time=20170101/20170102 (file_timestamp)?
Note that the 20170102 (file_timestamp) above is a directory, not a Hive partition.
Alternatively, could I control the names of the Parquet files, so that when I want to delete by filename it is easy to figure out which files to remove?

tzcvj98z  1#

Demo

Files under /home/dmarkovitz/myfiles:

myfile_1_20161204.csv

  20161204,1
  20161203,2

myfile_2_20161205.csv

  20161203,3
  20161204,4
  20161205,5
  20161203,6

myfile_3_20161205.csv

  20161205,7
  20161205,8
  20161203,9

Hive

  create external table myfiles
  (
      Event_Time  string
     ,AD_id       int
  )
  row format delimited
  fields terminated by ','
  stored as textfile
  location 'file:///home/dmarkovitz/myfiles'
  ;
  select  *
         ,input__file__name
  from    myfiles
  ;
  +------------+-------+-----------------------------------------------------+
  | event_time | ad_id | input__file__name                                   |
  +------------+-------+-----------------------------------------------------+
  | 20161204   |     1 | file:/home/dmarkovitz/myfiles/myfile_1_20161204.csv |
  | 20161203   |     2 | file:/home/dmarkovitz/myfiles/myfile_1_20161204.csv |
  | 20161205   |     7 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
  | 20161205   |     8 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
  | 20161203   |     9 | file:/home/dmarkovitz/myfiles/myfile_3_20161205.csv |
  | 20161203   |     3 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
  | 20161204   |     4 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
  | 20161205   |     5 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
  | 20161203   |     6 | file:/home/dmarkovitz/myfiles/myfile_2_20161205.csv |
  +------------+-------+-----------------------------------------------------+
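input__file__name is one of Hive's virtual columns, so the source file of each row can also be filtered on directly. A minimal sketch (the LIKE pattern here is only illustrative):

  select  *
  from    myfiles
  where   input__file__name like '%myfile_2%'
  ;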
  create table mytable
  (
      AD_id int
  )
  partitioned by (file_dt date, Event_Time date)
  stored as parquet
  ;
  set hive.exec.dynamic.partition.mode=nonstrict;
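Depending on the Hive version, dynamic partitioning itself may also need to be enabled explicitly (hive.exec.dynamic.partition defaults to true only from Hive 0.9.0 onward):

  set hive.exec.dynamic.partition=true;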
  insert into mytable partition (file_dt, Event_Time)
  select  ad_id
         ,from_unixtime(unix_timestamp(split(input__file__name,'[_.]')[2],'yyyyMMdd'),'yyyy-MM-dd')
         ,from_unixtime(unix_timestamp(Event_Time,'yyyyMMdd'),'yyyy-MM-dd')
  from    myfiles
  ;
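The split(input__file__name,'[_.]')[2] expression extracts the file timestamp from the filename: the full path is tokenized on every underscore and dot, and the third token is the yyyyMMdd date. A quick sanity check against one of the demo paths (Hive 0.13+ allows a FROM-less select):

  -- tokens: file:/home/dmarkovitz/myfiles/myfile | 1 | 20161204 | csv
  select split('file:/home/dmarkovitz/myfiles/myfile_1_20161204.csv','[_.]')[2];
  -- returns 20161204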
  show partitions mytable
  ;
  +------------------------------------------+
  | partition                                |
  +------------------------------------------+
  | file_dt=2016-12-04/event_time=2016-12-03 |
  | file_dt=2016-12-04/event_time=2016-12-04 |
  | file_dt=2016-12-05/event_time=2016-12-03 |
  | file_dt=2016-12-05/event_time=2016-12-04 |
  | file_dt=2016-12-05/event_time=2016-12-05 |
  +------------------------------------------+
  select  *
         ,input__file__name
  from    mytable
  ;
  +-------+------------+------------+-----------------------------------------------------------------------+
  | ad_id | file_dt    | event_time | input__file__name                                                     |
  +-------+------------+------------+-----------------------------------------------------------------------+
  |     2 | 2016-12-04 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-04/event_time=2016-12-03/000000_0 |
  |     1 | 2016-12-04 | 2016-12-04 | file:/mydb/mytable/file_dt=2016-12-04/event_time=2016-12-04/000000_0 |
  |     9 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
  |     3 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
  |     6 | 2016-12-05 | 2016-12-03 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-03/000000_0 |
  |     4 | 2016-12-05 | 2016-12-04 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-04/000000_0 |
  |     7 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
  |     8 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
  |     5 | 2016-12-05 | 2016-12-05 | file:/mydb/mytable/file_dt=2016-12-05/event_time=2016-12-05/000000_0 |
  +-------+------------+------------+-----------------------------------------------------------------------+
  explain dependency
  select  *
  from    mytable
  where   Event_Time = date '2016-12-04'
  ;

  {"input_tables":[{"tablename":"local_db@mytable","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"local_db@mytable@file_dt=2016-12-04/event_time=2016-12-04"},{"partitionName":"local_db@mytable@file_dt=2016-12-05/event_time=2016-12-04"}]}
Bash

  tree mytable
  mytable
  ├── file_dt=2016-12-04
  │   ├── event_time=2016-12-03
  │   │   └── 000000_0
  │   └── event_time=2016-12-04
  │       └── 000000_0
  └── file_dt=2016-12-05
      ├── event_time=2016-12-03
      │   └── 000000_0
      ├── event_time=2016-12-04
      │   └── 000000_0
      └── event_time=2016-12-05
          └── 000000_0
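Since file_dt is the leading partition column, removing everything that was loaded from a republished file (the original deletion requirement) reduces to dropping by a partial partition specification; a minimal sketch, assuming the file stamped 2016-12-04 was republished:

  -- drops every event_time sub-partition under file_dt=2016-12-04
  alter table mytable drop partition (file_dt=date '2016-12-04');

The republished file can then be reloaded with the same insert statement as above.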
