Is there a way to check which Hive external tables were created more than 90 days ago and drop those tables together with their underlying HDFS data? Can this be done in a Unix shell script?
1 Answer

yhuiod9q1#
If the Hive table path is /path/your_hive_table_path/, the directory listing looks like this:
hadoop --cluster your-hadoop-cluster fs -ls /path/your_hive_table_path/
drwxrwxrwx+ - h_mifi supergroup 0 2019-01-24 10:33 /path/your_hive_table_path//mifidw_car_insurance_expire_month_data
drwxrwxrwx+ - h_mifi supergroup 0 2019-01-24 10:39 /path/your_hive_table_path//mifidw_car_owner
drwxr-xr-x+ - h_mifi supergroup 0 2019-05-30 03:01 /path/your_hive_table_path//push_credit_card_mine_result_new
drwxr-xr-x+ - h_mifi supergroup 0 2019-05-30 03:41 /path/your_hive_table_path//push_live_payment_bill_mine_result_new
We can get the latest update date of each table directory like this:
hadoop --cluster your-hadoop-cluster fs -ls /path/your_hive_table_path/ | awk -F'[ ]+' '{print $6}'
2019-01-24
2019-01-24
2019-05-30
2019-05-30
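Before the full script, here is a minimal sketch of the age check it is built on (assuming GNU date, as on most Linux systems; the variable names are illustrative only): convert the date string from column 6 into a Unix timestamp and divide its difference from now by 86400 seconds to get whole days.

#!/bin/bash
# One of the dates printed by `hadoop fs -ls` (column 6)
date_str="2019-01-24"

# Seconds since the epoch for today and for the table's last update (GNU date -d)
today_stamp=$(date +%s)
table_stamp=$(date -d "$date_str" +%s)

# Whole days between the two timestamps (86400 seconds per day)
days_diff=$(( (today_stamp - table_stamp) / 86400 ))
echo "$date_str is $days_diff days old"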
We need a loop that checks whether each table is more than 90 days old and, if so, performs the remove and drop operations. (Since fs -ls reports the directory's last modification time, the check is effectively "not updated in the last 90 days" rather than "created more than 90 days ago".) Below is the complete shell script; I have tested it and it works well. Hope it helps.
hadoop --cluster your-hadoop-cluster fs -ls /path/your_hive_table_path/ | grep '/path/your_hive_table_path/' | while read line
do
    # Get the last update date of the table directory (column 6 of the ls output)
    date_str=`echo $line | awk -F'[ ]+' '{print $6}'`
    # Get the HDFS path of the table directory (column 8 of the ls output)
    table_path=`echo $line | awk -F'[ ]+' '{print $8}'`
    # Get the table name from the path (the field index depends on the depth of your path)
    table_name=`echo $table_path | awk -F'/' '{print $7}'`
    today_date_stamp=`date +%s`
    table_date_stamp=`date -d $date_str +%s`
    stamp_diff=`expr $today_date_stamp - $table_date_stamp`
    # Convert the difference in seconds to days
    days_diff=`expr $stamp_diff / 86400`
    # If the table has not been updated for more than 90 days, remove the data and drop the table
    if [ $days_diff -gt 90 ]; then
        # Remove the HDFS directory (-r because the table path is a directory)
        hadoop --cluster your-hadoop-cluster fs -rm -r $table_path
        # Drop the Hive table
        hive -e "drop table $table_name"
    fi
done
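As a usage sketch (the file name, cron schedule, and log path below are only examples, not part of the original answer), the loop can be saved as a script and scheduled so the cleanup runs automatically:

# Save the loop above as clean_old_hive_tables.sh, then:
chmod +x clean_old_hive_tables.sh
./clean_old_hive_tables.sh

# Example crontab entry (crontab -e) to run it daily at 02:00
0 2 * * * /path/to/clean_old_hive_tables.sh >> /var/log/clean_old_hive_tables.log 2>&1

One design note: for Hive external tables, DROP TABLE only removes the metadata, which is why the script deletes the HDFS directory explicitly before dropping the table.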