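I ran into this problem when loading a CSV file through HQL code and running it on HDFS. After inserting the data, the partition part contains null values and some columns are dropped. I have tried many different ways to insert the data, but I still get strange symbols and missing columns, as if Hive cannot read the CSV file. Here is the code:`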
CREATE database covid_db;
use covid_db;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=500;
set hive.exec.max.dynamic.partitions.pernode=500;
CREATE TABLE IF NOT EXISTS covid_db.covid_staging
(
Country STRING,
Total_Cases DOUBLE,
New_Cases DOUBLE,
Total_Deaths DOUBLE,
New_Deaths DOUBLE,
Total_Recovered DOUBLE,
Active_Cases DOUBLE,
Serious DOUBLE,
Tot_Cases DOUBLE,
Deaths DOUBLE,
Total_Tests DOUBLE,
Tests DOUBLE,
CASES_per_Test DOUBLE,
Death_in_Closed_Cases STRING,
Rank_by_Testing_rate DOUBLE,
Rank_by_Death_rate DOUBLE,
Rank_by_Cases_rate DOUBLE,
Rank_by_Death_of_Closed_Cases DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_LZ'
tblproperties ("skip.header.line.count"="1", "serialization.null.format" = "''");
CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
Country STRING,
Total_Cases DOUBLE,
New_Cases DOUBLE,
Total_Deaths DOUBLE,
New_Deaths DOUBLE,
Total_Recovered DOUBLE,
Active_Cases DOUBLE,
Serious DOUBLE,
Tot_Cases DOUBLE,
Deaths DOUBLE,
Total_Tests DOUBLE,
Tests DOUBLE,
CASES_per_Test DOUBLE,
Death_in_Closed_Cases STRING,
Rank_by_Testing_rate DOUBLE,
Rank_by_Death_rate DOUBLE,
Rank_by_Cases_rate DOUBLE,
Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';
FROM
covid_db.covid_staging
INSERT INTO TABLE covid_db.covid_ds_partitioned PARTITION(COUNTRY_NAME)
SELECT *,Country WHERE Country is not null;
CREATE EXTERNAL TABLE covid_db.covid_final_output
(
TOP_DEATH STRING,
TOP_TEST STRING
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by ','
STORED as TEXTFILE
LOCATION '/user/cloudera/ds/COVID_FINAL_OUTPUT';
`
1 Answer
mwg9r5ms #1
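First: you are inspecting the file contents. Partition columns are not stored in the data files; they are stored in the metadata. Dynamically created partitions also get directories named in key=value format. So the last column you see in the file is not the partition column, it is Rank_by_Death_of_Closed_Cases.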
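To check this from Hive rather than from the raw files, something along these lines may help. It is only a sketch: it assumes the table name and location from the question, and that the `dfs` command is available in your Hive CLI or Beeline session.

```sql
-- List the partitions registered in the metastore (one per COUNTRY_NAME value).
SHOW PARTITIONS covid_db.covid_ds_partitioned;

-- The partition column shows up in query results even though it is not in the data files.
SELECT country, rank_by_death_of_closed_cases, country_name
FROM covid_db.covid_ds_partitioned
LIMIT 5;

-- Each dynamic partition is a key=value directory under the table location.
dfs -ls /user/cloudera/ds/COVID_HDFS_PARTITIONED;
```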
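Second: you did not specify a delimiter or a NULL format in the DDL of the second table. The default delimiter is '\001' (Ctrl-A), which is why the files appear to contain strange symbols. You can specify a delimiter, for example tab (\t), and the NULL representation you want in the table DDL.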
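A possible version of the second table's DDL is sketched below. Tab as the delimiter and an empty string as the NULL marker are example choices only, not requirements. The `DROP TABLE` is an assumption about your workflow: `CREATE ... IF NOT EXISTS` leaves an existing definition unchanged, and dropping an external table removes only the metadata, not the files.

```sql
-- Dropping an EXTERNAL table keeps its files; files already written with the old
-- Ctrl-A delimiter should be removed or overwritten before reloading.
DROP TABLE IF EXISTS covid_db.covid_ds_partitioned;

CREATE EXTERNAL TABLE IF NOT EXISTS covid_db.covid_ds_partitioned
(
  Country STRING,
  Total_Cases DOUBLE,
  New_Cases DOUBLE,
  Total_Deaths DOUBLE,
  New_Deaths DOUBLE,
  Total_Recovered DOUBLE,
  Active_Cases DOUBLE,
  Serious DOUBLE,
  Tot_Cases DOUBLE,
  Deaths DOUBLE,
  Total_Tests DOUBLE,
  Tests DOUBLE,
  CASES_per_Test DOUBLE,
  Death_in_Closed_Cases STRING,
  Rank_by_Testing_rate DOUBLE,
  Rank_by_Death_rate DOUBLE,
  Rank_by_Cases_rate DOUBLE,
  Rank_by_Death_of_Closed_Cases DOUBLE
)
PARTITIONED BY (COUNTRY_NAME STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'   -- explicit delimiter instead of the default \001
  NULL DEFINED AS ''          -- write NULLs as empty strings (example choice)
STORED AS TEXTFILE
LOCATION '/user/cloudera/ds/COVID_HDFS_PARTITIONED';
```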
However, if you want to be able to distinguish NULLs from empty strings, it is better not to redefine the NULL format.
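To see whether that distinction matters for your data, a query like the following counts both cases separately. It is a sketch that assumes the table from the question and uses its one STRING data column, `Death_in_Closed_Cases`, as the example.

```sql
-- With the default NULL format, NULLs and empty strings are stored differently,
-- so the two counts can differ. If the NULL format is redefined as '',
-- both are written as empty fields and read back the same way.
SELECT
  SUM(CASE WHEN death_in_closed_cases IS NULL THEN 1 ELSE 0 END) AS null_count,
  SUM(CASE WHEN death_in_closed_cases = ''    THEN 1 ELSE 0 END) AS empty_count
FROM covid_db.covid_ds_partitioned;
```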