Hive: loading a CSV with commas inside quoted fields

Posted by 5lwkijsr on 2021-06-03 in Hadoop

I am trying to load a CSV file into a Hive table like this:

CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;

The CSV is delimited by commas (,) and looks like this:

1, "some text, with comma in it", 123, "more text"

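This returns corrupted data, because there is a ',' inside the first string.
Is there a way to set a text delimiter, or to make Hive ignore the ',' inside strings?
I can't change the delimiter of the CSV, since it is pulled from an external source.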
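
To see why the row breaks, the sample line can be split on the comma the same way a plain delimited table would split it. This is only an illustrative sketch using Hive's built-in split() and size() functions; the aliases pieces and n_pieces are made up for the example.

```sql
-- Split the sample row on every comma, ignoring the quotes,
-- as ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' does.
SELECT
  split('1, "some text, with comma in it", 123, "more text"', ',')       AS pieces,
  size(split('1, "some text, with comma in it", 123, "more text"', ',')) AS n_pieces;
-- The row contains four commas, so it yields five pieces for a four-column table.
```

The quoted text is cut in two, every later value shifts one column to the right, and the extra piece has no column to go into.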

h9vpoimq #1

The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe worked for me. My delimiter was '|' and one of the columns was enclosed in double quotes.
Query:

-- The SerDe class name and the property keys are case-sensitive.
-- DATE is a reserved keyword in newer Hive versions, hence the backticks.
CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, `DATE` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
     "separatorChar" = "|",
     "quoteChar"     = "\"",
     "escapeChar"    = "\""
)
STORED AS TEXTFILE LOCATION '/user/abc/csv_folder';

brgchamk #2

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' (see: https://cwiki.apache.org/confluence/display/hive/csv+serde)
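
Applied to the table from the question, this could look roughly like the sketch below. The names mytable_raw and mytable_typed are hypothetical. The SerDe properties shown are the SerDe's documented defaults, written out for clarity. OpenCSVSerde exposes every column as STRING, so the sketch casts the numeric columns in a view, and the TRIM calls are a precaution for the spaces that follow the commas in the sample row.

```sql
-- Staging table parsed by OpenCSVSerde (Hive 0.14+).
-- Columns are declared STRING because the SerDe returns strings only.
CREATE TABLE mytable_raw
(
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable_raw;

-- Hypothetical view that restores the column types from the question.
CREATE VIEW mytable_typed AS
SELECT
  CAST(TRIM(num1) AS INT) AS num1,
  TRIM(text1)             AS text1,
  CAST(TRIM(num2) AS INT) AS num2,
  TRIM(text2)             AS text2
FROM mytable_raw;
```

Queries can then read from mytable_typed, while the file itself stays untouched.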

eivgtgni #3

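Add a backslash in FIELDS TERMINATED BY '\;', so that the Hive CLI does not treat the semicolon as the end of the statement.
For example: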

CREATE  TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS 
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag 
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;

I have tested it, and it worked.

bxfogqkk #4

If you can re-create or parse your input data, you can specify an escape character in the CREATE TABLE statement:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

This will accept the following line as 4 fields:

1,some text\, with comma in it,123,more text
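
Put together with the table from the question, the full statement might look like the sketch below. The table name mytable_escaped and the path /data_escaped.csv are hypothetical, and the file is assumed to have been rewritten already so that embedded commas are backslash-escaped and the surrounding quotes are removed.

```sql
-- Same columns as in the question, plus an escape character.
CREATE TABLE mytable_escaped
(
  num1  INT,
  text1 STRING,
  num2  INT,
  text2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
STORED AS TEXTFILE;

-- Hypothetical path to the re-created, backslash-escaped file.
LOAD DATA LOCAL INPATH '/data_escaped.csv'
OVERWRITE INTO TABLE mytable_escaped;

SELECT num1, text1, num2, text2 FROM mytable_escaped;
```

Note that ESCAPED BY does not strip quotes: any double quotes left in the file would be loaded as part of the string values.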

ma8fv8wu #5

Put the delimiter in single quotes and it will work.

ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

This should work.

x6492ojm #6

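The problem is that Hive doesn't handle quoted text. You either need to pre-process the data by changing the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe that uses OpenCSV to parse the files.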
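
As a hedged alternative to a Hadoop streaming job, the change of delimiter can also be done inside Hive: read the file once through a CSV SerDe and write it back out with a delimiter that does not occur in the data. This swaps the technique named above for a CREATE TABLE AS SELECT. The names csv_staging and mytable_tab and the location /user/abc/csv_staging are hypothetical, and the sketch assumes the text columns never contain tab characters.

```sql
-- Staging table over the directory that holds the quoted CSV file.
CREATE EXTERNAL TABLE csv_staging
(
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/abc/csv_staging';

-- Rewrite the rows as tab-delimited text with the intended column types.
CREATE TABLE mytable_tab
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS
SELECT
  CAST(TRIM(num1) AS INT) AS num1,
  TRIM(text1)             AS text1,
  CAST(TRIM(num2) AS INT) AS num2,
  TRIM(text2)             AS text2
FROM csv_staging;
```

After this one-off conversion, mytable_tab is an ordinary delimited table and later queries no longer go through the CSV SerDe.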
