创建具有复杂数据类型的配置单元外部表,并从csv或tsv加载,其中包含具有序列化json对象的少数列

wbgh16ku  于 2021-06-24  发布在  Hive
关注(0)|答案(0)|浏览(180)

我有一个csv(或tsv),其中一列('nw\u day'在下面的示例中)具有序列化的数组对象,另一列('res\u m'在下面的示例中)具有序列化的json对象。它还具有string、timestamp和float数据类型的列。
对于看起来有点像的tsv(显示第一行)

+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
|  com_id  |    w_start_time     |  cap  |                   nw_day                      |                                res_m                                   | 
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
|  dtf_id  | 2019-04-24 06:00:03 | 444.3 | {'Fri','Mon','Sat','Sun','Thurs','Tue','Wed'} | {"some_str":"str_one","some_n":1,"some_t":2019-04-24 06:00:03.700+0000}|
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+

我试过下面的说法,但没有给我完美的结果。

CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
  com_id STRING,
  w_start_time TIMESTAMP,
  cap FLOAT,
  nw_day array <STRING>, 
  res_m STRUCT <
    some_str: STRING,
    some_n: BIGINT,
    some_t: TIMESTAMP
  >)
  COMMENT 's_e_s'
  ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/location/to/folder/containing/csv'
  TBLPROPERTIES ("skip.header.line.count"="1");

所以,我想我反序列化这些对象到配置单元复杂的数据类型与数组和结构。但这并不是我跑步时得到的

select * from table_name limit 1;

这让我

+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|  com_id  |    w_start_time     |  cap  |                                   nw_day                             |                                                          res_m                                                             | 
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|  dtf_id  | 2019-04-24 06:00:03 | 444.3 | ["{'Fri'"," 'Mon'"," 'Sat'"," 'Sun'"," 'Thurs'"," 'Tue'"," 'Wed'}"]  | {"some_str":"{\"some_str\":\"str_one\",\"some_n\":1,\"some_t\":2019-04-24 06:00:03.700+0000}\","some_n":null,"some_t":null}|
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+

因此,它将整个对象看作一个字符串,并用分隔符拆分字符串。
我需要一些帮助,了解如何将数据从csv/tsv加载到配置单元中的复杂数据类型。
我发现了一个类似的问题,但是需求有点不同,并且没有涉及复杂的数据类型。
任何帮助都将不胜感激。如果不能做到这一点,并且必须在加载之前包含预处理步骤,那么hive中复杂数据类型加载的输入数据示例将对我有所帮助。提前谢谢!

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题