hadoop—在hive中将json数据从一个表插入到另一个表

zbdgwd5y  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(460)

我想根据数据上的关键字段将json数据从一个表插入到其他表中。
我的数据是这样的
{“rtype”:{“ver”:“1”,“os”:“ms”,“type”:“ns”,“vehicle”:“mh-3412”,“mod”:{“version”:[{“”:{“xyz”:“123.dfer”,“founder”:“3.0”,“ght”:“florida”,“fashion”:“fg45”,“cdc”:“new”,“dof”:“yes”,“ts”:“2000-04-01t00:00.171z”}}}}
{“rtype”:{“ver”:“1”,“os”:“ms”,“type”:“ns”,“vehicle”:“mh-3412”,“mod”:{“version”:[{“gap”:{“xvy”:“123.dfer”,“fah”:“3.0”,“ght”:“florida”,“fashion”:“fg45”,“cdc”:“new”,“dof”:“yes”,“ts”:“2000-04-01t00:00.171z”}}}}
{“rtype”:{“ver”:“1”,“os”:“ms”,“type”:“ns”,“vehicle”:“mh-3412”,“mod”:{“version”:[{“box”:{“vog”:“123.dfer”,“fah”:“3.0”,“fax”:“florida”,“fashion”:“fg45”,“cdc”:“new”,“dof”:“yes”,“ts”:“2000-04-01t00:00.171z”}}}}
这里基于版本,不管是“box”还是“gap”还是“”,我想将特定json行上的字段填充到另一个表中
例如:如果版本是“gap”,则填充一个表中的特定行如果是“box”,则填充到另一个表中。。。我是说所有的一排盒子。。。
如何使用hive实现这一点。请帮忙。
注意:我的json数据在一个表中作为string类型的列

clj7thdc

clj7thdc1#

演示

create table src (myjson string);

insert into src values
    ('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"ABC":{"XYZ":"123.dfer","founder":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
   ,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"GAP":{"XVY":"123.dfer","FAH":"3.0","GHT":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
   ,('{"Rtype":{"ver":"1","os":"ms","type":"ns","vehicle":"Mh-3412","MOD":{"Version":[{"BOX":{"VOG":"123.dfer","FAH":"3.0","FAX":"Florida","fashion":"fg45","cdc":"new","dof":"yes","ts":"2000-04-01T00:00:00.171Z"}}]}}}')
;

create table trg_abc (myjson string);
create table trg_gap (myjson string);
create table trg_box (myjson string);
from src
insert into trg_abc select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].ABC') is not null
insert into trg_gap select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].GAP') is not null
insert into trg_box select myjson where get_json_object(myjson,'$.Rtype.MOD.Version[0].BOX') is not null
;
omjgkv6w

omjgkv6w2#

首先,您需要将数据作为json存储在配置单元表中:
我假设您的配置单元表是外部的(通常是-请检查它) SHOW CREATE TABLE your_table ).
例如,如果是这样,整个数据集就位于某个hdfs/s3路径中 s3a://your_bucket/your_jsons_location/ 下载json-udf-1.3.7-jar-with-dependencies.jar并运行 ADD JARS s3a://your_bucket/lib/json-udf-1.3.7-jar-with-dependencies.jar; 然后必须为每个json模式创建一个专用的json表:

CREATE EXTERNAL TABLE boxes
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<BOX:struct<VOG:string,FAH:string,FAX:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';

CREATE EXTERNAL TABLE gaps
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<GAP:struct<XVY:string,FAH:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_bucket/your_jsons_location/';

CREATE EXTERNAL TABLE abcs
(Rtype struct<ver:string,os:string,type:string,vehicle:string,MOD:struct<Version:array<struct<ABC:struct<XYZ:string,founder:string,GHT:string,fashion:string,cdc:string,dof:string,ts:string>>>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 's3a://your_jsons_location/';

现在如果你要跑:

SELECT * FROM boxes;
SELECT * FROM gaps;
SELECT * FROM abcs;

您将看到每个表只正确地解析了匹配的json(根据create语句中指定的模式)。每个表中不匹配的为空。
要筛选出不相关的记录,请运行: SELECT * FROM s WHERE Rtype.mod.version[0]. IS NOT NULL; 注意:这个完整的解释假设json存储在hive表的外部(我特别使用了s3,但也可以是hdfs)

相关问题