雅典娜中json的配置单元数组读取错误,不能取消

myss37ts  于 2021-06-25  发布在  Hive
关注(0)|答案(1)|浏览(348)

我有个专栏叫 uf 包含json对象数组的。这是一个模型:

[
    {"type": "browserId", "name": "", "value": "unknown"},
    {"type": "campaign", "name": "", "value": "om_227dec0082a5"},
    {"type": "custom", "name": "2351350529", "value": "10148"},
    {"type": "custom", "name": "9501713387", "value": "true"},
    {"type": "custom", "name": "9517735577", "value": "true"},
    {"type": "custom", "name": "9507402548", "value": "true"},
    {"type": "custom", "name": "9733902068", "value": "true"}
]

我试图把它作为儿童唱片,但由于某些原因,我找不到正确的方法来取消它第一。然后我注意到我的整个数组被 Package 到另一个json对象中。。
我就在这里:
我尝试了简单的选择,结果是:

[{type=[{"type": "browserId", "name": "", "value": "ff"}, name=null, value=null}]

此列的定义如下:

{
    "Name": "uf",
    "Type": "array<struct<type:string,name:string,value:string>>"
}

定义是否不正确,这就是为什么我将整个数组 Package 在另一个json数组中?
--编辑这里是我的csv文件(制表符分隔)的例子。在过去的两天里,我一直在试图弄清楚是不是json让glue无法将列识别为json数组,但我用简单的json数组创建了一个新列,并将其正确地分配为 array<struct 但在询问之后,我得到了与上面完全相同的问题

timestamp   project_id  campaign_id experiment_id   variation_id    layer_holdback  audience_names  end_user_id uuid    session_id  snippet_revision    user_ip user_agent  user_engine user_engine_version referer global_holdback event_type  event_name  uf  active_views    event_features  event_metrics   event_uuid
1570326511  74971132    11089500404 11097730080 11078120202 false   []  oeu1535997971348r0.4399811351004357     AUTO    6540    5.91.170.0  Mozilla/5.0 (Linux; Android 7.0; SAMSUNG SM-G925F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/9.2 Chrome/67.0.3396.87 Mobile Safari/537.36  js  0.128.0 https://www.zavamed.com/uk/account/ false   view_activated  10832783364 [{"type": "browserId", "name": "", "value": "unknown"}, {"type": "device", "name": "", "value": "mobile"}, {"type": "device_type", "name": "", "value": "phone"}, {"type": "referrer", "name": "", "value": "https:\/\/www.google.co.uk\/"}, {"type": "source_type", "name": "", "value": "campaign"}, {"type": "currentTimestamp", "name": "", "value": "-1631518596"}, {"type": "offset", "name": "", "value": "-60"}]    []  []  []  4926a5f1-bbb5-4553-9d0b-b26f773fa0f4
b1zrtrql

b1zrtrql1#

我上传了一个示例csv文件到s3与您提供的内容。然后我在上面运行了一个胶水爬虫。下面是我最终得到的一个表模式:

CREATE EXTERNAL TABLE `question_58765672`(
  `timestamp` bigint, 
  `project_id` bigint, 
  `campaign_id` bigint, 
  `experiment_id` bigint, 
  `variation_id` bigint, 
  `layer_holdback` boolean, 
  `audience_names` array<string>, 
  `end_user_id` string, 
  `uuid` string, 
  `session_id` string, 
  `snippet_revision` bigint, 
  `user_ip` string, 
  `user_agent` string, 
  `user_engine` string, 
  `user_engine_version` string, 
  `referer` string, 
  `global_holdback` boolean, 
  `event_type` string, 
  `event_name` bigint, 
  `uf` string, 
  `active_views` array<string>, 
  `event_features` array<string>, 
  `event_metrics` array<string>, 
  `event_uuid` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://__S3_PATH_IN_MY_BUCKET__/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0', 
  'CrawlerSchemaSerializerVersion'='1.0', 
  'UPDATED_BY_CRAWLER'='some-crawler', 
  'areColumnsQuoted'='false', 
  'averageRecordSize'='553', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'='\t', 
  'objectCount'='1', 
  'recordCount'='2', 
  'sizeKey'='1109', 
  'skip.header.line.count'='1', 
  'typeOfData'='file')

正如你所看到的,这列 uf 作为弦,我并不感到惊讶。为了取消此列的注解,我必须手动将其转换为正确的类型 ARRAY(JSON) :

SELECT
    "timestamp",
    _unnested_column
FROM
    "stackoverflow"."question_58765672",
    UNNEST( CAST(json_parse(uf) AS ARRAY(JSON)) ) AS t(_unnested_column)

结果:

timestamp   _unnested_column
1   1570326511  {"name":"","type":"browserId","value":"unknown"}
2   1570326511  {"name":"","type":"device","value":"mobile"}
3   1570326511  {"name":"","type":"device_type","value":"phone"}
4   1570326511  {"name":"","type":"referrer","value":"https://www.google.co.uk/"}
5   1570326511  {"name":"","type":"source_type","value":"campaign"}
6   1570326511  {"name":"","type":"currentTimestamp","value":"-1631518596"}
7   1570326511  {"name":"","type":"offset","value":"-60"}

然后我想创建一个雅典娜视图,在哪里,专栏 uf 正确铸造:

CREATE OR REPLACE VIEW question_58765672_v1_json AS
SELECT
    CAST(json_parse(uf) AS ARRAY(JSON)) as uf
    -- ALL other columns from your table
FROM 
    "stackoverflow"."question_58765672"

但是,我得到了以下错误:
列uf的列类型无效:不支持的配置单元类型:json
我猜,这是列的模式吗 uf 要么太复杂,胶水爬虫,以正确地识别它或只是简单地不支持所使用的serde,即。 'org.apache.hadoop.mapred.TextInputFormat' 或者 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' .

相关问题