Extract a value that appears after a specific word in a string

Posted by vjrehmav on 2021-06-27 in Hive

A JSON document is passed in as a string, and I need to extract the value of content_id for further mapping. Sample data:

{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25,  "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}

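The parameters are dynamic, so I cannot extract the value with substr at a fixed position, nor by counting a fixed number of occurrences of a particular character.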


smdncfj3 1#

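The JSON in your example is malformed: it contains an extra ] after the array closes, and a leftover tail after the closing }. For valid JSON you can use get_json_object, for example: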

select get_json_object(src_json,'$.url.content_id') from
    (
     select '{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25,  "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36], "packager_path": "/opt/bento4"}}' as src_json 
     )s
    ;

Result:

OK
1000231205
Time taken: 21.606 seconds, Fetched: 1 row(s)
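
If some rows may carry malformed JSON like the sample in the question, one option is to try get_json_object first and fall back to a regex when it yields NULL (the Hive documentation states that get_json_object returns NULL for an invalid JSON string). A minimal sketch, using a shortened, hypothetical version of the malformed sample that keeps the extra ]:

```sql
-- Sketch only: the inline string is a trimmed stand-in for the question's data.
-- get_json_object is expected to return NULL here because the JSON is invalid,
-- so coalesce falls through to the regex-based extraction.
select coalesce(
         get_json_object(src_json, '$.url.content_id'),
         regexp_extract(src_json, '"content_id":\\s*"(\\d+)"', 1)
       ) as content_id
from (
  select '{"url": {"content_id": "1000231205", "non_ad_time_intervals": [2330.68, 2898.36]]}}' as src_json
) s;
```

This keeps the JSON parser as the primary path for well-formed rows and uses pattern matching only where parsing fails.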

cgvd09ve 2#

You can use the regexp_extract function in Hive with a matching regex to extract only the digits from content_id.
Example:

select regexp_extract(col1,'"content_id":\\s"(\\d+)"',1) from (
select string('{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25,  "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}')col1
)t;
+-------------+--+
|     _c0     |
+-------------+--+
| 1000231205  |
+-------------+--+
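
The pattern in the query above uses \\s, which matches exactly one whitespace character after the colon. If the spacing in the incoming strings can vary, a slightly more tolerant variant might use \\s* on both sides of the colon. A sketch with two hypothetical inline rows, one with no space and one with several:

```sql
-- \\s* matches zero or more whitespace characters, so both spacings are accepted.
-- The two inline rows are made-up test strings, not data from the question.
select regexp_extract(col1, '"content_id"\\s*:\\s*"(\\d+)"', 1) as content_id
from (
  select '{"content_id":"1000231205"}' as col1
  union all
  select '{"content_id" :   "1000231205"}' as col1
) t;
```

Both rows should yield 1000231205, since only the digits inside the quotes are captured by group 1.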

Regex explanation:

"content_id":\\s"(\\d+)" // match the literal "content_id": followed by one whitespace character, then capture the digits inside the quotes

ulmd4ohb 3#

I found an expensive way to do it, by combining regex and substring functions:

substr(split(regexp_extract(message,'content_id([^&]*)'), '"')[3],1) as content_id
