Pig regex extract, then filter the unnamed regex tuple

1dkrff03  posted on 2021-06-21  in  Pig

I have a string:

[["structure\/","structure\/home_page\/","structure\/home_page\/headline_list\/","structure\/home_page\/latest\/","topic\/","topic\/location\/","topic\/location\/united_states\/","topic\/location\/united_states\/ohio\/","topic\/location\/united_states\/ohio\/franklin\/","topic\/news\/","topic\/news\/politics\/","topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]

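I want to regex-extract all the elements and turn them into elements of a tuple, separated by ",". Then I need to keep only the ones that contain neither structure nor location. However, I get an error saying that a regex type cannot be filtered. Any ideas? By the way, the end goal is to parse out the longest hierarchy, such as (topic|news|politics|elections|primary). Updated script: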

data = load '/web/visit_log/20160303'
            USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as json:map[];
a = foreach data generate json#'section' as sec_type;
-- note: act_flt and act_type are not defined in this snippet; they presumably correspond to a and sec_type
b = foreach act_flt GENERATE ..host, REGEX_EXTRACT_ALL(act_type, 'topic..(?!location)(.*?)"') as extr;
store b into '/user/tad/sec_hir';
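
To make the end goal concrete, here is a minimal sketch in Python rather than Pig. It only illustrates the intended result and is not a fix for the script above. It assumes the string is valid JSON, uses a shortened copy of the data, and the function name longest_hierarchy is made up for this example.

```python
import json

# Shortened copy of the string from the question (assumed to be valid JSON).
raw = (
    r'[["structure\/","structure\/home_page\/","topic\/","topic\/location\/",'
    r'"topic\/location\/united_states\/","topic\/news\/","topic\/news\/politics\/",'
    r'"topic\/news\/politics\/elections\/,topic\/news\/politics\/elections\/primary\/"]]'
)


def longest_hierarchy(raw, excluded=("structure", "location")):
    """Return the levels of the deepest path that contains no excluded word."""
    paths = []
    for group in json.loads(raw):  # the outer list holds one inner list
        for item in group:
            # The last item in the question holds two paths joined by a comma.
            paths.extend(item.split(","))
    # Keep only paths that mention none of the excluded words.
    kept = [p for p in paths if not any(word in p for word in excluded)]
    # The deepest path is the one with the most "/" separators.
    longest = max(kept, key=lambda p: p.count("/"), default="")
    # Drop the empty piece left by the trailing slash.
    return [level for level in longest.split("/") if level]


print("|".join(longest_hierarchy(raw)))  # topic|news|politics|elections|primary
```

The same steps (split into elements, drop the ones containing structure or location, keep the deepest) are what the Pig script would need to reproduce.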

xytpbqjk1#

The syntax of the matches expression in your filter looks incorrect. There do not appear to be any () in the data.

c = filter b by not extr matches '(structure|location)';

Try changing it to

c = filter b by not (extr matches 'structure|location');
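
One thing worth checking after this change: as far as I know, Pig's matches operator uses Java regular expressions and has to match the whole value, so 'structure|location' would only match a value that is exactly structure or exactly location. To test whether a value contains either word, the pattern would need to be '.*(structure|location).*'. Note also that matches works on a chararray, while REGEX_EXTRACT_ALL returns a tuple, which may be where the type error comes from. The Python sketch below models the whole-string behaviour with re.fullmatch; pig_matches is a hypothetical helper written for this illustration, not a Pig API.

```python
import re


def pig_matches(value, pattern):
    # Assumption: Pig's `matches` behaves like Java's String.matches,
    # i.e. the pattern must match the entire value, not just a part of it.
    return re.fullmatch(pattern, value) is not None


sample = "topic/location/united_states/"

# The bare alternation only matches when the value is exactly one of the words.
print(pig_matches(sample, "structure|location"))        # False
print(pig_matches("location", "structure|location"))    # True

# Wrapping the alternation in .* turns it into a "contains" test.
print(pig_matches(sample, ".*(structure|location).*"))  # True
```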
