How do I list the tuples that contain a specific string?

ddhy6vgd  posted on 2021-06-25  in Pig
Input:
Visit ID             Events
101                154,2,135
124                1, 120, 1050,2302
139                200, 150, 1, 320
140                30023, 200

I'm new to Pig. I'd like to know how to use a Pig script to list the rows, including the Visit ID, whose Events contain "1".
Thanks!
The code I tried:

a = LOAD '/user/a6000518-a/AdobeHourlySampleHit/hit_data.tsv' using PigStorage('\t');  
b= foreach a GENERATE REGEX_EXTRACT_ALL($2, '(.*,1,.*|1,.*|.*,1)') as post_event_list;
c= FILTER b BY $0 is not NULL;
d= DISTINCT c;
dump d;

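This prints only the Events column of the rows that contain "1". If I also generate the Visit ID, the result is incorrect. I want to print the Visit ID together with the events that contain "1".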

jqjz2hbq  #1

You can write a Python UDF that checks whether the value in question is present in the tuple of events; that may make things much simpler.
Python UDF:


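#!/usr/bin/python

@outputSchema("flg:int")
def tuple_contains(tup, val):
    # Return 1 if val is one of the fields of tup, otherwise 0.
    try:
        return 1 if val in tup else 0
    except Exception:
        # e.g. tup is None because the event list was null
        return 0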

Script:

REGISTER /path/to/jars/tuple_contains.py USING jython AS udf;

data = LOAD 'data' AS (visit_id:chararray, event_list:chararray);
A = FILTER data BY udf.tuple_contains(STRSPLIT(event_list, ','), '1') == 1;
B = FOREACH A GENERATE visit_id, event_list;  -- ... other columns
DUMP B;

Output:

124    1,120,1050,2302
139    200,150,1,320
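
The sample input in the question has spaces after some commas (for example "200, 150, 1, 320"), so a plain split on ',' would yield " 1" rather than "1". Below is a minimal sketch of a variant UDF that takes the raw event string and strips whitespace from each token. The name contains_event is hypothetical, and the outputSchema fallback is only an assumption that lets the file run outside Pig as well.

```python
# Pig provides outputSchema when the file is registered with "USING jython".
# This fallback is only here so the file can also be run as plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("flg:int")
def contains_event(event_list, val):
    """Return 1 if val is one of the comma-separated tokens in event_list."""
    if event_list is None:
        return 0
    # Strip surrounding whitespace so " 1" and "1" are treated the same.
    tokens = [tok.strip() for tok in event_list.split(',')]
    return 1 if val in tokens else 0
```

Under these assumptions the filter would become FILTER data BY udf.contains_event(event_list, '1') == 1; and the STRSPLIT call would no longer be needed.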

bvuwiixz  #2

I figured it out. It may not be the most efficient way to code this, but the output is accurate.

a = LOAD '/user/a6000518-a/AdobeHourlySampleHit/hit_data.tsv' using PigStorage('\t');  
b= FILTER a BY $283 == '0';  
c= FILTER b BY $298 != '5' AND $298 != '8' AND $298 != '7' AND $298 != '9';
d= FOREACH c GENERATE CONCAT(CONCAT(CONCAT($780,$781),$942),$948) as (visitID:bytearray), $614 as post_event_list, $656 as post_product_list;
e= FILTER d BY post_event_list is not NULL AND post_event_list != '' AND post_event_list != ' ';  -- AND, not OR: with OR, empty and blank lists would still pass
f= FOREACH e GENERATE REGEX_EXTRACT_ALL(post_event_list, '(.*,1,.*|1,.*|.*,1)') as purchase_event, visitID, post_product_list;
g= FILTER f BY $0 is not NULL;
DUMP g;
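
One caveat about the pattern used above: as far as I know, REGEX_EXTRACT_ALL only returns a result when the whole string matches, so '(.*,1,.*|1,.*|.*,1)' misses an event list that is just "1" and any list where the 1 has spaces around it. The following standalone Python 3 check is a rough sketch of that behavior, with re.fullmatch standing in for the whole-string match. The TOLERANT pattern is my own suggestion, not something from the original script.

```python
import re

# Pattern from the Pig script above.
ORIGINAL = re.compile(r'(.*,1,.*|1,.*|.*,1)')
# Hypothetical alternative: optional prefix ending in a comma, the token 1
# with optional surrounding whitespace, then an optional comma-led suffix.
TOLERANT = re.compile(r'(?:.*,)?\s*1\s*(?:,.*)?')


def matches(pattern, event_list):
    """True if the pattern matches the entire event_list string."""
    return pattern.fullmatch(event_list) is not None


samples = ["154,2,135", "1, 120, 1050,2302", "200, 150, 1, 320", "30023, 200", "1"]
for s in samples:
    print(repr(s), matches(ORIGINAL, s), matches(TOLERANT, s))
```

If the tolerant pattern suits your data, the same regular expression should also be usable directly in Pig, for example with the MATCHES operator in a FILTER, though I have not run that against the original file.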
