How to use filters in Pig to format a semi-structured CSV

Posted by qrjkbowd on 2021-06-02 in Hadoop

I have a semi-structured CSV that looks like this.

VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61
VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++
VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE

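I want to build three different tables from this data, i.e. one for the VTS,01 records, one for VTS,99, and another for VTS,66. I also need to strip the "+++" appended to some lines, because it is an error. To do that I wrote this Pig script.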

data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage('\n') as (f1:chararray);
splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+++'));
data_pkt = FILTER splt BY $0 MATCHES '.*VTS,01+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*VTS,99+.*';
health_pkt = FILTER splt BY $2 MATCHES '.*VTS,66+.*';

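When I test the script for each table separately (dump data_pkt; dump sos_pkt; dump health_pkt;), I get output for only one of them and nothing for the others. I am very new to Pig, so could anyone help me solve this? It would be much appreciated.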


#1 qcuzuvrc

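To remove the +++, you need to escape all of the "+" characters, not just the first one. It is not entirely clear what the pluses signify in your data. You can split on them with this regular expression: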

"\\+{3,}"

So, in your Pig script:

splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+{3,}'));
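
If the goal is only to drop the trailing pluses rather than to split on them, Pig's built-in REPLACE may be a simpler fit. The following is a minimal sketch, assuming each line is loaded as a single chararray field named f1 as in the question; the relation name `cleaned` is illustrative and not from the original post.

```pig
-- Assumption: one chararray field per input line, as in the question's LOAD.
data = LOAD '/user/simulator/SKYTRACK/27thMay2015' USING PigStorage('\n') AS (f1:chararray);

-- '[+]+$' matches one or more '+' characters at the end of the line.
-- The character class avoids having to backslash-escape the plus sign.
cleaned = FOREACH data GENERATE REPLACE(f1, '[+]+$', '') AS f1;
```

Lines without trailing pluses pass through unchanged, so every record still has exactly one field afterwards.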

Although aman is right, I would rather use SPLIT than FILTER to separate the datasets:

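-- Load with a comma delimiter so that $1 is the packet-type field.
a = load '/abc.txt' using PigStorage(',');
SPLIT a INTO
    b01 IF (chararray)$1 == '01',
    b66 IF (chararray)$1 == '66',
    b99 IF (chararray)$1 == '99';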
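
The same SPLIT idea can also be applied to whole lines, which sidesteps the differing field counts between packet types. This is a rough sketch under a few assumptions: the input path is the one from the question, the names `raw`, `line`, and `other_pkt` are illustrative, and the OTHERWISE branch needs a Pig version that supports it (0.10 or later, as far as I know).

```pig
-- Assumption: TextLoader yields one chararray field per input line.
raw = LOAD '/user/simulator/SKYTRACK/27thMay2015' USING TextLoader() AS (line:chararray);

-- MATCHES must match the entire string, hence the trailing '.*'.
SPLIT raw INTO
    data_pkt   IF line MATCHES 'VTS,01,.*',
    health_pkt IF line MATCHES 'VTS,66,.*',
    sos_pkt    IF line MATCHES 'VTS,99,.*',
    other_pkt  OTHERWISE;  -- anything that is not one of the three packet types
```

Because each condition tests the line itself rather than a positional field such as $1 or $2, no relation depends on how many pieces an earlier STRSPLIT happened to produce.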

#2 agyaoht7

This filters the records based on the value of the second field.

a = load '/abc.txt' using PigStorage(',');
 b1 = FILTER a by $1==01;
 b66 = FILTER a by $1==66;
 b99 = FILTER a by $1==99;

To remove the +++, you would have to write a simple Pig UDF.
Output:

(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++)
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE)

#3 w1jd8yoj

This is working reasonably well now.

data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage(',');

splt = foreach data generate $0 as col0:chararray,$1 as col1:chararray,$2 as col2:chararray,$3 as col3:chararray,$4 as col4:chararray,$5 as col5:chararray,$6 as col6:chararray,$7 as col7:chararray,$8 as col8:chararray,$9 as col9:chararray,$10 as col10:chararray,$11 as col11:chararray,$12 as col12:chararray,$13, FLATTEN(STRSPLIT($14, '\\+++'));

data_pkt = FILTER splt BY $1 MATCHES '.*01+.*';
health_pkt = FILTER splt BY $1 MATCHES '.*66+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*99+.*';

The drawback is that it takes three steps.
