hadoop: Extracting a tuple from a Pig bag

gudnpqoy, posted 2021-06-21 in Pig

File contents (test.txt):

Some    specific    column      value: x192.168.1.2     blah       blah
Some    specific    row        value: y192.168.1.3      blah       blah
Some    specific    field      value: z192.168.1.4     blah      blah

Pig query:

A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);

B = foreach A generate data3, data4;

C = filter B by data3 matches 'row';

D = foreach C generate data4;

E = foreach D generate TOKENIZE(data4);

Output:

((value:), (y192.168.1.3))

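Now I want to extract a specific tuple from this output bag, say the second tuple (y192.168.1.3). After that, I want to extract the IP address from it. I am trying to do this with a UDF, but I am stuck.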

nxowjjhe1#

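import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class someClass extends EvalFunc<String>
{
   // Dotted-quad pattern used to pull the IP address out of the token.
   private static final Pattern IP = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");

   public String exec(Tuple input) throws IOException {
     // The first argument is the bag produced by TOKENIZE.
     DataBag bag = (DataBag)input.get(0);
     Iterator<Tuple> it = bag.iterator();
     Tuple tup = null;
     // Advance to the second tuple in the bag.
     for(int i = 0; i < 2; i++)
     {
       tup = it.next();
     }
     String ipString = (String)tup.get(0);
     // Get the IP from the string with a regex; null if none is found.
     Matcher m = IP.matcher(ipString);
     return m.find() ? m.group() : null;
   }
 }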

Of course, you should add some input checks (null input, a bag of size 1, etc.) to make the code robust.
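
The checks mentioned above could look roughly like the following sketch. It uses only the Java standard library so that it can be run on its own: a plain `Iterator` stands in for the iterator of the Pig bag, and the names `BagGuards`, `nth` and `extractIp` are hypothetical helpers, not part of Pig.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BagGuards {
    // Loose dotted-quad pattern; it does not validate octet ranges.
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(?:\\.\\d{1,3}){3}");

    // Returns the element at zero-based position n, or null when the
    // iterator is null, n is negative, or there are fewer than n + 1 elements.
    static <T> T nth(Iterator<T> it, int n) {
        if (it == null || n < 0) {
            return null;
        }
        T current = null;
        for (int i = 0; i <= n; i++) {
            if (!it.hasNext()) {
                return null;
            }
            current = it.next();
        }
        return current;
    }

    // Returns the first dotted-quad substring, or null when the field is
    // null, is not a String, or contains no such substring.
    static String extractIp(Object field) {
        if (!(field instanceof String)) {
            return null;
        }
        Matcher m = IPV4.matcher((String) field);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Stand-in for the tokens of the bag ((value:), (y192.168.1.3)).
        Iterator<String> tokens = Arrays.asList("value:", "y192.168.1.3").iterator();
        System.out.println(extractIp(nth(tokens, 1)));
    }
}
```

Inside a real UDF the same guards would apply to `input`, to the result of `input.get(0)`, and to the tuple taken from the bag; returning null for bad rows is one possible choice, and raising an error is another.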

omhiaaxx2#

You can use the FLATTEN operator to flatten the bag, and then use a filter to keep the token that contains the IP address.

E = foreach C generate flatten(TOKENIZE(data4));
F = filter E by $0 matches '.\\d+\\.\\d+\\.\\d+\\.\\d+';
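
To see which tokens the filter above would keep, the pattern can be tried in plain Java. This is a sketch under the assumption that Pig's `matches` behaves like a whole-string Java regex match; the class name `MatchesDemo` and the method `keepMatching` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class MatchesDemo {
    // Same pattern as in the Pig filter: one arbitrary character
    // followed by four dot-separated digit groups.
    static final Pattern TOKEN = Pattern.compile(".\\d+\\.\\d+\\.\\d+\\.\\d+");

    // Keeps only the tokens that match the pattern as a whole string.
    static List<String> keepMatching(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (TOKEN.matcher(token).matches()) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keepMatching(Arrays.asList("value:", "y192.168.1.3", "blah")));
    }
}
```

Note that the surviving token is still y192.168.1.3, including the leading character, so a further step is needed to isolate the address itself.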

Hope this helps.

plupiseo3#

Here is what I would do.
Pig script

A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate REGEX_EXTRACT($0,'value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*', 1);

Output

(192.168.1.3)

If needed, you can use a more elaborate regexp to extract the IP address: see "Extract IP address from a string using regex".
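
As one example of a stricter pattern, the sketch below accepts only octets in the range 0 to 255 and refuses matches that are glued to neighbouring digits. It is plain Java rather than Pig, and the names `StrictIpv4` and `firstIp` are hypothetical; whether the same pattern string suits REGEX_EXTRACT in a given script would need to be checked separately.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictIpv4 {
    // One octet: 250-255, 200-249, 100-199, or 0-99.
    private static final String OCTET = "(?:25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Four octets separated by dots, not preceded or followed by a digit.
    static final Pattern IPV4 =
        Pattern.compile("(?<!\\d)" + OCTET + "(?:\\." + OCTET + "){3}(?!\\d)");

    // Returns the first valid-looking IPv4 address in s, or null if none.
    static String firstIp(String s) {
        if (s == null) {
            return null;
        }
        Matcher m = IPV4.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstIp("value: y192.168.1.3"));
        System.out.println(firstIp("999.1.1.1"));
    }
}
```

For the sample row this yields 192.168.1.3, while an input such as 999.1.1.1 yields no match because of the range check and the digit boundaries.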
