Field-by-field comparison of Hadoop files in Pig

wrrgggsh, posted on 2021-05-30 in Hadoop

I have two files.
File 1

id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

File 2

id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

When I compare file 1 with file 2, I need output like the following:

1000, sal
1001,code

Basically, it should tell me which field has changed from the previous file, along with the id. Can this be done in Pig?

9gm1akwq1#

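The comparison itself is easy to do in Pig. The harder part is the output format you asked for, which needs some slightly more involved logic.
I have handled most of the edge cases, but you should test the script against your own input to make sure it works for every combination of changed fields.
File 1: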

1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

File 2:

1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

Pig script:

-- Load both files as comma-separated records with the same four fields
A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);

-- Pair up the records that share an id
C = JOIN A BY id, B BY id;

-- For each field, emit its name if the value changed, otherwise an empty string
D = FOREACH C GENERATE A::id AS id,
                       ((A::sal == B::sal) ? '' : 'sal') AS sal,
                       ((A::location == B::location) ? '' : 'location') AS location,
                       ((A::code == B::code) ? '' : 'code') AS code;

-- Drop the records in which no field changed
E = FILTER D BY NOT (sal == '' AND location == '' AND code == '');

-- The two statements below only format the output:
-- strip trailing commas, then collapse any doubled comma
F = FOREACH E GENERATE id, REPLACE(BagToString(TOBAG(sal,location,code),','),'(,,$|,$)','') AS finalOutput;
G = FOREACH F GENERATE id, REPLACE(finalOutput,',,',',');
DUMP G;

Output:

(1000,sal)
(1001,code)
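
To cross-check the Pig result on a small sample before running it on the cluster, the same join-and-compare logic can be sketched in plain Python. This is only an illustrative sketch, not part of the Pig solution: the function name `changed_fields` and the inline `FILE1`/`FILE2` strings are assumptions made for the example, and it assumes header-less input with unique ids, as in the sample files above.

```python
import csv
import io

# Column names, in the same order as the AS (...) clause of the Pig script.
FIELDS = ["id", "sal", "location", "code"]


def changed_fields(old_text, new_text):
    """Return [(id, [names of changed fields])] for ids present in both inputs."""

    def load(text):
        # Index the rows by id; assumes ids are unique within a file.
        rows = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
        return {row["id"]: row for row in rows}

    old, new = load(old_text), load(new_text)
    result = []
    for key, old_row in old.items():
        new_row = new.get(key)
        if new_row is None:
            continue  # like the inner JOIN: ids missing on one side are skipped
        diff = [f for f in FIELDS[1:] if old_row[f] != new_row[f]]
        if diff:  # like the FILTER: keep only records with at least one change
            result.append((key, diff))
    return result


# Sample data copied from the two files above (hypothetical inline copies).
FILE1 = """1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
"""

FILE2 = """1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
"""

for key, diff in changed_fields(FILE1, FILE2):
    print(f"{key},{','.join(diff)}")
# prints:
# 1000,sal
# 1001,code
```

Because the changed field names are collected into a list and joined afterwards, no separator cleanup is needed here. The Pig script needs its two REPLACE steps only because it builds the string from fixed slots that may be empty.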
