Combining union and join in Apache Pig

ffscu2ro  posted on 2021-05-29  in Hadoop

I have two files in HDFS containing the following data. file1:

id,name,age
1,x1,15
2,x2,14
3,x3,16

file2:

id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C

I want to produce the following output:

id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C

I am using Apache Pig for this. Is it possible to get the above output in Pig? It is a kind of union and join at the same time.

px9o7tmv1#

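-- Load both files as comma-separated records.
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int, name:chararray, age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);

-- A full outer join keeps ids that appear in either file.
-- Resulting fields: $0..$2 come from u2 (id, name, grades), $3..$5 from u1 (id, name, age).
uj = join u2 by id full outer, u1 by id;

-- Take id and name from whichever side is not null.
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;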

eni9jsuy2#

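-- Load both files; age is read as chararray here.
A = load 'pdemo/File1' using PigStorage(',') as (id:int, name:chararray, age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as (id:int, name:chararray, grades:chararray);

-- The left outer join keeps every row of A; the right outer join keeps every row of B.
lj = join A by id left outer, B by id;
rj = join A by id right outer, B by id;

-- Project both joins onto the same four columns, taking id and name from the preserved side.
lj1 = foreach lj generate A::id as id, A::name as name, A::age as age, B::grades as grades;
rj1 = foreach rj generate B::id as id, B::name as name, A::age as age, B::grades as grades;

-- Rows matched in both files appear in both joins, so drop the duplicates after the union.
res = union lj1, rj1;
FinalResult = distinct res;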

In terms of performance, the following second approach is better:

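-- Collect every (id, name) pair that occurs in either file.
A1 = foreach A generate id, name;
B1 = foreach B generate id, name;

M2 = union A1, B1;
M2 = distinct M2;

-- Attach age from A, then grades from B; values missing on either side stay null.
M2A = join M2 by id left outer, A by id;
M2AB = join M2A by M2::id left outer, B by id;

Res = foreach M2AB generate M2A::M2::id as id, M2A::M2::name as name, M2A::A::age as age, B::grades as grades;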

Hope this helps!!

pvcm50d13#

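Since you can do both unions and joins in Pig, this is certainly possible.
Without going into the exact syntax, I can tell you that this should work (I have used a similar solution in the past).
Suppose we have A and B.
Take the first two columns of A and B as A2 and B2
Union A2 and B2 into M2
Apply distinct to M2
Now you have your "index" matrix, and we only need to add the extra columns.
Left join M2 with A and with B
Generate the relevant columns
That's it!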
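
As a rough illustration, the steps above might look like this in Pig Latin. This is only a sketch: the paths 'file1' and 'file2', the schemas, and the lowercase relation names (including the intermediate m2u, m2a and m2ab) are assumptions based on the sample data in the question, and it assumes the files contain no header line.

```pig
-- Assumed paths and schemas, based on the sample data in the question.
a = LOAD 'file1' USING PigStorage(',') AS (id:int, name:chararray, age:int);
b = LOAD 'file2' USING PigStorage(',') AS (id:int, name:chararray, grades:chararray);

-- Take the first two columns of each input.
a2 = FOREACH a GENERATE id, name;
b2 = FOREACH b GENERATE id, name;

-- Union and de-duplicate to build the "index" relation.
m2u = UNION a2, b2;
m2 = DISTINCT m2u;

-- Left join the index with a, then with b.
m2a = JOIN m2 BY id LEFT OUTER, a BY id;
m2ab = JOIN m2a BY m2::id LEFT OUTER, b BY id;

-- Keep only the relevant columns.
res = FOREACH m2ab GENERATE m2a::m2::id AS id, m2a::m2::name AS name, m2a::a::age AS age, b::grades AS grades;

DUMP res;
```

Pig does not guarantee the row order of the result, so an ORDER BY on id would be needed if the output has to be sorted. Note also that if the same id carries different names in the two files, the distinct step keeps both (id, name) pairs and the result will contain two rows for that id.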
