Write a group's unique key as the folder name and the bag's contents as records?

kognpnkq posted on 2021-06-25 in Pig

Goal: use each group's unique key as the folder name and the contents of its bag as the records.

File : employee.txt

 #JoiningDate   Employee Id     Employee Name
   20140302        1             A
   20140302        2             B
   20140302        3             C
   20140303        4             D
   20140303        5             E
   20140303        6             F

Pig script:

X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);

Y = group X by joining_date;

The output of this (Y) would be:

(20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
(20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})

The goal is to have two folders under the output path:

1. outputfolder/20140302, containing three records:
       20140302,1,A
       20140302,2,B
       20140302,3,C
2. outputfolder/20140303, containing three records:
       20140303,4,D
       20140303,5,E
       20140303,6,F

What I tried:

store Y into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');

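The result was as follows, with each file holding the whole grouped tuple instead of the individual records:

1. outputfolder/20140302/20140302-0
       (20140302, {(20140302,1,A), (20140302,2,B), (20140302,3,C)})
2. outputfolder/20140303/20140303-0
       (20140303, {(20140303,4,D), (20140303,5,E), (20140303,6,F)})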
Answer #1 (brgchamk)

One way is to flatten the bag before the store command, so that each record is written as its own row.

X = load 'employee.txt' using PigStorage('\t') as (joining_date:chararray, employee_id:long, employee_name:chararray);
Y = group X by joining_date;
Z = FOREACH Y GENERATE FLATTEN($1);
store Z into 'outputfolder' using org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');

The output will be stored in the outputfolder/20140302 folder, in files whose names start like 20140302-0,000
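Since flattening the bag of a group simply yields the original rows again, the same layout can probably be produced without the GROUP step at all. The following is a minimal sketch, not a tested script: it assumes piggybank.jar sits in the working directory (the REGISTER path is a placeholder to adjust for your installation) and relies on MultiStorage splitting on the value of the field at the given index of each tuple it receives.

```pig
-- Assumption: piggybank.jar is in the current directory; change the path as needed.
REGISTER piggybank.jar;

X = LOAD 'employee.txt' USING PigStorage('\t')
    AS (joining_date:chararray, employee_id:long, employee_name:chararray);

-- Field 0 (joining_date) is the split key, so every distinct date gets
-- its own subfolder under outputfolder; fields are written comma-separated.
STORE X INTO 'outputfolder'
    USING org.apache.pig.piggybank.storage.MultiStorage('outputfolder', '0', 'none', ',');
```

Keeping the GROUP plus FLATTEN version is still reasonable if the grouped relation is needed for other steps before the store.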
