How to perform a GROUP BY and then use DISTINCT on another column in Pig

izj3ouym  posted on 2021-06-02  in Hadoop

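I have just started learning Pig and need some help with the question below. Thanks in advance!
For example, I have the following input:
Occupation    Category   Name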

Actress       Acting     Marion Cotillard
Actor         Acting     Liam Nelson
Tennis Plyr   Athletics  Roger Federer
Football Plyr Athletics  Neymar
Actor         Acting     Tom Hanks
Actress       Acting     Elizabeth Banks
US Senator    Politics   Elizabeth Warren
Football Plyr Athletics  Mesut Ozil

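I want to know how many distinct occupations each category has. For example, Acting has two: Actor and Actress, so the result for Acting would be 2. The problem I am facing: I cannot apply DISTINCT to the Occupation column on the output of the group by Category :(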

zsohkypk #1

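Apply DISTINCT first, then group by category. Assuming the data has already been loaded into relation A:
Project the two columns needed (Occupation and Category) after loading.
Apply DISTINCT to that relation.
Group by Category.
Count the occupations in each category.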

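-- Assumes A was loaded with fields named Occupation, Category and Name.
B = FOREACH A GENERATE Occupation, Category;
C = DISTINCT B;              -- one row per (Occupation, Category) pair
D = GROUP C BY Category;
E = FOREACH D GENERATE group AS Category, COUNT(C.Occupation) AS occupation_count;
DUMP E;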
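
To sanity-check the expected counts without a cluster, the same distinct-then-group dataflow can be mirrored in plain Python. This is only an illustrative sketch, not Pig: the `rows` list is the sample input from the question typed in by hand, and `count_distinct_then_group` is a hypothetical helper name.

```python
from collections import Counter

# Sample input from the question as (occupation, category, name) tuples.
rows = [
    ("Actress", "Acting", "Marion Cotillard"),
    ("Actor", "Acting", "Liam Nelson"),
    ("Tennis Plyr", "Athletics", "Roger Federer"),
    ("Football Plyr", "Athletics", "Neymar"),
    ("Actor", "Acting", "Tom Hanks"),
    ("Actress", "Acting", "Elizabeth Banks"),
    ("US Senator", "Politics", "Elizabeth Warren"),
    ("Football Plyr", "Athletics", "Mesut Ozil"),
]


def count_distinct_then_group(rows):
    """Hypothetical helper: distinct (occupation, category) pairs, then count per category."""
    # Project the two columns and de-duplicate them with a set.
    pairs = {(occ, cat) for occ, cat, _name in rows}
    # Group the surviving pairs by category and count them.
    return dict(Counter(cat for _occ, cat in pairs))


result = count_distinct_then_group(rows)
print(sorted(result.items()))
```

On the sample data this gives 2 for Acting, 2 for Athletics and 1 for Politics, which is what the Pig script should report as well.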

2mbi3lxu #2

Try this:

x = LOAD '<data>' USING PigStorage('\t') AS (occupation:chararray, category:chararray, name:chararray);

x_grouped = GROUP x BY category;

-- For each category, de-duplicate the occupations inside the group, then count them.
x_grouped_distinct = FOREACH x_grouped {
    occs = DISTINCT x.occupation;
    GENERATE group, occs, COUNT(occs);
};

DUMP x_grouped_distinct;
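
This answer reverses the order: group first, then de-duplicate inside each group, and it also emits the distinct occupations alongside the count. A rough plain-Python analogue, under the same assumptions as any hand-written sketch (the `rows` list is the question's sample data and `distinct_per_group` is a hypothetical name, not a Pig function), might look like this:

```python
from collections import defaultdict

# Sample input from the question as (occupation, category, name) tuples.
rows = [
    ("Actress", "Acting", "Marion Cotillard"),
    ("Actor", "Acting", "Liam Nelson"),
    ("Tennis Plyr", "Athletics", "Roger Federer"),
    ("Football Plyr", "Athletics", "Neymar"),
    ("Actor", "Acting", "Tom Hanks"),
    ("Actress", "Acting", "Elizabeth Banks"),
    ("US Senator", "Politics", "Elizabeth Warren"),
    ("Football Plyr", "Athletics", "Mesut Ozil"),
]


def distinct_per_group(rows):
    """Hypothetical helper: group by category, then de-duplicate occupations per group."""
    # Group step: collect every occupation seen for each category.
    grouped = defaultdict(list)
    for occ, cat, _name in rows:
        grouped[cat].append(occ)
    # Nested distinct + count: one (sorted unique occupations, count) pair per category.
    return {cat: (sorted(set(occs)), len(set(occs))) for cat, occs in grouped.items()}


per_group = distinct_per_group(rows)
for cat, (occs, n) in sorted(per_group.items()):
    print(cat, occs, n)
```

Both orderings produce the same counts; this one is convenient when the list of distinct occupations is wanted in the output too.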
