Pig: get the top n values for each group

ttygqcqt  posted on 2021-06-03 in Hadoop

I have data that has already been grouped and aggregated. It looks like this:

user    value      count
----    --------  ------
Alice   third      5
Alice   first      11
Alice   second     10
Alice   fourth     2
...
Bob     second     20
Bob     third      18
Bob     first      21
Bob     fourth     8
...

For each user (Alice and Bob), I want to retrieve their top n values (say 2), ranked by "count" in descending order. So the output I want is:

Alice first 11
Alice second 10
Bob first 21
Bob second 20

How can I do this?

2q5ifsrm  #1

I just noticed that in

top    = limit sorted 2;

TOP is a built-in function, so using top as a relation name may throw an error; I renamed that relation. I also changed the statement

generate group, flatten(top);

because it repeated the user name in the output:

(Alice,Alice,first,11)
(Alice,Alice,second,10)
(Bob,Bob,first,21)
(Bob,Bob,second,20)

The amended script:

records = load 'test1.txt' using PigStorage(',') as (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;
top2 = foreach grpd {
        sorted = order records by count desc;
        top1    = limit sorted 2;
        generate flatten(top1);
};

This gave me the output you wanted:

(Alice,first,11)
(Alice,second,10)
(Bob,first,21)
(Bob,second,20)
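
Because TOP is a built-in function, the sort-and-limit pair can probably be replaced by a direct call to it. The following is only a sketch, assuming the same comma-separated test1.txt and schema as in the script above; the alias best is a made-up name. TOP takes the number of tuples to keep, the 0-based index of the column to compare, and the bag. I would not rely on the order of the tuples inside the returned bag, so add an explicit ORDER afterwards if the order matters.

```pig
-- Same comma-separated input and schema as the script above.
records = LOAD 'test1.txt' USING PigStorage(',') AS (user:chararray, value:chararray, count:int);
grpd = GROUP records BY user;

top2 = FOREACH grpd {
        -- TOP(n, column_index, bag): column index 2 is 'count' (0-based).
        -- 'best' is a hypothetical alias, chosen to avoid clashing with TOP.
        best = TOP(2, 2, records);
        GENERATE FLATTEN(best);
};

DUMP top2;
```

This keeps at most two tuples per user without sorting the whole bag first, which may matter when the groups are large.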

Hope this helps.

jv4diomz  #2

One way to do it:

records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray,value:chararray,counter:int);
grpd = GROUP records BY user;

top3 = foreach grpd {
        sorted = order records by counter desc;
        top    = limit sorted 2;
        generate group, flatten(top);
};

The input is:

Alice   third   5 
Alice   first   11 
Alice   second  10
Alice   fourth  2
Bob second  20
Bob third   18
Bob first   21
Bob fourth  8

The output is:

(Alice,Alice,first,11)
(Alice,Alice,second,10)
(Bob,Bob,first,21)
(Bob,Bob,second,20)
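
The question asks for the top n in general, with 2 only as an example. One possible way to avoid hard-coding the limit is Pig's parameter substitution. This is a sketch under assumptions: the parameter name n, the aliases top_n and top_n_per_user, and the script file name used below are all made up for illustration. The input path and schema are the ones from the script above. The generate statement omits group, so the user name appears only once per row.

```pig
-- Hypothetical parameter 'n'; falls back to 2 when -param n=... is not supplied.
%default n 2

records = LOAD '/user/nubes/ncdc/micro-tab/top.txt' AS (user:chararray, value:chararray, counter:int);
grpd = GROUP records BY user;

top_n_per_user = FOREACH grpd {
        sorted = ORDER records BY counter DESC;
        -- $n is substituted textually before the script is parsed.
        top_n  = LIMIT sorted $n;
        -- 'group' is left out because the flattened tuples already contain user.
        GENERATE FLATTEN(top_n);
};

DUMP top_n_per_user;
```

Assuming the script is saved as topn.pig, it could be run with a different limit as pig -param n=3 topn.pig.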
