Select count distinct using Pig Latin

Posted by yqlxgs2m on 2021-06-21 in Pig

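I need help with this Pig script. I am only getting a single record back. I am selecting two columns and doing a count(distinct) on another, with a where clause that matches a specific description (desc).
Below are my SQL and Pig code.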

/*
    For example, in SQL:
    select domain, count(distinct(segment)) as segment_cnt
    from table
    where desc='ABC123'
    group by domain
    order by segment_cnt desc;
*/

A = LOAD 'myoutputfile' USING PigStorage('\u0005')
    AS (
        domain:chararray,
        segment:chararray,
        desc:chararray
    );
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
Answer 1 (oo7oh9g9)

It may be cleaner to define this as a macro:

-- Returns a one-row relation holding the number of distinct values of column c in relation A.
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
  temp = FOREACH $A GENERATE $c;
  dist = DISTINCT temp;
  groupAll = GROUP dist ALL;
  $dist = FOREACH groupAll GENERATE COUNT(dist);
}

Usage:

X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);

If you need to use it inside a FOREACH instead, the easiest way is something like:

... GENERATE COUNT(Distinct(x)) ...

Tested on Pig 0.12.

Answer 2 (xzv2uavs)

If you do not want the count broken down by any group, use this:

G = FOREACH (GROUP A ALL) {
    unique = DISTINCT A.field;
    GENERATE COUNT(unique) AS ct;
};

This will give you just a single number.
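
As a rough illustration, the same pattern could be applied to the question's data to get one overall count of distinct segments for the matching description. This is only a sketch under assumptions: the field is named `description` here rather than `desc`, because `desc` appears in Pig's reserved keyword list, and the alias `segment_total` is a hypothetical name that does not come from the thread.

```pig
-- Path and delimiter are taken from the question.
-- Assumption: the third field is called 'description' instead of 'desc',
-- since 'desc' is a reserved keyword in Pig Latin.
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
    AS (domain:chararray, segment:chararray, description:chararray);

-- Keep only the rows with the wanted description.
B = FILTER A BY description == 'ABC123';

-- Put every remaining row into one group, then count distinct segments in it.
-- 'segment_total' is a hypothetical alias chosen for this sketch.
segment_total = FOREACH (GROUP B ALL) {
    unique_segments = DISTINCT B.segment;
    GENERATE COUNT(unique_segments) AS segment_cnt;
};

DUMP segment_total;
```

Because everything is grouped with ALL, the result has exactly one row, which is also why the script in the question returns only a single record.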

Answer 3 (ndasle7k)

You can group by domain and then use the nested foreach syntax to count the number of distinct elements in each group:

D = group C by domain;
E = foreach D { 
    unique_segments = DISTINCT C.segment;
    generate group, COUNT(unique_segments) as segment_cnt;
};
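
Put together with the load and filter steps from the question, a complete script that mirrors the SQL might look like the sketch below. It assumes the third field is renamed to `description` (since `desc` is listed among Pig's reserved keywords) and uses its own relation letters, so it is meant as an outline, not as the exact script the asker ran.

```pig
-- Path and delimiter are taken from the question.
-- Assumption: 'description' stands in for the question's 'desc' field,
-- because 'desc' is a reserved keyword in Pig Latin.
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
    AS (domain:chararray, segment:chararray, description:chararray);

-- WHERE desc = 'ABC123'
B = FILTER A BY description == 'ABC123';

-- Keep only the columns needed for the count.
C = FOREACH B GENERATE domain, segment;

-- GROUP BY domain, then count the distinct segments inside each group.
D = GROUP C BY domain;
E = FOREACH D {
    unique_segments = DISTINCT C.segment;
    GENERATE group AS domain, COUNT(unique_segments) AS segment_cnt;
};

-- ORDER BY segment_cnt DESC
F = ORDER E BY segment_cnt DESC;

DUMP F;
```

The key difference from the question's script is grouping by domain instead of using GROUP ... ALL, so the output has one row per domain and not one row in total.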
