Apache Pig: unable to perform group and count

Posted by of1yzvn4 on 2021-05-29 in Hadoop

I am new to Pig. Please help me solve this problem; I cannot see where I went wrong.
My data:

    (catA,myid_1,2014,store1,appl)
    (catA,myid_2,2014,store1,milk)
    (catA,myid_3,2014,store1,appl)
    (catA,myid_4,2014,store1,milk)
    (catA,myid_5,2015,store1,milk)
    (catB,myid_6,2014,store2,milk)
    (catB,myid_7,2014,store2,appl)

This is the expected result:

    (catA,2014,milk,2)
    (catA,2014,apple,2)
    (catA,2015,milk,1)
    (catB,2014,milk,1)
    (catB,2014,apple,1)

I need to count the food items by category and year. Below is my Pig script:

    list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
    list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
    StoreG = GROUP list_of BY (category,my_date,my_store);
    result = FOREACH StoreG
    {
        food_list = FOREACH list_of GENERATE item;
        food_count = DISTINCT food_list;
        GENERATE FLATTEN(group) AS (category,my_date,my_store),COUNT(food_count);
    }
    DUMP result;

The output of the script above is:

    (catA,2014,store1,2)
    (catA,2015,store1,1)
    (catB,2014,store2,2)

Can anyone tell me what is wrong with my script? Thanks.


Answer 1 (sd2nnvve)

Here is one approach. It is not the most elegant, but it works:

    list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
    list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
    -- Include item in the grouping key so each item gets its own group.
    StoreG = GROUP list_of BY (category,my_date,my_store,item);
    -- Count the rows that fall into each (category, my_date, my_store, item) group.
    result = FOREACH StoreG GENERATE
        group.category AS category,
        group.my_date AS my_date,
        group.my_store AS my_store,
        group.item AS item,
        COUNT(list_of.item) AS nb_items;
    DUMP result;

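Adding the item alias to the GROUP BY key produces one group per distinct item, so the count is computed for each item separately. This takes the place of the step inside your braces that finds the distinct items and then counts them.
If you still want to use your own code, you only need to add food_list.item to the GENERATE clause, as in the following code: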

    result = FOREACH StoreG
    {
        food_list = FOREACH list_of GENERATE item;
        food_count = DISTINCT food_list;
        GENERATE FLATTEN(group) AS (category,my_date,my_store),food_list.item,COUNT(food_count);
    }

Answer 2 (u3r8eeie)

    StoreG = GROUP list_of BY (category,my_date,my_store);

should be

    StoreG = GROUP list_of BY (category,my_date,item);

because your expected result is grouped by item, not by store.
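
For reference, the corrected grouping key could be put into the full script roughly as sketched below. This is only one way to write it. The alias names shop_data, grouped and nb_items are illustrative and do not come from the question. The sketch also assumes my_store can be dropped from the projection, since it does not appear in the expected result.

```pig
-- Load the comma-separated input; 'shop' is the path used in the question.
shop_data = LOAD 'shop' USING PigStorage(',')
    AS (category:chararray, id:chararray, mdate:chararray,
        my_store:chararray, item:chararray);

-- Keep only the fields that appear in the expected result.
-- (Assumption: my_store is not needed downstream.)
list_of = FOREACH shop_data GENERATE
    category,
    SUBSTRING(mdate, 0, 4) AS my_date,
    item;

-- One group per (category, year, item) combination.
grouped = GROUP list_of BY (category, my_date, item);

-- COUNT over the grouped bag gives the number of rows in each group.
result = FOREACH grouped GENERATE
    FLATTEN(group) AS (category, my_date, item),
    COUNT(list_of) AS nb_items;

DUMP result;
```

With the seven sample rows from the question, this should yield five tuples with counts 2, 2, 1, 1 and 1. The row order of DUMP is not guaranteed, and item names appear exactly as they are spelled in the input.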
