如何在两个应用了连接的不同表上同时使用max和count函数?

66bbxpm5  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(531)
  1. //Pig Program
  2. User = LOAD 'path' USING PigStorage(',') as (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
  3. Post = LOAD 'path' USING PigStorage(',') as (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, ownerus)er_id:int, title:chararray, answercount:chararray, commentcount:chararray);
  4. JOIN User BY id, Post BY id;
  5. a = JOIN User BY id, Post BY id;
  6. DUMP a;
  7. User_Group = Group a ALL;
  8. Max_reputation = foreach User_Group Generate(User.displayname, User.reputation, Post.id), MAX(User.reputation), COUNT(Post.id);

所以基本上我将两个不同的表分组,即user和post,然后对其应用联接。
问题说明:查找具有最大信誉的用户的显示名和帖子数。
所以基本上我需要显示名称和声誉从用户
还有邮局的身份证
我想在join上应用max(user.reputation)和count(post.id),即
请帮忙。
更有用的是,应用join然后执行max和count,或者应用max和count然后执行join。

uyhoqukh

uyhoqukh1#

问题陈述:查找具有最大信誉的用户的displayname和帖子数。
首先利用关系“user”找到信誉度最大的用户的显示名
然后应用带有关系“post”的join来收集该max用户的所有post。。然后根据id应用分组并进行计数。
下面的代码将帮助您实现目标

  1. User = LOAD 'path' USING PigStorage(',') as (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
  2. Post = LOAD 'path' USING PigStorage(',') as (id:int, post_type:int, creationdate:chararray,score:int, viewcount:int, ownerus)er_id:int, title:chararray, answercount:chararray);
  3. User_grp = GROUP User BY id;
  4. User_each = FOREACH User_grp
  5. {
  6. User_order = ORDER User BY reputation DESC;
  7. User_limit = LIMIT User_order 1;
  8. User_nested = FOREACH User_limit GENERATE id,displayname;
  9. GENERATE flatten(user_nested) as (user_id,displayname);
  10. };
  11. User_join = JOIN User_each by user_id, Post by id;
  12. User_grouping = GROUP User_join BY user_id;
  13. User_output = FOREACH User_grouping GENERATE group as user_id, MAX(user_join.displayname) as displayname, COUNT(user_join.post_type) as post_cnts;
展开查看全部

相关问题