COUNT aggregation not working in Pig

cuxqih21 posted on 2021-06-24 in Pig

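I'm new to Apache Pig and have run into a problem with a Pig script. I want to count the number of employees reporting to each manager, but I don't think this code gives the correct output. I'd appreciate your help.
Here is the source data file: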

  7369,SMITH,CLERK,7902,1980-12-17,800.00,NULL,20
  7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30
  7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30
  7566,JONES,MANAGER,7839,1981-04-02,2975.00,NULL,20
  7654,MARTIN,SALESMAN,7698,1981-09-28,1250.00,1400.00,30
  7698,BLAKE,MANAGER,7839,1981-05-01,2850.00,NULL,30
  7782,CLARK,MANAGER,7839,1981-06-09,2450.00,NULL,10
  7788,SCOTT,ANALYST,7566,1982-12-09,3000.00,NULL,20
  7839,KING,PRESIDENT,NULL,1981-11-17,5000.00,NULL,10
  7844,TURNER,SALESMAN,7698,1981-09-08,1500.00,0.00,30
  7876,ADAMS,CLERK,7788,1983-01-12,1100.00,NULL,20
  7900,JAMES,CLERK,7698,1981-12-03,950.00,NULL,30
  7902,FORD,ANALYST,7566,1981-12-03,3000.00,NULL,20
  7934,MILLER,CLERK,7782,1982-01-23,1300.00,NULL,10

Here is the code:

  data_mgr = load '/users/Desktop/Employees.rtf' using
  PigStorage(',') as (empno:int, empname:chararray, job:chararray,
  mgr:int, hiredate:chararray, sal:float, comm:float, dept:int);
  data_emp = load '/users/Desktop/Employees.rtf' using
  PigStorage(',') as
  (empno:int, empname:chararray, job:chararray, mgr:int,
  hiredate:chararray, sal:float, comm:float, dept:int);
  joined = join data_mgr by mgr, data_emp by empno;
  select1 = foreach joined generate data_mgr::empno as mgrid,
  data_mgr::empname as mgrname, data_emp::empno as empno;
  grouped = group select1 by ($0, $1);
  select2 = foreach grouped generate group, COUNT(select1) as
  no_of_reportees;
  ordered = order select2 by no_of_reportees desc;
  dump ordered;

wlsrxk51 #1

Try this:

  emp_data = LOAD '/users/Desktop/Employees.rtf' USING PigStorage(',') AS (empno:int, empname:chararray, job:chararray, mgrid:int, hiredate:chararray, sal:float, comm:float, dept:int);
  -- One group per manager id; each group holds that manager's direct reports.
  mgr_group = GROUP emp_data BY mgrid;
  emp_count = FOREACH mgr_group GENERATE group AS mgr_id, COUNT(emp_data) AS Count;
  emp_count_ordered = ORDER emp_count BY Count DESC;
  DUMP emp_count_ordered;
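
As a cross-check on what the grouping above should produce, here is a minimal Python sketch (not Pig) that applies the same group-by-manager count to the sample rows from the question. The names `ROWS`, `parse` and `reportees` are made up for this illustration, and it assumes the literal `NULL` in the manager column is treated as a missing value.

```python
from collections import Counter

# Sample rows copied from the question's data file.
ROWS = """\
7369,SMITH,CLERK,7902,1980-12-17,800.00,NULL,20
7499,ALLEN,SALESMAN,7698,1981-02-20,1600.00,300.00,30
7521,WARD,SALESMAN,7698,1981-02-22,1250.00,500.00,30
7566,JONES,MANAGER,7839,1981-04-02,2975.00,NULL,20
7654,MARTIN,SALESMAN,7698,1981-09-28,1250.00,1400.00,30
7698,BLAKE,MANAGER,7839,1981-05-01,2850.00,NULL,30
7782,CLARK,MANAGER,7839,1981-06-09,2450.00,NULL,10
7788,SCOTT,ANALYST,7566,1982-12-09,3000.00,NULL,20
7839,KING,PRESIDENT,NULL,1981-11-17,5000.00,NULL,10
7844,TURNER,SALESMAN,7698,1981-09-08,1500.00,0.00,30
7876,ADAMS,CLERK,7788,1983-01-12,1100.00,NULL,20
7900,JAMES,CLERK,7698,1981-12-03,950.00,NULL,30
7902,FORD,ANALYST,7566,1981-12-03,3000.00,NULL,20
7934,MILLER,CLERK,7782,1982-01-23,1300.00,NULL,10
"""


def parse(line):
    """Return (empno, empname, mgrid); assumes 'NULL' means no manager."""
    fields = line.split(",")
    mgrid = None if fields[3] == "NULL" else int(fields[3])
    return int(fields[0]), fields[1], mgrid


employees = [parse(line) for line in ROWS.strip().splitlines()]

# Equivalent of GROUP ... BY mgrid followed by COUNT: one tally per manager id.
reportees = Counter(mgrid for _, _, mgrid in employees)

# Equivalent of ORDER ... BY Count DESC.
for mgrid, count in reportees.most_common():
    print(mgrid, count)
```

On this data, manager 7698 has 5 reports, 7839 has 3 and 7566 has 2; every other manager id has 1.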

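Note: you can additionally join against the initial dataset to get the manager's name.
Do you mean something like this? (I haven't tested it, though.)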

  -- Load the same file twice so the relation can be joined with itself.
  data_emp = load '/users/Desktop/Employees.rtf' using PigStorage(',') as (empno:int, empname:chararray, job:chararray, mgrid:int, hiredate:chararray, sal:float, comm:float, dept:int);
  data_mgr = load '/users/Desktop/Employees.rtf' using PigStorage(',') as (empno:int, empname:chararray, job:chararray, mgrid:int, hiredate:chararray, sal:float, comm:float, dept:int);
  -- Match each employee's mgrid to the manager's own empno.
  emp_mgr_join = join data_emp by mgrid, data_mgr by empno;
  emp_mgr_join_sub = foreach emp_mgr_join generate data_mgr::empno as mgrid, data_mgr::empname as mgrname, data_emp::empno as empno;
  -- Group on id and name together so the name comes out as a scalar, not a bag.
  emp_mgr_grouped = group emp_mgr_join_sub by (mgrid, mgrname);
  emp_mgr_count = foreach emp_mgr_grouped generate FLATTEN(group) AS (mgr_id, mgr_name), COUNT(emp_mgr_join_sub) as no_of_reportees;
  ordered = order emp_mgr_count by no_of_reportees desc;
  dump ordered;
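
To show what a self-join with manager names should yield on the question's data, here is a small Python sketch (not Pig). `EMPLOYEES`, `names`, `counts` and `report` are hypothetical names for this example, and only the `empno`, `empname` and manager columns from the source file are kept. It assumes an inner join, so KING, who has no manager, contributes no row as a reportee.

```python
from collections import Counter

# (empno, empname, mgrid) taken from the question's data; None stands for NULL.
EMPLOYEES = [
    (7369, "SMITH", 7902), (7499, "ALLEN", 7698), (7521, "WARD", 7698),
    (7566, "JONES", 7839), (7654, "MARTIN", 7698), (7698, "BLAKE", 7839),
    (7782, "CLARK", 7839), (7788, "SCOTT", 7566), (7839, "KING", None),
    (7844, "TURNER", 7698), (7876, "ADAMS", 7788), (7900, "JAMES", 7698),
    (7902, "FORD", 7566), (7934, "MILLER", 7782),
]

# Lookup side of the self-join: empno -> empname.
names = {empno: empname for empno, empname, _ in EMPLOYEES}

# Inner join on mgrid = empno, then count reportees per manager.
counts = Counter(mgrid for _, _, mgrid in EMPLOYEES if mgrid in names)

# Attach the manager name and sort by count descending, then by id.
report = sorted(
    ((mgrid, names[mgrid], n) for mgrid, n in counts.items()),
    key=lambda row: (-row[2], row[0]),
)

for row in report:
    print(row)
```

The first rows are `(7698, 'BLAKE', 5)`, `(7839, 'KING', 3)` and `(7566, 'JONES', 2)`, which is a handy reference when checking the Pig output.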