通过连接两个数据集查找平均值

1rhkuytd  于 2021-06-25  发布在  Pig
关注(0)|答案(1)|浏览(302)

我有两个数据集,

EmployeeDetail(data set 1):- 
   id  
   name
   gender
   location 

SalaryDetail(data set 2):-
   id
   salary

我需要加入两者,并找出平均工资的男性和女性在每个位置。所以我试着遵循代码。

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as 
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as 
(id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;                                                                         
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation { 
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group, 
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};

但这是错误的

<line 6, column 22>  Syntax error, unexpected symbol at or near 
'JoinedEmpDetail'

有谁能帮我一下,我在哪里犯的错,或者如何正确地做?
为了更清楚地了解我的要求,我提供了一些示例数据集。
empdetail.txt文件

1   Biswa   Male    Bangalore
12  Bratati Mahapatra   Female  Chennai
2   Bibhu kalyan    Male    Bangalore
3   Chinta  Male    Mumbai
10  Amrit Anand Male    Bangalore
11  Sateesh panda   Male    Bangalore
4   Kirti Kumar Male    Mumbai
6   Shruthi Female  Chennai
7   Vijay   Male    Chennai
5   Bibhu   Male    Chennai
9   Bratati  Mohanty    Female  Bangalore
8   Rupa Mahapatra  Female  Bangalore
13  Salini  Female  Mumbai
14  Priyanka Chopra Female  Mumbai

empsalary.txt文件

1   10000
12  12000
2   15900
3   9000
10  8000
11  13400
4   7600
6   22000
7   17000
5   16800
9   9800
8   10000
13  11000
14  12500

我需要的最终结果是:

Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
sulc1iza

sulc1iza1#

你可以用简单的方法解决这个问题 foreach stmt 所以不要使用嵌套的foreach stmt。 Group command 不会在嵌套的foreach内工作,其限制在pig内。在嵌套的foreach中只允许几个命令(cross、distinct、filter、foreach、limit和order by)。
你能这样改剧本吗?

EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);                                     
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;

输出:

(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)

相关问题