pig如何基于一个不相等的条件组合两个文件

vuv7lop3  于 2021-07-15  发布在  Pig
关注(0)|答案(1)|浏览(409)

我正在努力寻找一年中在大多数球队效力的球员。我有一个文件的模式是playerid,yearid,teamid。我带了两次文件来尝试加入playerid和yearid相等但teamid不相等的地方。我在Pig里怎么样?我可以在join语句中执行<>吗?我需要对他们进行分组和比较吗?我知道我可以加入sql,因为playerid和yearid相等,teamid不相等,但不知道如何在pig中这样做。
我试过了,但语法不对”

batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS

    (id:chararray,yearid:int, teamid:chararray);
batters1 =  LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS                `   (id:chararray,yearid:int, teamid:chararray);
batter_fltr = FILTER batters BY  (yearid > 0) AND (teamid> ' ');
batter1_fltr = FILTER batters1 BY  (yearid>0) AND (teamid> ' ');
multi_playr = JOIN batter_fltr BY (yearid,id), batter1_fltr BY(yearid,id) ,LEFT OUTER BY(teamid);
wvyml7n5

wvyml7n51#

你想找到一年中在最多球队效力的球员。因此,你应该按球员和年份分组,然后你可以计算每个球员每年的球队数量。最后,按计数降序排列数据-第一个结果就是你的答案。无需加载两次数据或执行联接。

batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',') AS
    (id:chararray, yearid:int, teamid:chararray);

-- Apply filtering as needed here

teams_per_year = FOREACH (GROUP batters BY (id, yearid))
    GENERATE
    group.id AS id,
    group.yearid AS yearid,
    COUNT(batters.teamid) AS num_teams;

ordered_results = ORDER teams_per_year BY num_teams DESC;

DUMP ordered_results;

如果需要不同数量的团队,请添加嵌套的 DISTINCT :

teams_per_year = FOREACH (GROUP batters BY (id, yearid)) {
    dist_teams = DISTINCT batters.teamid;
    GENERATE
    group.id AS id,
    group.yearid AS yearid,
    COUNT(dist_teams) AS num_teams;
}

相关问题