Pig: need to find the MAX

mmvthczy  posted on 2021-07-15  in  Pig

I am new to Pig and working on a problem where I need to find the player with the maximum weight in this dataset. Here is a sample of the data:

    id, weight, id, year, triples
    (bayja01,210,bayja01,2005,6)
    (crawfca02,225,crawfca02,2005,15)
    (damonjo01,205,damonjo01,2005,6)
    (dejesda01,190,dejesda01,2005,6)
    (eckstda01,170,eckstda01,2005,7)

Here is my Pig script:

    batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
    realbatters = FILTER batters BY $1==2005;
    triphitters = FILTER realbatters BY $9>5;
    tripids = FOREACH triphitters GENERATE $0 AS id, $1 AS YEAR, $9 AS Trips;
    names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
        using PigStorage(',');
    weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
    get_ids = JOIN weights BY (id), tripids BY (id);
    wts = FOREACH get_ids GENERATE MAX(get_ids.weight) as wgt;
    DUMP wts;

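Of course, the second-to-last line doesn't work. It tells me I must use an explicit cast. I have figured out the filtering and so on, but I can't figure out how to get the final answer.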

xlpyo6sf  #1

The MAX function in Pig takes a bag of values and returns the highest value in the bag. To create the bag, you must first GROUP your data:

    get_ids = JOIN weights BY id, tripids BY id;
    -- Drop columns we no longer need and rename for ease
    just_ids_weights = FOREACH get_ids GENERATE
        weights::id AS id,
        weights::weight AS weight;
    -- Group the data by id value
    gp_by_ids = GROUP just_ids_weights BY id;
    -- Find maximum weight by id
    wts = FOREACH gp_by_ids GENERATE
        group AS id,
        MAX(just_ids_weights.weight) AS wgt;

If you want the maximum weight across all of the data, you can use GROUP ALL:

    gp_all = GROUP just_ids_weights ALL;
    wts = FOREACH gp_all GENERATE
        MAX(just_ids_weights.weight) AS wgt;
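
GROUP ALL returns only the maximum value, not the player it belongs to. Since the question asks for the player, one possible way to get the id as well is to sort by weight and keep the top row. The following is a minimal sketch, not a tested solution: it assumes the weight column contains integer values, and the aliases typed, ranked and heaviest are illustrative names that do not appear in the original script.

```pig
-- Assumes just_ids_weights as defined in the answer above.
-- Fields loaded without a schema default to bytearray, so cast weight
-- to int before sorting; otherwise the sort is not numeric.
typed = FOREACH just_ids_weights GENERATE id, (int) weight AS weight;

-- Sort heaviest first, then keep only the first row.
ranked = ORDER typed BY weight DESC;
heaviest = LIMIT ranked 1;   -- returns one row even if several players tie

DUMP heaviest;
```

If ties matter, LIMIT 1 is not enough, because it keeps an arbitrary one of the tied players. In that case the GROUP ALL result would have to be compared against each row's weight instead.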