apache pig-group-by

r8uurelv  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(573)

我希望在pig中实现以下功能。我有一套这样的样本记录。

请注意,effectivedate列有时为空,并且对于同一customerid也不同。
现在,作为输出,我希望每个customerid有一条记录,其中effectivedate是最大值。

我现在使用pig的方式是:

  1. customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
  2. --Group customer data by CustomerID
  3. customerdata_grpd = GROUP customerdata BY CustomerID;
  4. --From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
  5. customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
  6. --Join the above with the original data so that we get the other details like CustomerName, Age etc.
  7. joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
  8. finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;

我基本上是对原始数据进行分组,以找到具有最大生效日期的记录。然后,我再次将这些“分组”记录与原始数据集连接起来,以获得具有max effective date的相同记录,但这次我还将获得其他数据,如customername、age和gender。这个数据集非常庞大,因此这种方法需要花费大量时间。有更好的方法吗?

s3fp2yjn

s3fp2yjn1#

输入:

  1. 1,John,28,M,1-Jan-15
  2. 1,John,28,M,1-Feb-15
  3. 1,John,28,M,
  4. 1,John,28,M,1-Mar-14
  5. 2,Jane,25,F,5-Mar-14
  6. 2,Jane,25,F,5-Jun-15
  7. 2,Jane,25,F,3-Feb-14

Pig脚本:

  1. customer_data = LOAD 'customer_data.csv' USING PigStorage(',') AS (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);
  2. customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;
  3. customer_data_grp_id = GROUP customer_data_fmt BY id;
  4. req_data = FOREACH customer_data_grp_id {
  5. customer_data_ordered = ORDER customer_data_fmt BY date DESC;
  6. req_customer_data = LIMIT customer_data_ordered 1;
  7. GENERATE FLATTEN(req_customer_data.id) AS id,
  8. FLATTEN(req_customer_data.name) AS name,
  9. FLATTEN(req_customer_data.gender) AS gender,
  10. FLATTEN(req_customer_data.effective_date) AS effective_date;
  11. };

输出:

  1. (1,John,M,1-Feb-15)
  2. (2,Jane,F,5-Jun-15)
展开查看全部

相关问题