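I want to achieve the following in Pig. I have a set of sample records like this.
Note that the EffectiveDate column is sometimes empty, and that its value differs between records for the same CustomerID.
As output, I want one record per CustomerID: the one whose EffectiveDate is the maximum.
This is how I am currently doing it in Pig: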
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);
--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;
--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;
--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);
finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;
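Basically, I group the original data to find the maximum EffectiveDate for each CustomerID. I then join these grouped records back to the original dataset to get the same record with the maximum EffectiveDate, this time along with the other fields such as CustomerName, Age, and Gender. The dataset is very large, so this approach takes a lot of time. Is there a better way?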
1 Answer
Input:
Pig script:
Output:
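One way to avoid the second join is to sort each group inside a nested FOREACH and keep only the top record. The following is a minimal sketch of that idea, not the answerer's original script. It reuses the LOAD statement from the question and assumes that EffectiveDate is stored in a format that sorts correctly as a string, such as yyyy-MM-dd (the question does not state the format). The aliases `ordered`, `newest`, and `latest_per_customer` are illustrative names.

```pig
-- Load the data with the same schema as in the question.
customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);

-- Group customer data by CustomerID.
customerdata_grpd = GROUP customerdata BY CustomerID;

-- Within each group, sort by EffectiveDate descending and keep the first record.
-- Assumption: EffectiveDate compares correctly as a string (e.g. yyyy-MM-dd).
latest_per_customer = FOREACH customerdata_grpd {
    ordered = ORDER customerdata BY EffectiveDate DESC;
    newest = LIMIT ordered 1;
    GENERATE FLATTEN(newest);
};

DUMP latest_per_customer;
```

Because the full record is carried through the group, the join back to the original data is no longer needed. Pig orders nulls before non-null values, so with a descending sort a record with an empty EffectiveDate should only be chosen when the customer has no dated record at all. Unlike the join version, this keeps exactly one record per CustomerID even when several records share the maximum date, so the results can differ in those cases.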