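My data currently looks like the sample below, but I want it to show a rank that restarts whenever the Product_id (pid) field changes. My script is below. I tried the RANK operator, with and without DENSE, but still did not get the desired output.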
trans_c1 = LOAD '/mypath/data_file.csv' using PigStorage(',') as (date,Product_id);
(DATE,Product id)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
The final output should look like this: the rank column resets to 1 each time the product id changes. Is it possible to do this in Pig?
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
1 Answer
This can be solved with the piggybank functions Stitch and Over. It can also be solved with DataFu's Enumerate function.
Script using the piggybank functions:
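A minimal sketch of the piggybank approach, not a definitive script. It assumes that piggybank.jar lives at the placeholder path /path/to/piggybank.jar, that both columns are loaded as chararray, and that the number produced by Over ends up as the last field after stitching. The relation names by_product, numbered and ranked are made up for illustration.

```pig
-- Placeholder path (assumption): point this at your own piggybank.jar.
REGISTER /path/to/piggybank.jar;
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
-- 'int' declares the type of the value that Over returns.
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');

trans_c1 = LOAD '/mypath/data_file.csv' USING PigStorage(',')
           AS (date:chararray, Product_id:chararray);

-- One bag of rows per product id.
by_product = GROUP trans_c1 BY Product_id;

-- Over(..., 'row_number') numbers the rows of each bag starting at 1;
-- Stitch appends that number to the matching row of the same bag.
numbered = FOREACH by_product GENERATE
           FLATTEN(Stitch(trans_c1, Over(trans_c1, 'row_number')));

-- Positional references (assumed layout): $0 = date, $1 = Product_id, $2 = row number.
ranked = FOREACH numbered GENERATE $2 AS row_num, $0 AS date, $1 AS Product_id;
DUMP ranked;
```

Because the numbering is computed inside each group, it restarts at 1 for every Product_id. The order in which the groups themselves are printed is not guaranteed, so add an ORDER BY if the output order matters.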
Script using DataFu's Enumerate function:
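A minimal sketch of the DataFu approach under similar assumptions: the jar path /path/to/datafu.jar is a placeholder, the class is referenced as datafu.pig.bags.Enumerate (the package name used by the com.linkedin.datafu 1.x releases), and the relation names are illustrative.

```pig
-- Placeholder path (assumption): use the DataFu jar you downloaded.
REGISTER /path/to/datafu.jar;
-- The '1' argument makes the numbering start at 1 instead of 0.
DEFINE Enumerate datafu.pig.bags.Enumerate('1');

trans_c1 = LOAD '/mypath/data_file.csv' USING PigStorage(',')
           AS (date:chararray, Product_id:chararray);

-- One bag of rows per product id.
by_product = GROUP trans_c1 BY Product_id;

-- Enumerate appends an index to every tuple in the bag.
numbered = FOREACH by_product GENERATE FLATTEN(Enumerate(trans_c1));

-- Positional references: $0 = date, $1 = Product_id, $2 = appended index.
ranked = FOREACH numbered GENERATE $2 AS row_num, $0 AS date, $1 AS Product_id;
DUMP ranked;
```

As with the piggybank version, the index restarts for each Product_id because Enumerate is applied to one group's bag at a time.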
The DataFu jar file can be downloaded from the Maven repository: http://search.maven.org/#search%7cga%7c1%7cg%3a%22com.linkedin.datafu%22
Output:
References:
Implementing a row number function in Apache Pig
Usage of the Apache Pig RANK function