Hi, I'm fairly new to Pig programming and have run into a problem I can't solve:
I have two datasets
A: (accountid:chararray, title:chararray, genre:chararray)
("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")
B: (accountid:chararray, title:chararray, genre:chararray)
("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")
The result I want is
(accountid:chararray, {(),(),...})
(A123, {("A123", "Harry Potter", "Action/Adventure"),
("A123", "Sherlock Holmes", "Mystery"),
("A123", "Divergent", "Action"),
("A123", "Downton Abbey", "Drama")
})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama"),
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Currently I am doing:
ans = JOIN a BY accountid, b BY accountid;
But the result looks like
Schema: (accountid:chararray, {(accountid:chararray, title:chararray, genre:chararray), ...})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama")}
"B456", {
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Do you know what I am doing wrong?
1 Answer
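Try this:
JOIN simply concatenates matching rows from the two relations as they are. You want to accomplish two things:
Group all the rows in each relation that belong to the same account
Join the two "grouped" relations (keeping only the IDs that exist in both relations)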
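Both of these steps are performed by COGROUP. Note that COGROUP keeps IDs found in either relation by default; adding INNER after each BY clause keeps only the IDs present in both. The best explanation I have read is: http://joshualande.com/cogroup-in-pig/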
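As a rough sketch of the COGROUP step (the file names, the comma delimiter, and unquoted field values are assumptions for illustration, not details from the question):

```pig
-- Hypothetical inputs: comma-separated text files with unquoted fields.
a = LOAD 'a.csv' USING PigStorage(',')
    AS (accountid:chararray, title:chararray, genre:chararray);
b = LOAD 'b.csv' USING PigStorage(',')
    AS (accountid:chararray, title:chararray, genre:chararray);

-- One output row per accountid, holding one bag of rows from a and one from b.
-- INNER drops any accountid that has no rows on that side.
grouped = COGROUP a BY accountid INNER, b BY accountid INNER;

-- Print the schema of the grouped relation.
DESCRIBE grouped;
```

The grouped relation should have three fields: group (the accountid), a bag named a, and a bag named b.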
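Your relation will now contain the group key (the ID) and two bags (one from a, one from b), each holding the rows from the original relation. To "merge" them into a single bag, use the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs with a lot of useful things in it. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html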
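Putting the two pieces together might look something like the sketch below. The jar path, the input file names, and the titles alias are placeholders I made up; the BagConcat class name is the one DataFu ships as datafu.pig.bags.BagConcat.

```pig
-- Placeholder path: point this at the DataFu jar you actually have.
REGISTER /path/to/datafu.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();

-- Hypothetical inputs: comma-separated text files with unquoted fields.
a = LOAD 'a.csv' USING PigStorage(',')
    AS (accountid:chararray, title:chararray, genre:chararray);
b = LOAD 'b.csv' USING PigStorage(',')
    AS (accountid:chararray, title:chararray, genre:chararray);

-- Group both relations by accountid in a single step.
grouped = COGROUP a BY accountid INNER, b BY accountid INNER;

-- Merge the two bags of each group into one bag of tuples.
ans = FOREACH grouped GENERATE group AS accountid, BagConcat(a, b) AS titles;

DUMP ans;
```

Each row of ans should then carry the accountid and a single bag with that account's rows from both a and b. The order of tuples inside a bag is not guaranteed.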