I have a dataset of users and elements, and I want to find every pair of users that share at least one element. My data is structured as follows:
id element
--------------
1 a
1 b
1 b
2 b
3 a
4 c
For this example, I would generate the following tuples:
(1,2) // both have element "b" in common
(1,3) // both have element "a" in common
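I have written the Pig script below. It works at small scale, but when I ran it on just 1 million rows (~500 MB), I killed the job after 1.5 hours because it had already generated nearly 40 GB of data, which seems out of proportion to what I am trying to accomplish. I am new to Pig, so I am hoping this can be optimized. Any help would be greatly appreciated.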
-- load the data
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);
-- generate a copy to do a self join with
A = FOREACH mydata GENERATE user as user_2, element as element_2;
-- join them based on common tags
B = JOIN mydata BY element, A by element_2;
-- we only want the mapping in one direction, e.g. (1,2) is the same as (2,1)
C = FILTER B BY user < user_2;
-- we're only interested in the user ids
D = FOREACH C generate user, user_2;
-- remove any duplicate tuples
E = DISTINCT D;
STORE E INTO '/path/to/output';
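Note: this is a follow-up to my previous question about joining on any matching tuple values with Hadoop Pig, using a slightly different approach.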
1 Answer
If your input contains duplicates, it is best to filter them out first, because they cause a combinatorial explosion.
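To see why duplicates matter: an element that appears in n rows contributes n × n rows to a self join on that element. In the sample data, element b appears in three rows (1 b, 1 b, 2 b), so the join emits nine rows for it; after deduplication only two rows remain and the join emits four. A minimal sketch of the deduplication step, assuming the same path and schema as the question (the alias `uniq` is a name chosen here for illustration):

```pig
-- load the data (path and schema as in the question)
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);

-- drop repeated (user, element) rows before any self join;
-- 'uniq' is a hypothetical alias, not part of the original script
uniq = DISTINCT mydata;
```

The rest of the script would then read from `uniq` instead of `mydata`. This does not remove the quadratic growth for elements shared by many distinct users; it only stops repeated rows from inflating it further.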
Another thing you can try is grouping instead of joining. You get the result right away, though not as a list of pairs:
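The statements behind this grouping are not shown here. The schema reported for B in the output below (a single `totuple` field holding the group key and a bag of users) is consistent with a `TOTUPLE` over the group key and the projected user bag, so a reconstruction under that assumption might look like this:

```pig
-- load the data (path and schema as in the question)
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);

-- collect all rows that share an element into one bag per element
A = GROUP mydata BY element;

-- keep the element and the bag of user ids, wrapped in a single tuple field
-- (assumed form; the original statement is not shown)
B = FOREACH A GENERATE TOTUPLE(group, mydata.user);
```

Each record of B then lists every user that has a given element, which answers "who shares what" in one pass without materializing the pairs.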
| mydata | user:int | element:chararray |
|        | 1        | a                 |
|        | 3        | a                 |

| A | group:chararray | mydata:bag{:tuple(user:int,element:chararray)} |
|   | a               | {(1, a), (3, a)}                               |

| B | org.apache.pig.builtin.totuple_group_13:tuple(group:chararray,:bag{:tuple(user:int)}) |
|   | (a, {(1), (3)})                                                                       |
C = foreach B {
X = foreach $0 generate $0.$1;
Y = foreach $0 generate $0.$1;
F = CROSS X, Y ;
generate $0.group, flatten(F);
};
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POProject (Name: Project[bag][1] - scope-131 Operator Key: scope-131) children: null at []]: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.accumulateData(POCross.java:202)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.getNextTuple(POCross.java:116)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNextDataBag(PhysicalOperator.java:385)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:590)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNextDataBag(PORelationToExprProject.java:106)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:236)
at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
at org.apache.pig.PigServer.getExamples(PigServer.java:1238)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:831)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:802)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:381)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextTuple(POProject.java:476)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:592)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextDataBag(POProject.java:247)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
... 35 more
2014-03-20 01:28:57,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. ExecException