Pig join on the Cloudera VM

Asked by wixjitnu on 2021-06-03 in Hadoop

I am trying to perform a simple join in Apache Pig. The dataset I am using comes from http://www.dtic.upf.edu/~ocelma/musicrecommendationdataset/lastfm-1k.html
This is what I do in the Pig shell:

profiles = LOAD '/user/hadoop/tests/userid-profile.tsv' AS (id,gender,age,country, dreg);
songs = LOAD '/user/hadoop/tests/userid-timestamp-artid-artname-traid-traname.tsv' AS (userID, timestamp, artistID, artistName, trackID, trackName);
prDACH = filter profiles by country=='Germany' or country=='Austria' or country=='Switzerland';
songsDACH = join songs by userID, prDACH by id;
dump songsDACH;
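One thing worth checking before digging into the failure itself: the LOAD statements above declare no field types and no storage function, so every field defaults to bytearray and the default delimiter is assumed. A sketch of the same script with an explicit PigStorage tab delimiter and typed schemas (the field names and types are assumptions based on the Last.fm 1K dataset description, not taken from the original post):

```pig
-- Hypothetical typed version of the script above; field names and types
-- are assumptions based on the dataset's documentation.
profiles = LOAD '/user/hadoop/tests/userid-profile.tsv'
           USING PigStorage('\t')
           AS (id:chararray, gender:chararray, age:int,
               country:chararray, dreg:chararray);
songs    = LOAD '/user/hadoop/tests/userid-timestamp-artid-artname-traid-traname.tsv'
           USING PigStorage('\t')
           AS (userID:chararray, timestamp:chararray, artistID:chararray,
               artistName:chararray, trackID:chararray, trackName:chararray);
prDACH   = FILTER profiles BY country == 'Germany'
                        OR country == 'Austria'
                        OR country == 'Switzerland';
songsDACH = JOIN songs BY userID, prDACH BY id;
DUMP songsDACH;
```

Typing the join keys as chararray at least rules out a silent bytearray-vs-typed key mismatch; it does not by itself fix a task that dies under memory pressure.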

Here is part of the log:

2013-04-20 01:01:33,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2013-04-20 01:02:39,802 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 2% complete
2013-04-20 01:13:23,943 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 37% complete
2013-04-20 01:14:48,704 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 39% complete
2013-04-20 01:15:40,166 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 41% complete
2013-04-20 01:15:41,142 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2013-04-20 01:15:41,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1366403809583_0023 has failed! Stop running all dependent jobs
2013-04-20 01:15:41,143 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2013-04-20 01:15:43,117 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1366403809583_0023_m_000019_0 Info:Container killed by the ApplicationMaster.

When I use a small sample of the songs, the join executes without any problem. Any ideas?
It looks like there is a problem with the Hadoop/HDFS setup, since I can perform the join using a subset (100,000 rows) of the song data.
PS: I am using the Cloudera demo VM.


r6vfmomb1#

You should look at the logs of the task attempts: point your browser at the job tracker ( http://[your-jobtracker-node]:50030 ), find the failed job, locate a failed task attempt, and browse through its log; there you will see the actual exception. I suspect it may be related to the task heap size configuration, but you have to look at the exception first and only then come up with a solution (a configuration change, etc.).
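If the exception in the task log does turn out to be an OutOfMemoryError, one common remedy is to raise the child task JVM heap before running the script. A minimal sketch, assuming the classic MapReduce property name used by Hadoop 1.x / CDH-era clusters (verify the exact property against your VM's mapred-site.xml; newer Hadoop 2.x clusters use mapreduce.map.java.opts and mapreduce.reduce.java.opts instead):

```pig
-- Assumed property name for older Hadoop; the 1024m value is a guess to
-- tune against the demo VM's available RAM.
SET mapred.child.java.opts '-Xmx1024m';
```

These SET statements go at the top of the Pig script (or are typed into the Grunt shell) and apply to the MapReduce jobs Pig launches for the join.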
