I'm using mrjob to run a Hadoop job on Elastic MapReduce, and the job keeps crashing at random points.
The data looks like this (tab-delimited):
279391888 261151291 107.303163 35.468534
279391888 261115099 108.511726 35.503008
279391888 261151290 104.881560 35.278487
279391888 261151292 109.732004 35.659141
279391888 261266862 108.507754 35.434581
279391888 1687590146 59.118796 19.931201
279391888 269450882 58.909985 19.914108
The underlying MapReduce is very simple:
from mrjob.job import MRJob
import numpy as np

class CitypathsSummarize(MRJob):

    def mapper(self, _, line):
        orig, dest, minutes, dist = line.split()
        minutes = float(minutes)
        dist = float(dist)
        if minutes < .001:
            yield "crap", 1
        else:
            yield orig, dist / minutes

    def reducer(self, orig, speeds):
        speeds = list(speeds)
        mean = np.mean(speeds)
        yield orig, mean

if __name__ == "__main__":
    CitypathsSummarize.run()
When I run it, I use the following command with the default mrjob.conf (my keys are set in the environment):
$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt
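For what it's worth, I sanity-check the job locally on a small slice of the data before sending it to EMR, with something like the following (sample.tsv is just a stand-in name for a small input file):

$ python summarize.py -r local sample.tsv > sample-output.txt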
When I run it on a small dataset, it finishes fine. When I run it on the full corpus (roughly 10 GiB), I get errors like this one (but never at the same point twice!):
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
File "summarize.py", line 32, in <module>
CitypathsSummarize.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
mr_job.execute()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
self.run_job()
File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
runner.run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
self._run()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
self._wait_for_job_to_complete()
File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
I have tried this twice now; the first run died after 45 minutes, and this one died after four hours. Each time it died on a different file. I have checked both files, and there is nothing wrong with either of them.
Somehow it is failing to find a spill file that it wrote itself, which has me baffled.
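One thing I considered, to rule out a single malformed record silently killing a mapper, is hardening the mapper so that bad lines are counted instead of raising. A sketch of the idea (a drop-in replacement for the mapper above; I have not actually run this version):

    def mapper(self, _, line):
        fields = line.split()
        # count malformed records instead of letting the tuple unpack raise
        if len(fields) != 4:
            yield "badline", 1
            return
        orig, dest, minutes, dist = fields
        try:
            minutes = float(minutes)
            dist = float(dist)
        except ValueError:
            yield "badline", 1
            return
        if minutes < .001:
            yield "crap", 1
        else:
            yield orig, dist / minutes

A nonzero "badline" count in the output would point at the input data rather than at Hadoop itself.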
EDIT:
I ran the job again, and after a few hours it died once more, this time with a different error message.
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)