Random java.io.FileNotFoundException jobcache error

rqdpfwrv  posted on 2021-06-04 in Hadoop

I'm using mrjob to run a Hadoop job on Elastic MapReduce, and the job keeps crashing at random points.
The data looks like this (tab-separated):

279391888       261151291       107.303163      35.468534
279391888       261115099       108.511726      35.503008
279391888       261151290       104.881560      35.278487
279391888       261151292       109.732004      35.659141
279391888       261266862       108.507754      35.434581
279391888       1687590146      59.118796       19.931201
279391888       269450882       58.909985       19.914108

The underlying MapReduce job is very simple:

from mrjob.job import MRJob
import numpy as np

class CitypathsSummarize(MRJob):
  def mapper(self, _, line):
    # each line is: orig dest minutes dist (tab-separated)
    orig, dest, minutes, dist = line.split()
    minutes = float(minutes)
    dist = float(dist)
    if minutes < .001:
      # guard against divide-by-(near-)zero travel times
      yield "crap", 1
    else:
      yield orig, dist/minutes

  def reducer(self, orig, speeds):
    # emit the mean speed per origin
    speeds = list(speeds)
    mean = np.mean(speeds)
    yield orig, mean

if __name__ == "__main__":
  CitypathsSummarize.run()
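
For reference, a more defensive variant of the same job would count and skip records that don't parse as exactly four tab-separated fields, instead of letting the mapper raise. This is only a sketch; the CitypathsSummarizeSafe name and the "malformed" counter are illustrative and not part of the original job:

from mrjob.job import MRJob
import numpy as np

class CitypathsSummarizeSafe(MRJob):
  def mapper(self, _, line):
    # Skip (and count) any record that is not exactly four
    # tab-separated fields of the expected types.
    fields = line.split('\t')
    if len(fields) != 4:
      yield "malformed", 1
      return
    orig, dest, minutes, dist = fields
    try:
      minutes = float(minutes)
      dist = float(dist)
    except ValueError:
      yield "malformed", 1
      return
    if minutes < .001:
      yield "crap", 1
    else:
      yield orig, dist / minutes

  def reducer(self, orig, speeds):
    # float() keeps the output a plain JSON-serializable number
    yield orig, float(np.mean(list(speeds)))

if __name__ == "__main__":
  CitypathsSummarizeSafe.run()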

I run it with the following command, using the default mrjob.conf (my AWS keys are set in the environment):

$ python summarize.py -r emr --ec2-instance-type c1.xlarge --num-ec2-instances 4 s3://citypaths/chicago-v4/ > chicago-v4-output.txt
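
Before going to EMR, the same script can also be smoke-tested with mrjob's local runner on a slice of the data (sample.txt here is a hypothetical file):

$ python summarize.py -r local sample.txt > sample-output.txt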

When I run it on a small dataset, it completes fine. When I run it on the full dataset (roughly 10 GiB), I get an error like the following, but never at the same point twice:

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)
Terminating job flow: j-KCPTKZR5OX6D
Traceback (most recent call last):
  File "summarize.py", line 32, in <module>
    CitypathsSummarize.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 545, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 561, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 631, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 490, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1048, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1830, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-KCPTKZR5OX6D failed with status SHUTTING_DOWN: Shut down as step failed
Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-KCPTKZR5OX6D/task-attempts/attempt_201301211911_0001_m_000151_3/syslog):
java.io.FileNotFoundException: /mnt2/var/lib/hadoop/mapred/taskTracker/jobcache/job_201301211911_0001/attempt_201301211911_0001_m_000018_4/output/spill0.out
(while reading from s3://citypaths/chicago-v4/1613640660)

I have tried this twice now; the first run died after 45 minutes, the second after four hours, each time on a different input file. I've checked both files and found nothing wrong with either of them.
What baffles me is that the task somehow fails to find a spill file (an intermediate map output written to local disk) that it wrote itself.
EDIT:
I ran the job again and it died after a few hours, this time with a different error message:

Probable cause of failure (from s3://mrjob-093c9ef589d9f262/tmp/logs/j-3GGW2TSIKKW5R/task-attempts/attempt_201301310511_0001_m_001810_0/syslog):
Status Code: 403, AWS Request ID: 9E9E748A55BC6A58, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: Ky+HVYZ8RsC3l5f9N3LTwyorY9bbqEnc4tRD/r/xfAHYP/eiQrjjcpmIDNY2eoDo
(while reading from s3://citypaths/chicago-v4/1439606131)
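
The RequestTimeTooSkewed message is S3 rejecting requests because the node's clock has drifted too far from AWS's. One way to check for drift from any machine is to compare the local clock against the Date header of an S3 response; a minimal Python 2 sketch (the URL and function name are illustrative):

import time
import urllib2
from email.utils import parsedate_tz, mktime_tz

def s3_clock_skew(url="https://s3.amazonaws.com"):
  # Every HTTP response, even an error page, carries a Date header
  # stamped by the server's clock; compare it against the local clock.
  try:
    headers = urllib2.urlopen(url).info()
  except urllib2.HTTPError as e:
    headers = e.info()
  server_time = mktime_tz(parsedate_tz(headers["Date"]))
  return time.time() - server_time

if __name__ == "__main__":
  print("clock skew vs S3: %+.1f seconds" % s3_clock_skew())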
