Hadoop's connection to the server suddenly stops

83qze16e · posted 2021-06-03 · in Hadoop

I am running a Hadoop job that takes several hours, but for some reason (unknown to me) it suddenly stops with the following error:

HadoopTree.mapredUtils.JobResultException: //0/0/0/0 could not be properly divided by SplitSamples
    at HadoopTree.TTrain.TreeTrainer_sp$SplitSamplesListener.stateChanged(TreeTrainer_sp.java:335)
    at HadoopTree.mapredUtils.JobResultManager.poll(JobResultManager.java:76)
    at HadoopTree.TTrain.TreeTrainer_sp.developTree(TreeTrainer_sp.java:577)
    at HadoopTree.apps.MainTrainTree.run(MainTrainTree.java:64)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at HadoopTree.apps.MainTrainTree.main(MainTrainTree.java:26)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at HadoopTree.apps.Driver.main(Driver.java:37)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:192)

I checked the logs and found that, just before the error occurred, these normal syslog messages had been written to the secondary namenode's log file:

2015-02-18 08:35:11,834 INFO org.apache.hadoop.security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
2015-02-18 08:35:12,010 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2015-02-18 08:35:12,014 WARN org.apache.hadoop.conf.Configuration: user.name is deprecated. Instead, use mapreduce.job.user.name
2015-02-18 08:35:12,060 WARN org.apache.hadoop.conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-02-18 08:35:12,089 INFO org.apache.hadoop.mapred.Task: Task:attempt_201502172051_0618_r_000003_0 is done. And is in the process of commiting
2015-02-18 08:35:12,091 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201502172051_0618_r_000003_0' done.

When the error occurred, the following was written to the secondary namenode's log file:

2015-02-18 09:55:08,962 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 0 time(s).
2015-02-18 09:55:09,963 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 1 time(s).
2015-02-18 09:55:10,963 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 2 time(s).
2015-02-18 09:55:11,964 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 3 time(s).
2015-02-18 09:55:12,965 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 4 time(s).
2015-02-18 09:55:13,965 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 5 time(s).
2015-02-18 09:55:14,966 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 6 time(s).
2015-02-18 09:55:15,966 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 7 time(s).
2015-02-18 09:55:16,967 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 8 time(s).
2015-02-18 09:55:17,968 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 9 time(s).
2015-02-18 09:55:17,968 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint: 
2015-02-18 09:55:17,968 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:932)
    at org.apache.hadoop.ipc.Client.call(Client.java:908)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
    at com.sun.proxy.$Proxy4.getEditLogSize(Unknown Source)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:417)
    at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:207)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1025)
    at org.apache.hadoop.ipc.Client.call(Client.java:885)
    ... 4 more

2015-02-18 10:00:18,970 INFO org.apache.hadoop.ipc.Client: Retrying connect

I also found this error in the namenode's log file:

java.io.IOException: File /jobtracker/jobsInfo/job_201502172051_0597.info could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)

dy2hfwbg #1

Looking at the exception in the namenode log, it appears the namenode could not find enough datanodes (at least 1 is required) to replicate the blocks of the file /jobtracker/jobsInfo/job_201502172051_0597.info. Check the datanode logs to see whether there is any problem there.
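
As a first check, it may help to confirm that the HDFS daemons are still running and that the namenode actually sees live datanodes with free space. A minimal sketch, assuming a Hadoop 1.x setup where the hadoop command is on the PATH and the logs live under $HADOOP_HOME/logs (the exact log file names depend on your installation):

# list the Java daemons on each node; NameNode, DataNode and SecondaryNameNode should all appear
jps

# summary of live/dead datanodes and remaining capacity, as seen by the namenode
hadoop dfsadmin -report

# check HDFS for missing or under-replicated blocks
hadoop fsck /

# inspect the datanode log around the time of the failure
# (file name pattern is an assumption - adjust to your user name and hostname)
tail -n 200 $HADOOP_HOME/logs/hadoop-*-datanode-*.log

If jps shows no DataNode process, or dfsadmin -report shows 0 live datanodes or no remaining capacity, that would explain the "could only be replicated to 0 nodes" error. The "Connection refused" retries to localhost:54310 in the secondary namenode log also suggest checking whether the NameNode process itself is still listening on that port.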
