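I'm trying to start an EMR cluster in our own VPC. I can launch it with the command in AWS, but if I specify our own VPC/subnet it fails to start the cluster (so we're not talking about jobs running on it; we're talking about bringing up the default cluster itself).
Obviously this must have something to do with the subnet and AWS's Hadoop (although it is not the common "no route to the internet gateway found in the main route table" error).
I can't determine the cause from the logs. It happens both from the command line and from the AWS web console.
We aren't running any custom actions or custom environment on the cluster.
Here are the route table details for the subnet: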
Destination Target
10.0.0.0/16 local
0.0.0.0/0 igw-2235d249
10.3.0.0/16 eni-b989b091
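Here is the command line used for the launch (removing --subnet lets the command succeed, but we need the cluster on this VPC to reach some specific resources):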
elastic-mapreduce --create \
  --alive \
  --name "BMVE on Subnet 0BF3BB23" \
  --instance-type m1.medium \
  --num-instances 3 \
  --key-pair hadoop \
  --subnet subnet-0bf3bb23 \
  --visible-to-all-users true
master.log file:
2014-03-31 18:24:48,848 INFO i-3e4ce71d: new instance started
2014-03-31 18:24:49,920 INFO i-3e4ce71d: bootstrap action 1 completed
2014-03-31 18:35:40,352 ERROR i-3e4ce71d: failed to start. hadoop JobTracker/NameNode process failed to launch.
1/controller log:
2014-03-31T18:24:48.849Z INFO Fetching file 's3://elasticmapreduce/bootstrap-actions/configure-hadoop'
2014-03-31T18:24:49.408Z INFO Working dir /mnt/var/lib/bootstrap-actions/1
2014-03-31T18:24:49.408Z INFO Executing /mnt/var/lib/bootstrap-actions/1/configure-hadoop --site-key-value io.file.buffer.size=65536
2014-03-31T18:24:49.917Z INFO Execution ended with ret val 0
2014-03-31T18:24:49.918Z INFO Execution succeeded
1/standard log:
1/system log:
Processing default file /home/hadoop/conf/hadoop-site.xml with overwrite io.file.buffer.size=65536
/home/hadoop/conf/hadoop-site.xml does not exist, assuming empty configuration
'io.file.buffer.size': default does not have key, appending value '65536'
Saved /home/hadoop/conf/hadoop-site.xml with overwrites. Original saved to /home/hadoop/conf/hadoop-site.xml.old
Daemon jobtracker log (filtered for WARN|ERROR):
2014-03-31 18:25:00,906 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
. . .
2014-03-31 18:25:08,059 WARN org.apache.hadoop.hdfs.DFSClient (Thread-18): DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1569)
. . .
2014-03-31 18:25:08,059 WARN org.apache.hadoop.hdfs.DFSClient (Thread-18): Error Recovery for block null bad datanode[0] nodes == null
2014-03-31 18:25:08,060 WARN org.apache.hadoop.hdfs.DFSClient (Thread-18): Could not get block locations. Source file "/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info" - Aborting...
2014-03-31 18:25:08,060 WARN org.apache.hadoop.mapred.JobTracker (main): Writing to file hdfs://10.0.7.65:9000/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info failed!
2014-03-31 18:25:08,060 WARN org.apache.hadoop.mapred.JobTracker (main): FileSystem is not ready yet!
2014-03-31 18:25:08,084 WARN org.apache.hadoop.mapred.JobTracker (main): Failed to initialize recovery manager.
. . .
2014-03-31 18:35:32,239 WARN org.apache.hadoop.hdfs.DFSClient (Thread-125): Error Recovery for block null bad datanode[0] nodes == null
2014-03-31 18:35:32,239 WARN org.apache.hadoop.hdfs.DFSClient (Thread-125): Could not get block locations. Source file "/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info" - Aborting...
2014-03-31 18:35:32,239 WARN org.apache.hadoop.mapred.JobTracker (main): Writing to file hdfs://10.0.7.65:9000/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info failed!
2014-03-31 18:35:32,239 WARN org.apache.hadoop.mapred.JobTracker (main): FileSystem is not ready yet!
2014-03-31 18:35:32,244 WARN org.apache.hadoop.mapred.JobTracker (main): Failed to initialize recovery manager.
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1569)
Daemon namenode log (filtered the same way):
2014-03-31 18:25:07,693 INFO org.apache.hadoop.security.ShellBasedUnixGroupsMapping (IPC Server handler 1 on 9000): add hadoop to shell userGroupsCache
2014-03-31 18:25:08,042 ERROR org.apache.hadoop.security.UserGroupInformation (IPC Server handler 11 on 9000): PriviledgedActionException as:hadoop cause:java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
2014-03-31 18:25:08,043 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 9000): IPC Server handler 11 on 9000, call addBlock(/mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info, DFSClient_678715989, null) from 10.0.7.65:36607: error: java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /mnt/var/lib/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1
Any help would be greatly appreciated.
2 Answers
ldxq2e6h1#
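This appears to be related to the DNS setup of our company's VPC. We had to create an additional VPC and then clone the database resources into it (I don't know exactly why: my access to VPC administration is limited, so I'm taking the admin's word for it).
The errors above are fairly obtuse, so hopefully knowing that they == a DNS problem will help someone else.
Some references: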
http://docs.aws.amazon.com/elasticmapreduce/latest/developerguide/emr-troubleshoot-error-vpc.html (see the section on troubleshooting DHCP errors)
http://docs.aws.amazon.com/amazonvpc/latest/userguide/vpc-dns.html
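Hadoop on a VPC requires the VPC's DHCP options to be configured with the default EC2 settings, such as "use Amazon DNS servers" and "register hosts in DNS". Without the Amazon DNS servers, the nodes of the Hadoop cluster cannot contact one another and the cluster fails to start. That is incompatible with our VPC setup, which pushes custom DNS server information through the DHCP options.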
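For anyone who wants to check this before launching, here is a minimal sketch in Python that inspects a DHCP options set and reports whether it points at the Amazon DNS server or at custom ones. It assumes the input dict has the shape of one entry of `DhcpOptions` as returned by boto3's `describe_dhcp_options`; the function names and the sample values are hypothetical and are not from the original post.

```python
from typing import Any, Dict, List

# Value AWS uses in a DHCP options set to mean "use the Amazon DNS server".
AMAZON_DNS = "AmazonProvidedDNS"


def dhcp_option_values(dhcp_options: Dict[str, Any], key: str) -> List[str]:
    """Return the values configured for one DHCP option key, or [] if absent.

    Assumption: dhcp_options looks like
    {"DhcpConfigurations": [{"Key": ..., "Values": [{"Value": ...}]}]}.
    """
    for config in dhcp_options.get("DhcpConfigurations", []):
        if config.get("Key") == key:
            return [v.get("Value", "") for v in config.get("Values", [])]
    return []


def custom_dns_servers(dhcp_options: Dict[str, Any]) -> List[str]:
    """List every configured DNS server that is not the Amazon-provided one."""
    servers = dhcp_option_values(dhcp_options, "domain-name-servers")
    return [s for s in servers if s != AMAZON_DNS]


def uses_amazon_dns(dhcp_options: Dict[str, Any]) -> bool:
    """True if the Amazon DNS server is among the configured DNS servers."""
    servers = dhcp_option_values(dhcp_options, "domain-name-servers")
    return AMAZON_DNS in servers


# Hypothetical sample data: a default-style set and a custom-DNS set.
DEFAULT_STYLE = {
    "DhcpConfigurations": [
        {"Key": "domain-name-servers", "Values": [{"Value": "AmazonProvidedDNS"}]},
    ]
}
CUSTOM_STYLE = {
    "DhcpConfigurations": [
        {"Key": "domain-name-servers",
         "Values": [{"Value": "10.3.0.2"}, {"Value": "10.3.0.3"}]},
    ]
}
```

With boto3 installed and credentials configured, the input could come from `describe_dhcp_options(DhcpOptionsIds=[...])["DhcpOptions"][0]`, using the DHCP options ID attached to your VPC. If `uses_amazon_dns` is False or `custom_dns_servers` is non-empty, the setup resembles the one that failed for us.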
ykejflvf2#
It works fine for me. In the VPC, you could try adding a Route Table to the subnet in your VPC: