spark作业无法从mesos集群获取资源

m0rkklqb  于 2021-06-26  发布在  Mesos
关注(0)|答案(0)|浏览(329)

我正在使用spark作业服务器(sjs)创建上下文并提交作业。
我的集群包括4台服务器。

master1: 10.197.0.3
master2: 10.197.0.4
master3: 10.197.0.5
master4: 10.197.0.6

但只有master1拥有公共ip。
首先,我为master1、master3和master3设置了zookeeper,zookeeper id从1到3。我意愿使用大师1,大师2,大师3成为集群的大师。那是什么意思 quorum=2 我准备三个大师赛。zk连接是 zk://master1:2181,master2:2181,master3:2181/mesos 每台服务器我也启动mesos slave,所以我有4个slave和3个master。

正如你所看到的,所有的奴隶都是奴隶。但有趣的是,当我创建一个作业来运行时,它无法获取资源。

从日志中我看到它继续拒绝报价。这是主人的日志。

I0523 15:01:00.116981 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4264 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:00.117086 32513 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4265 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:01.460502 32508 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (914)@127.0.0.1:5050
I0523 15:01:02.117753 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.118099 32510 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.119299 32508 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4266 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0000 (sql_context) at scheduler-9b4637cf-4b27-4629-9a73-6019443ed30b@10.197.0.3:28765
I0523 15:01:02.119858 32515 master.cpp:3641] Processing DECLINE call for offers: [ dc18c89f-d802-404b-9221-71f0f15b096f-O4267 ] for framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765
I0523 15:01:02.900946 32509 http.cpp:312] HTTP GET for /master/state from 10.197.0.3:35778 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36' with X-Forwarded-For='113.161.38.181'
I0523 15:01:03.118147 32514 master.cpp:5324] Sending 1 offers to framework dc18c89f-d802-404b-9221-71f0f15b096f-0001 (sql_context-1) at scheduler-f5196abd-f420-48c6-b2fe-0306595601d4@10.197.0.3:28765

我查了我的一个奴隶

W0523 14:53:15.487599 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019
W0523 14:53:15.487773 32681 status_update_manager.cpp:475] Resending status update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020
I0523 14:53:15.487820 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: 3c3a022c-2032-4da1-bbab-c367d46e07de) for task driver-20160523111535-0003 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019 to master@10.197.0.3:5050
I0523 14:53:15.488008 32680 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: cfb494b3-6484-4394-bd94-80abf2e11ee8) for task driver-20160523112724-0001 of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020 to master@10.197.0.3:5050
I0523 15:02:24.120436 32680 http.cpp:190] HTTP GET for /slave(1)/state from 113.161.38.181:63097 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
W0523 15:02:24.165690 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523111535-0003' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0019: Container 'cac7667c-3309-4380-9f95-07d9b888e44e' not found
W0523 15:02:24.165771 32685 slave.cpp:4979] Failed to get resource statistics for executor 'driver-20160523112724-0001' of framework a9871c4b-ab0c-4ddc-8d96-c52faf0e66f7-0020: Container '9c661311-bf7f-4ea6-9348-ce8c7f6cfbcb' not found

从sjs日志

[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4565 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4566 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,305] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4567 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:10,366] WARN  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[2016-05-23 15:04:10,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4568 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:11,306] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4569 with attributes: Map() mem: 63403.0 cpu: 8
[2016-05-23 15:04:11,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0
[2016-05-23 15:04:12,308] DEBUG oarseMesosSchedulerBackend [] [] - Declining offer: dc18c89f-d802-404b-9221-71f0f15b096f-O4570 with attributes: Map() mem: 47244.0 cpu: 8
[2016-05-23 15:04:12,505] DEBUG cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql_context] - parentName: , name: TaskSet_0, runningTasks: 0

在master2日志中

May 23 08:19:44 ants-vps mesos-master[1866]: E0523 08:19:44.273349  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:19:54 ants-vps mesos-master[1866]: I0523 08:19:54.274245  1899 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1257)@127.0.0.1:5050
May 23 08:19:54 ants-vps mesos-master[1866]: E0523 08:19:54.274533  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected
May 23 08:20:04 ants-vps mesos-master[1866]: I0523 08:20:04.275291  1897 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (1260)@127.0.0.1:5050
May 23 08:20:04 ants-vps mesos-master[1866]: E0523 08:20:04.275512  1902 process.cpp:1958] Failed to shutdown socket with fd 28: Transport endpoint is not connected

来自master3:

May 23 08:21:05 ants-vps mesos-master[22023]: I0523 08:21:05.994082 22042 recover.cpp:193] Received a recover response from a replica in EMPTY status
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994051 22043 recover.cpp:109] Unable to finish the recover protocol in 10secs, retrying
May 23 08:21:15 ants-vps mesos-master[22023]: I0523 08:21:15.994529 22036 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (1282)@127.0.0.1:5050

如何找出问题的原因并加以解决?

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题