Mesos replica masters do not proceed when the active master fails

xvw2m8pv, posted on 2021-06-21 in Mesos

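I have the following setup: 4 CentOS 7.0 VMs named master, box01, box02 and box03.
master: mesos master, mesos slave
box01: mesos master, mesos slave, zkserver
box02: mesos master, mesos slave, zkserver
box03: mesos slave
Whenever I run a Mesos framework on the cluster without ZooKeeper started, everything works fine. However, once I deploy and start the ZooKeeper ensemble, a framework only completes if it is launched from the same machine that is currently the active Mesos master.
For example, suppose the elected master is on box01. If I run a framework from box01, it completes fine. If I run it from the master box, I get the following log on the client side and it does not proceed: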

  I1101 13:56:11.997733 5384 sched.cpp:164] Version: 0.24.0
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@716: Client environment:host.name=master.localdomain
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@733: Client environment:user.name=root
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@741: Client environment:user.home=/root
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/user/download
  2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=box01:2181,box02:2181,box03:2181 sessionTimeout=10000 watcher=0x7f560236e6d4 sessionId=0 sessionPasswd=<null> context=0x7f5604003c50 flags=0
  2015-11-01 13:56:12,018:5383(0x7f55fd613700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.11:2181]
  2015-11-01 13:56:12,025:5383(0x7f55fd613700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.11:2181], sessionId=0x150c2c9ffc6002d, negotiated timeout=10000
  I1101 13:56:12.027992 5398 group.cpp:331] Group process (group(1)@10.0.0.10:35217) connected to ZooKeeper
  I1101 13:56:12.028153 5398 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
  I1101 13:56:12.028198 5398 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
  I1101 13:56:12.036267 5398 detector.cpp:156] Detected a new leader: (id='11')
  I1101 13:56:12.037309 5398 group.cpp:674] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
  I1101 13:56:12.041631 5398 detector.cpp:481] A new leading master (UPID=master@10.0.0.11:5050) is detected
  I1101 13:56:12.042068 5398 sched.cpp:262] New master detected at master@10.0.0.11:5050
  I1101 13:56:12.043937 5398 sched.cpp:272] No credentials provided. Attempting to register without authentication

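We can see that the client successfully detects 10.0.0.11 (box01) as the acting master. If at that moment I kill the acting Mesos master (box01), a new election is run and, since the quorum is 2 (the master and box03 boxes), a new master is elected. If the new master is the master box, the framework completes its task successfully. If it is box03, the client detects that it is the master and then hangs again. There should be a simple explanation for this, but at this point I cannot see past my own assumptions. Please help.
I am using mesos-0.24.0 and zookeeper-3.4.6.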
Zookeeper-3.4.6/conf/zoo.cfg

  tickTime=2000
  dataDir=/var/lib/zookeeper
  clientPort=2181
  initLimit=5
  syncLimit=2
  server.1=box01:2888:3888
  server.2=box02:2888:3888
  server.3=box03:2888:3888

/etc/hosts file

  127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
  ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
  10.0.0.10 master master.localdomain
  10.0.0.11 box01 box01.localdomain
  10.0.0.12 box02 box02.localdomain
  10.0.0.13 box03 box03.localdomain

On every machine, the firewall is set up as follows:

  firewall-cmd --list-ports
  5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp

To start the Mesos master, I use:

  /home/user/download/mesos-0.24.0/build/bin/mesos-master.sh --ip=10.0.0.10 --work_dir=/home/user/download/data-mesos --zk=zk://box01:2181,box02:2181,box03:2181/mesos --quorum=2

To start the Mesos slave, I use:

  /home/user/download/mesos-0.24.0/build/bin/mesos-slave.sh --master=zk://box01:2181,box02:2181,box03:2181/mesos

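Edit:
It turns out that if I run a standalone Mesos master on box02 (10.0.0.12) and try to run the framework from the master (10.0.0.10) box, the Mesos master receives the framework's request to run a job, but the job is never executed.
box02 master log
master box framework log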

  [root@master ~]# java -Djava.library.path=/usr/local/lib -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar box02:5050
  I1103 13:44:21.898962 20958 sched.cpp:164] Version: 0.24.0
  I1103 13:44:21.910660 20972 sched.cpp:262] New master detected at master@10.0.0.12:5050
  I1103 13:44:21.913422 20972 sched.cpp:272] No credentials provided. Attempting to register without authentication

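So ZooKeeper does not seem to be related to the problem. Rather, for some reason the master cannot send anything back to the machine that runs the framework (the Mesos scheduler).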

Answer #1 (fkvaft9z)

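Judging from the master log you provided, my guess is that the master cannot open a connection to your framework. This part of the master log looks suspicious: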

  I1103 13:44:21.513394 11288 master.cpp:2094] Received SUBSCRIBE call for framework 'framework-example' at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
  I1103 13:44:21.513703 11288 master.cpp:2164] Subscribing framework framework-example with checkpointing disabled and capabilities [ ]
  I1103 13:44:21.516088 11288 hierarchical.hpp:391] Added framework 20151103-134410-201326602-5050-11260-0000
  I1103 13:44:21.517375 11288 master.cpp:4613] Sending 1 offers to framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
  E1103 13:44:21.519042 11291 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107]
  I1103 13:44:21.520539 11288 master.cpp:1051] Framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455 disconnected
  I1103 13:44:21.520593 11288 master.cpp:2370] Disconnecting framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
  I1103 13:44:21.520608 11288 master.cpp:2394] Deactivating framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
  W1103 13:44:21.520922 11288 master.hpp:1409] Master attempted to send message to disconnected framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
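
In these lines the scheduler registers from 10.0.0.10:36455, and the master then fails to talk back to that address. As a rough illustration, the sketch below pulls the endpoint out of the SUBSCRIBE line and compares its port with the ports the question lists as open in the firewall. The names `OPEN_PORTS` and `scheduler_endpoint` are made up for this example, and a port missing from the list is only a hint at the cause, not proof.

```python
import re

# Ports the question reports as open in firewalld on every machine.
OPEN_PORTS = {5051, 3888, 2181, 2888, 5050}

# One line copied from the master log in this answer.
LOG_LINE = (
    "I1103 13:44:21.513394 11288 master.cpp:2094] Received SUBSCRIBE call "
    "for framework 'framework-example' at "
    "scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455"
)


def scheduler_endpoint(line):
    """Return (ip, port) for the first 'name@ip:port' in a log line, or None."""
    match = re.search(r"@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)", line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


ip, port = scheduler_endpoint(LOG_LINE)
print(f"scheduler listens on {ip}:{port}; open in firewall: {port in OPEN_PORTS}")
# prints: scheduler listens on 10.0.0.10:36455; open in firewall: False
```

The scheduler's port here is not one of the five listed ports, which would be consistent with the framework only working when it runs on the same machine as the active master.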

Could you check that the LIBPROCESS_IP variable is set correctly on the framework node, and that the master can open a connection to the framework node?
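
One way to test the second point is a plain TCP connection attempt from the master's machine to the scheduler's address. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and the script assumes you pass it the host and port that the master log shows for the scheduler (10.0.0.10 and 36455 in the log above, although the port will likely differ on each run).

```python
import socket
import sys


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Usage (hypothetical file name): python check_port.py 10.0.0.10 36455
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Run it on the master's machine while the framework is still waiting. If the port turns out to be unreachable, one possible remedy (an assumption, not something confirmed in this thread) is to pin the scheduler's port with libprocess's LIBPROCESS_PORT environment variable, open that port in the firewall on the framework node, and set LIBPROCESS_IP to the node's routable address.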