mesos上的Spark流——马拉松任务的随机失败

uz75evzq  于 2021-06-26  发布在  Mesos
关注(0)|答案(0)|浏览(231)

我将mesos1.1.0集群用于spark批处理和spark流处理作业。批处理在使用相同的spark uri-s时没有任何问题,流式处理有时表现异常。通过运行简单的bash提交脚本,使用marathon1.3.6启动流式处理作业。脚本在所有从属服务器上都是本地的,在所有从属服务器上都是相同的。
脚本:

spark-submit --class "$1" --master  mesos://zk://master1:2181,master2:2181,master3:2181/mesos --conf spark.executor.uri=/spark/spark-2.0.1-bin-hadoop2.6.tgz --conf spark.cores.max=2 --conf spark.eventLog.enabled=false /spark_jobs/"$2"

这一切大部分时间都很好,马拉松部署工作,工作开始,并保持活跃和工作。这是一个开始很好的工作的mesos标准样品。这只是日志文件的开始,因为我认为它是问题出现的地方。

#################### 正常启动

I1224 09:02:00.480056 23118 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/root","items":[{"action":"BYPASS_CACHE","uri":{"extract":true,"value":"\/spark\/spark-2.0.1-bin-hadoop2.6.tgz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3\/frameworks\/cd6497c0-d374-4957-b861-286fd0b27587-13920\/executors\/0\/runs\/feb948b2-1b37-4962-a682-77aeb45a3063","user":"root"}
I1224 09:02:00.483033 23118 fetcher.cpp:409] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483050 23118 fetcher.cpp:250] Fetching directly into the sandbox directory
I1224 09:02:00.483072 23118 fetcher.cpp:187] Fetching URI '/spark/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.483088 23118 fetcher.cpp:167] Copying resource with command:cp '/spark/spark-2.0.1-bin-hadoop2.6.tgz' '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:00.675047 23118 fetcher.cpp:84] Extracting with command: tar -C '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063' -xf '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.797693 23118 fetcher.cpp:92] Extracted '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz' into '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063'
I1224 09:02:02.797751 23118 fetcher.cpp:547] Fetched '/spark/spark-2.0.1-bin-hadoop2.6.tgz' to '/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/cd6497c0-d374-4957-b861-286fd0b27587-13920/executors/0/runs/feb948b2-1b37-4962-a682-77aeb45a3063/spark-2.0.1-bin-hadoop2.6.tgz'
I1224 09:02:02.850260 23144 exec.cpp:162] Version: 1.1.0
I1224 09:02:02.852879 23147 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/24 09:02:03 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 23155@mr02
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for TERM
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for HUP
16/12/24 09:02:03 INFO SignalUtils: Registered signal handler for INT
16/12/24 09:02:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/24 09:02:04 INFO SecurityManager: Changing view acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls to: root
16/12/24 09:02:04 INFO SecurityManager: Changing view acls groups to: 
16/12/24 09:02:04 INFO SecurityManager: Changing modify acls groups to: 
16/12/24 09:02:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
16/12/24 09:02:04 INFO TransportClientFactory: Successfully created connection to /10.210.197.151:57731 after 82 ms (0 ms spent in bootstraps)

有时,当我们重新启动这个作业时,作业被初始化,它在mesos中注册为框架,它启动spark web ui,但它什么也不做,所有任务都排队,它正在做一些奇怪的事情(计算什么的阶段?)。有时需要多次重启才能使其正常工作,这不是期望的行为,因为过程应该是自动化的。

异常启动

I1224 08:32:11.346411 19741 exec.cpp:162] Version: 1.1.0
I1224 08:32:11.348379 19743 exec.cpp:237] Executor registered on agent bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@730: Client environment:host.name=mr02
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@738: Client environment:os.arch=3.13.0-100-generic
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@739: Client environment:os.version=#147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@log_env@767: Client environment:user.dir=/var/lib/mesos/slaves/bf8a3062-0309-4c1f-91cd-79d7515f9ce7-S3/frameworks/aee5cbd3-b3e8-431d-bc18-fb5107d54074-0000/executors/jobs_storagetraffic17.76d13635-c9b3-11e6-bf95-bc764e104d04/runs/cd1a62ea-4cda-4599-a6fc-e7c5d1ad3956
2016-12-24 08:32:13,838:19753(0x7f8864b02700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mr01:2181,mr02:2181,mr03:2181 sessionTimeout=10000 watcher=0x7f8872f72200 sessionId=0 sessionPasswd=<null> context=0x7f883c000b10 flags=0
I1224 08:32:13.839673 19843 sched.cpp:226] Version: 1.1.0
2016-12-24 08:32:13,840:19753(0x7f88577fe700):ZOO_INFO@check_events@1728: initiated connection to server [10.210.224.189:2181]
2016-12-24 08:32:13,845:19753(0x7f88577fe700):ZOO_INFO@check_events@1775: session establishment complete on server [10.210.224.189:2181], sessionId=0x3592d18d17a001a, negotiated timeout=10000
I1224 08:32:13.845521 19838 group.cpp:340] Group process (zookeeper-group(1)@10.210.198.20:43514) connected to ZooKeeper
I1224 08:32:13.845562 19838 group.cpp:828] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1224 08:32:13.845579 19838 group.cpp:418] Trying to create path '/mesos' in ZooKeeper
I1224 08:32:13.846849 19838 detector.cpp:152] Detected a new leader: (id='27')
I1224 08:32:13.846951 19838 group.cpp:697] Trying to get '/mesos/json.info_0000000027' in ZooKeeper
I1224 08:32:13.847759 19841 zookeeper.cpp:259] A new leading master (UPID=master@10.210.224.189:5050) is detected
I1224 08:32:13.847839 19838 sched.cpp:330] New master detected at master@10.210.224.189:5050
I1224 08:32:13.848011 19838 sched.cpp:341] No credentials provided. Attempting to register without authentication
I1224 08:32:13.850549 19841 sched.cpp:743] Framework registered with cd6497c0-d374-4957-b861-286fd0b27587-13914
16/12/24 08:32:14 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property group.id is overridden to 
16/12/24 08:32:14 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to 
16/12/24 08:32:20 INFO utils.VerifiableProperties: Verifying properties
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property group.id is overridden to 
16/12/24 08:32:20 INFO utils.VerifiableProperties: Property zookeeper.connect is overridden to 

[Stage 0:>                                                         (0 + 0) / 31]
[Stage 0:>                                                         (0 + 1) / 31]
[Stage 0:>                                                         (0 + 2) / 31]
[Stage 0:=>                                                        (1 + 3) / 31]
[Stage 0:==================================>                      (19 + 0) / 31]

[Stage 1:==================================>                      (19 + 0) / 31]

[Stage 2:==================================>                      (19 + 0) / 31]

[Stage 3:==================================>                      (19 + 0) / 31]

[Stage 4:==================================>                      (19 + 0) / 31]

[Stage 5:==================================>                      (19 + 0) / 31]

[Stage 6:==================================>                      (19 + 0) / 31]

[Stage 7:==================================>                      (19 + 0) / 31]

[Stage 8:==================================>                      (19 + 0) / 31]

[Stage 9:==================================>                      (19 + 0) / 31]

[Stage 10:==================================>                     (19 + 0) / 31]

[Stage 11:==================================>                     (19 + 0) / 31]

[Stage 12:==================================>                     (19 + 0) / 31]

[Stage 13:==================================>                     (19 + 0) / 31]
[Stage 13:==================================>                     (19 + 2) / 31]

[Stage 14:==================================>                     (19 + 0) / 31]
[Stage 14:=====================================>                  (21 + 2) / 31]

[Stage 15:==================================>                     (19 + 0) / 31]
[Stage 15:=====================================>                  (21 + 2) / 31]

[Stage 16:==================================>                     (19 + 0) / 31]
[Stage 16:=========================================>              (23 + 2) / 31]

[Stage 17:==================================>                     (19 + 0) / 31]
[Stage 17:=============================================>          (25 + 2) / 31]

[Stage 18:==================================>                     (19 + 0) / 31]
[Stage 18:=============================================>          (25 + 2) / 31]

[Stage 19:==================================>                     (19 + 0) / 31]
[Stage 19:=============================================>          (25 + 2) / 31]

[Stage 20:==================================>                     (19 + 0) / 31]
[Stage 20:===========================================>            (24 + 2) / 31]

[Stage 21:==================================>                     (19 + 0) / 31]

[Stage 22:==================================>                     (19 + 0) / 31]

[Stage 23:==================================>                     (19 + 0) / 31]

[Stage 24:==================================>                     (19 + 0) / 31]

[Stage 25:==================================>                     (19 + 0) / 31]

[Stage 26:==================================>                     (19 + 0) / 31]

[Stage 27:==================================>                     (19 + 0) / 31]

[Stage 28:==================================>                     (19 + 0) / 31]

[Stage 29:==================================>                     (19 + 0) / 31]

[Stage 30:==================================>                     (19 + 0) / 31]

[Stage 31:==================================>                     (19 + 0) / 31]

[Stage 32:==================================>                     (19 + 0) / 31]

[Stage 33:==================================>                     (19 + 0) / 31]

[Stage 34:==================================>                     (19 + 0) / 31]

[Stage 35:==================================>                     (19 + 0) / 31]

[Stage 36:==================================>                     (19 + 0) / 31]

[Stage 37:==================================>                     (19 + 0) / 31]

[Stage 38:==================================>                     (19 + 0) / 31]

[Stage 39:==================================>                     (19 + 0) / 31]

我怀疑问题出在日志文件的前几行,由于某种原因,异常作业没有fetcher部分,可能它没有复制/spark/spark-2.0.1-bin-hadoop2.6.tgz,并且没有二进制文件启动…但是,它如何启动spark ui?
有没有人知道如何解决这个问题,在哪里寻找额外的线索,测试什么来获得更多的数据?
谢谢,
伊凡

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题