spark/mesos/任务丢失,从机被列入黑名单,执行者被移除

x6yk4ghg  于 2021-06-26  发布在  Mesos
关注(0)|答案(1)|浏览(491)

我在Spark2.2.0和Scala2.11.11上运行SparkSubmit作业,在Mesos1.4.2上运行sbt。
我有任务丢失和执行人没有注册的问题。以下是症状:
MesoScorSegrainedSchedulerBackend启动任务,直到达到spark.cores.max。例如,它在这里启动6个任务:

18/06/11 12:49:54 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585462 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 2 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 2 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 0 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585463 with attributes: Map() mem: 300665.0 cpu: 71.5 ports: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)).  Launching 3 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 4 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 3 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 1 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585464 with attributes: Map() mem: 423417.0 cpu: 55.5 ports: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000)).  Launching 1 Mesos tasks.
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Launching Mesos task: 5 with mem: 11264.0 cpu: 20.0 ports: 
18/06/11 12:49:55 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585465 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000)) for 120 seconds  (reason: reached spark.cores.max)

然后紧接着,它开始失去任务和黑名单奴隶甚至认为我已经设置 spark.blacklist.enabled=false ```
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 2 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S0 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 4 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 3 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 1 is now TASK_LOST
18/06/11 12:49:55 INFO MesosCoarseGrainedSchedulerBackend: Blacklisting Mesos slave a6031461-f185-424d-940e-b45fb64a2aaf-S1 due to too many failures; is Spark installed on it?

之后 `non-existent` 遗嘱执行人被免职

18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Received 2 resource offers.
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 2 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 2 requested
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585466 with attributes: Map() mem: 300665.0 cpu: 71.5 port: List((1025,2180), (2182,2718), (2721,3887), (3889,5049), (5052,5455), (5457,8079), (8082,8180), (8182,8262), (8264,8558), (8560,8792), (8794,10231), (10233,16506), (16508,18593), (18595,32000))
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 2
18/06/11 12:49:56 DEBUG MesosCoarseGrainedSchedulerBackend: Declining offer: a6031461-f185-424d-940e-b45fb64a2aaf-O585467 with attributes: Map() mem: 412153.0 cpu: 35.5 port: List((1025,2180), (2182,3887), (3889,5049), (5052,5507), (5509,8079), (8082,8180), (8182,8792), (8794,9177), (9179,12396), (12398,16297), (16299,16839), (16841,18310), (18312,21795), (21797,22269), (22271,32000))
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 0 with reason Executor finished with state LOST
18/06/11 12:49:56 INFO BlockManagerMaster: Removal of executor 0 requested
18/06/11 12:49:56 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
18/06/11 12:49:56 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 4 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 4 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 4
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 3 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 3 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 3
18/06/11 12:49:59 DEBUG CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove executor 1 with reason Executor finished with state LOST
18/06/11 12:49:59 INFO BlockManagerMaster: Removal of executor 1 requested
18/06/11 12:49:59 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
18/06/11 12:49:59 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.

但是请注意,一个任务5没有丢失,执行器5也没有被移除

18/06/11 12:49:59 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 5 is now TASK_RUNNING
18/06/11 12:50:01 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (SlaveIp:46884) with ID 5
18/06/11 12:50:01 INFO BlockManagerMasterEndpoint: Registering block manager SpaveIP:32840 with 5.2 GB RAM, BlockManagerId(5, SlaveIP, 32840, None)

以下是我的sparksession设置:

val spark = SparkSession.builder
.config("spark.executor.cores", 20)
.config("spark.executor.memory", "10g")
.config("spark.sql.shuffle.partitions", numPartitionsShuffle)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.network.timeout", "1200s")
.config("spark.blacklist.enabled", false)
.config("spark.blacklist.maxFailedTaskPerExecutor", 100)
.config("spark.dynamicAllocation.enabled", false)
.getOrCreate()

这是我的剧本

spark-submit
--class MyMainClass
--master mesos://masterIP:7077
--total-executor-cores 120
--driver-memory 200g
--deploy-mode cluster
--name MyMainClass
--conf "spark.shuffle.service.enabled=false"
--conf "spark.dynamicAllocation.enabled=false"
--conf "spark.blacklist.enabled=false"
--conf "spark.blacklist.maxFailedTaskPerExecutor=100"
--verbose
myJar-assembly-0.1.0-SNAPSHOT.jar

注:
我注意到,如果我休息一下,把工作做得很好。但是,如果我试图接二连三地或在我干掉前一份工作之后,上述问题就会出现。
我的群集上有足够的资源来运行这些任务
我在sparksession和spark submit中复制设置,因为 `config` 与 `--conf` 并不总是很清楚。
在非动态模式下运行很重要。
失去的遗嘱执行人
我将调试日志与基于spark2.0.1的旧的仍处于活动状态的退役集群安装的日志进行了比较。有完全相同的代码启动任务,立即得到一个 `TASK_RUNNING` 状态。
我的google和stackoverflow搜索没有得到任何有用的信息。
设置 `spark.blacklist.maxFailedTaskPerExecutor` 以及 `spark.blacklist.enabled` 似乎不起作用
相关未回答的问题[spark on mesos(dc/os)在做任何事情之前丢失任务](spark on mesos(dc/os)在做任何事情之前丢失任务
我完全不知道发生了什么。
问题:
你需要更多的信息来帮我诊断吗?
为什么工作一启动就失去了大部分任务?我看到了任务的原因,但似乎没有一个原因能解释它。
为什么要求撤换不存在的遗嘱执行人?
我应该朝哪个方向看?
这是否与前一个任务被终止,而没有等待足够长的时间来启动下一个任务有关?
efzxgjgh

efzxgjgh1#

我在回答我自己的问题:
我们发现我们的问题是双重的。
主机和工作机之间的通信/连接出现了一些未确认的问题,导致mesos任务(执行器)丢失。日志中没有解释这个问题是什么。
每当一个工人至少丢失2个mesos任务时,它就会被列入黑名单。在spark 2.2中,2的限制在代码中是硬编码的,不能更改。有关详细信息,请参见:对于mesoscoparsegrainedschedulerbackend,黑名单始终处于活动状态
因此:
有时通信问题没有发生,作业执行正常。
大多数时候,所有的遗嘱执行人都是在工作一开始就失踪了。通过在我们的集群中有2个工人,我们一次只能运行3个执行器。在作业开始时,所有执行器(worker1上的2个和worker2上的1个)都将丢失,但只有worker1将被列入黑名单,并且丢失的执行器将在worker2上重新启动并继续运行,不会出现问题。
解决方案:
我不确定这是否是这个问题的一般解决方案,但我们有些盲目地寻找调控不同Mesos的配置 timeout 我们在mesos 1.4中发现了这个错误:
在mesos本机调度程序客户端中使用failovertimeout为0会导致无限订阅循环
作为测试,我们设定了 SparkSession 配置 spark.mesos.driver.failoverTimeout=1.0 . 这似乎解决了我们的问题。在工作开始的时候,我们不会再失去遗嘱执行人。

相关问题