我在ApacheMesos 1.8.1上使用ApacheFlink1.9.1,在3节点集群上对这两个节点使用或多或少的标准设置。第一个节点(192.168.10.11)运行mesos主机和flink应用程序主机。其他两个节点(192.168.10.13、192.168.10.14)都是运行mesos代理的从属节点,可以访问flink worker docker容器。flink是从docker内部运行的,但mesos master和代理是在主机上本地运行的。
一开始一切正常。我提交给flink的第一批作业在app master上被正确地执行,资源管理器与mesos协商成功,资源管理器框架在其中一个从属节点上得到一个新的worker任务。任务成功完成,客户端程序得到结果。
几分钟后,resourcemanager超时,flink worker任务(docker容器)由于空闲超时而终止。到现在为止,一直都还不错。
但在那一刻之后,如果我以与第一次完全相同的方式再次提交同一个作业,资源管理器将不做任何操作,也不会尝试创建任何新的worker。作业等待5分钟,超时后失败。
如果有人能告诉我哪里出了问题,我会很高兴的。
以下是flink应用程序主机和mesos主机的相关日志摘录:
flink应用程序主程序的启动
2019-10-24 06:36:05,191 INFO ....entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-10-24 06:36:05,195 INFO ....entrypoint.ClusterEntrypoint - Starting MesosSessionClusterEntrypoint (Version: 1.9.1, Rev:4d56de8, Date:30.09.2019 @ 11:32:19 CST)
2019-10-24 06:36:05,195 INFO ....entrypoint.ClusterEntrypoint - OS current user: flink
2019-10-24 06:36:07,045 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-10-24 06:36:07,504 INFO ....entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: flink
2019-10-24 06:36:07,504 INFO ....entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.222-b10
2019-10-24 06:36:07,504 INFO ....entrypoint.ClusterEntrypoint - Maximum heap size: 2169 MiBytes
2019-10-24 06:36:07,504 INFO ....entrypoint.ClusterEntrypoint - JAVA_HOME: (not set)
2019-10-24 06:36:07,519 INFO ....entrypoint.ClusterEntrypoint - Hadoop version: 2.8.3
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - JVM Options:
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - -Dlog.file=/usr/local/lib/flink-1.9.1/log/flink--mesos-appmaster-deb-node1.log
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/usr/local/lib/flink-1.9.1/conf/log4j.properties
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/usr/local/lib/flink-1.9.1/conf/logback.xml
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - Program Arguments: (none)
(...)
2019-10-24 06:36:07,520 INFO ....entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2019-10-24 06:36:07,526 INFO ....entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-10-24 06:36:07,574 INFO ....configuration.GlobalConfiguration - Loading configuration property
----
properties summarized :
----
jobmanager.rpc.address, 192.168.10.11
jobmanager.rpc.port, 6123
jobmanager.heap.size, 1024m
taskmanager.heap.size, 1024m
taskmanager.numberOfTaskSlots, 2
parallelism.default, 2
state.checkpoints.dir, file:///var/lib/flink/data/ckeckpoints
state.savepoints.dir, file:///var/lib/flink/data/savepoints
jobmanager.execution.failover-strategy, region
rest.port, 8081
rest.bind-address, 0.0.0.0
jobmanager.archive.fs.dir, file:///var/lib/flink/completed_jobs/
historyserver.archive.fs.dir, file:///var/lib/flink/completed_jobs/
mesos.resourcemanager.tasks.container.type, docker
mesos.resourcemanager.tasks.container.image.name, eskimo:flink-worker
mesos.resourcemanager.tasks.bootstrap-cmd, export FLINK_HOME=/usr/local/lib/flink/
mesos.resourcemanager.tasks.container.volumes, /var/log/flink:/var/log/flink:RW,/var/lib/flink:/var/lib/flink:RW,/etc:/host_etc:RO
mesos.resourcemanager.tasks.cpus, 2
mesos.resourcemanager.tasks.mem, 1024
mesos.master, zk://192.168.10.11:2181/mesos
----
2019-10-24 06:36:07,753 INFO ....entrypoint.ClusterEntrypoint - Starting MesosSessionClusterEntrypoint.
(...)
2019-10-24 06:36:09,393 INFO ....runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at 192.168.10.11:6123
(...)
(bla bla bla akka starting and all the rest)
(...)
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Registering as new framework.
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - --------------------------------------------------------------------------------
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Mesos Info:
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Master URL: zk://192.168.10.11:2181/mesos
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Framework Info:
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - ID: (none)
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Name: Flink
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Failover Timeout (secs): 604800.0
2019-10-24 06:36:14,842 INFO ....MesosResourceManager - Role: *
2019-10-24 06:36:14,843 INFO ....MesosResourceManager - Capabilities: (none)
2019-10-24 06:36:14,843 INFO ....MesosResourceManager - Principal: (none)
2019-10-24 06:36:14,843 INFO ....MesosResourceManager - Host: 192.168.10.11
2019-10-24 06:36:14,843 INFO ....MesosResourceManager - Web UI: http://192.168.10.11:8081
2019-10-24 06:36:14,843 INFO ....MesosResourceManager - --------------------------------------------------------------------------------
2019-10-24 06:36:14,877 INFO ....runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@192.168.10.11:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
(...)
2019-10-24 06:36:15,714 INFO ....mesos.scheduler.ConnectionMonitor - Connecting to Mesos...
2019-10-24 06:36:15,764 INFO ....MesosResourceManager - Mesos resource manager started.
(...)
2019-10-24 06:36:46,033 INFO ....mesos.scheduler.ConnectionMonitor - Connected to Mesos as framework ID e1432c43-f540-4d53-9cc7-b06e0f2a6525-0000.
在flink应用程序主程序启动后立即登录mesos主程序
I1024 06:36:52.096225 19 master.cpp:2920] Received SUBSCRIBE call for framework 'Flink' at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.097496 25 master.cpp:3001] Subscribing framework Flink with checkpointing enabled and capabilities [ ]
I1024 06:36:52.098603 25 master.cpp:10463] Adding framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with roles { } suppressed
I1024 06:36:52.099602 18 hierarchical.cpp:373] Added framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:36:52.101519 18 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O1, 8977e508-0698-4c40-9317-cdc3cc8982f4-O2 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.185779 20 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:36:52.186064 25 hierarchical.cpp:1346] Suppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
在第一次提交作业时登录flink应用程序主控程序(工作正常的作业)
2019-10-24 06:37:18,800 INFO ....runtime.dispatcher.StandaloneDispatcher - Received JobGraph submission 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,804 INFO ....runtime.dispatcher.StandaloneDispatcher - Submitting job 50934ae259b85a449e9a789f82287305 (Flink Java Job at Thu Oct 24 06:37:17 UTC 2019).
2019-10-24 06:37:18,889 INFO ....runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_0 .
2019-10-24 06:37:18,933 INFO ....JobMaster - Initializing job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305).
(...)
2019-10-24 06:37:19,160 INFO ....jobmaster.JobManagerRunner - JobManager runner for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0.
2019-10-24 06:37:19,175 INFO ....JobMaster - Starting execution of job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) under job master id 00000000000000000000000000000000.
2019-10-24 06:37:19,177 INFO ....ExecutionGraph - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state CREATED to RUNNING.
2019-10-24 06:37:19,191 INFO ....ExecutionGraph - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from CREATED to SCHEDULED.
2019-10-24 06:37:19,229 INFO ....SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}]
2019-10-24 06:37:19,265 INFO ....JobMaster - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:37:19,272 INFO ....JobMaster - Resolved ResourceManager address, beginning registration
2019-10-24 06:37:19,273 INFO ....JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:37:19,284 INFO ....MesosResourceManager - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:19,296 INFO (...)
2019-10-24 06:37:19,305 INFO ....JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:37:19,306 INFO ....SlotPoolImpl - Requesting new slot [SlotRequestId{4c886029d8d6a410cde790ba6a7ac3c0}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:37:19,311 INFO ....MesosResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 50934ae259b85a449e9a789f82287305 with allocation id 907586ba9ece54f20c23341276e6bac8.
2019-10-24 06:37:19,313 INFO ....MesosResourceManager - Starting a new worker.
2019-10-24 06:37:19,458 INFO ....MesosResourceManager - Scheduling Mesos task taskmanager-00001 with (1024.0 MB, 2.0 cpus).
2019-10-24 06:37:19,473 INFO ....mesos.scheduler.LaunchCoordinator - Now gathering offers for at least 1 task(s).
2019-10-24 06:37:19,495 INFO ....mesos.scheduler.LaunchCoordinator - Received offer(s) of 9432.0 MB, 4.0 cpus, 0.0 gpus:
2019-10-24 06:37:19,499 INFO ....mesos.scheduler.LaunchCoordinator - 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:19,500 INFO ....mesos.scheduler.LaunchCoordinator - 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 from 192.168.10.14 of 4793.0 MB, 2.0 cpus, 0.0 gpus for [*]
2019-10-24 06:37:20,517 INFO ....mesos.scheduler.LaunchCoordinator - Processing 1 task(s) against 2 new offer(s) plus outstanding offers.
2019-10-24 06:37:20,578 INFO ....mesos.scheduler.LaunchCoordinator - Resources considered: (note: expired offers not deducted from below)
2019-10-24 06:37:20,587 INFO ....mesos.scheduler.LaunchCoordinator - 192.168.10.13 has 4639.0 MB, 2.0 cpus, null gpus
2019-10-24 06:37:20,841 INFO ....MesosResourceManager - Launching Mesos task taskmanager-00001 on host 192.168.10.14.
2019-10-24 06:37:20,841 INFO ....mesos.scheduler.LaunchCoordinator - Launched 1 task(s) on 192.168.10.14 using 1 offer(s):
2019-10-24 06:37:20,842 INFO ....mesos.scheduler.LaunchCoordinator - 8977e508-0698-4c40-9317-cdc3cc8982f4-O4
2019-10-24 06:37:20,843 INFO ....mesos.scheduler.LaunchCoordinator - No longer gathering offers; all requests fulfilled.
2019-10-24 06:37:20,844 INFO com.netflix.fenzo.TaskScheduler - Expiring all leases
2019-10-24 06:37:20,846 INFO ....mesos.scheduler.LaunchCoordinator - Declined offer 8977e508-0698-4c40-9317-cdc3cc8982f4-O3 from 192.168.10.13 of 4639.0 MB, 2.0 cpus.
2019-10-24 06:37:26,251 INFO ....mesos.scheduler.TaskMonitor - Mesos task taskmanager-00001 is running.
2019-10-24 06:37:29,575 INFO ....MesosResourceManager - Registering TaskManager with ResourceID taskmanager-00001 (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) at ResourceManager
2019-10-24 06:37:29,693 INFO ....ExecutionGraph - DataSource (at $line107.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (842c8e8d1e6b91746b428331ab8712d3) switched from SCHEDULED to DEPLOYING.
(...)
(everything going on fine)
(...)
2019-10-24 06:37:30,041 INFO ....ExecutionGraph - Job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019 (50934ae259b85a449e9a789f82287305) switched from state RUNNING to FINISHED.
2019-10-24 06:37:30,076 INFO ....StandaloneDispatche r - Job 50934ae259b85a449e9a789f82287305 reached globally terminal state FINISHED.
2019-10-24 06:37:30,205 INFO ....FsJobArchivist - Job 50934ae259b85a449e9a789f82287305 has been archived at file:/var/lib/flink/completed_jobs/50934ae259b85a449e9a789f82287305.
2019-10-24 06:37:30,206 INFO ....JobMaster - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:37:17 UTC 2019(50934ae259b85a449e9a789f82287305).
2019-10-24 06:37:30,226 INFO ....SlotPoolImpl - Suspending SlotPool.
2019-10-24 06:37:30,227 INFO ....JobMaster - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:37:30,227 INFO ....SlotPoolImpl - Stopping SlotPool.
2019-10-24 06:37:30,232 INFO ....MesosResourceManager - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_0 for job 50934ae259b85a449e9a789f82287305 from the resource manager.
2019-10-24 06:37:30,232 INFO ....jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
在Mesos中对应的时间:
I1024 06:37:19.475060 23 hierarchical.cpp:1387] Unsuppressed offers for roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.475097 23 hierarchical.cpp:1409] Revived roles { * } of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:19.476991 19 master.cpp:10048] Sending offers [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O3, 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] to framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.845321 19 master.cpp:3438] Processing SUPPRESS call for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
(...)
I1024 06:37:20.856938 23 master.cpp:4590] Processing ACCEPT call for offers: [ 8977e508-0698-4c40-9317-cdc3cc8982f4-O4 ] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) for framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:37:20.860610 20 master.cpp:4151] Adding task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:20.861378 20 master.cpp:5569] Launching task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"ports","ranges":{"range":[{"begin":31000,"end":31001}]},"type":"RANGES"}] on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) on new executor
I1024 06:37:22.788302 24 master.cpp:8703] Status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:22.788444 24 master.cpp:8760] Forwarding status update TASK_STARTING (Status UUID: 240ff22e-d729-469c-a96d-aed5b19c0101) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:22.790689 24 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_STARTING, status update state: TASK_STARTING)
(...)
I1024 06:37:26.241962 23 master.cpp:8703] Status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:37:26.242094 23 master.cpp:8760] Forwarding status update TASK_RUNNING (Status UUID: d96cf8c7-5d22-4dfb-9ca5-349a847b5806) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
I1024 06:37:26.243048 23 master.cpp:11672] Updating the state of task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
几分钟后,taskmanager的空闲超时开始,taskmanager被终止
2019-10-24 06:38:21,649 INFO ....MesosResourceManager - Stopping worker taskmanager-00001.
2019-10-24 06:38:21,672 INFO ....MesosResourceManager - Closing TaskExecutor connection taskmanager-00001 because: TaskExecutor exceeded the idle timeout.
2019-10-24 06:38:21,760 WARN ....MesosResourceManager - Discard registration from TaskExecutor taskmanager-00001 at (akka.tcp://flink@192.168.10.14:31001/user/taskmanager_0) because the framework did not recognize it
2019-10-24 06:38:26,674 INFO ....mesos.scheduler.TaskMonitor - Re-attempting to kill Mesos task taskmanager-00001.
2019-10-24 06:38:26,891 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@192.168.10.14:31001] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2019-10-24 06:38:27,318 INFO ....mesos.scheduler.TaskMonitor - Mesos task taskmanager-00001 exited as planned.
2019-10-24 06:38:27,326 INFO ....MesosResourceManager - Worker taskmanager-00001 finished successfully with message: Container exited with status 137
mesos中的相应日志:
I1024 06:38:26.678206 20 master.cpp:6261] Processing KILL call for task 'taskmanager-00001' of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:26.678450 20 master.cpp:6339] Telling agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14) to kill task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 (Flink) at scheduler-962dfe34-be82-44c5-9c59-95d01938bdee@127.0.1.1:44411
I1024 06:38:27.304287 20 master.cpp:8703] Status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 from agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
I1024 06:38:27.304426 20 master.cpp:8760] Forwarding status update TASK_KILLED (Status UUID: 94441e8d-02eb-4252-9014-abea6b372acc) for task taskmanager-00001 of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000
(...)
I1024 06:38:27.319533 19 master.cpp:11770] Removing task taskmanager-00001 with resources cpus(allocated: *):2; mem(allocated: *):1024; ports(allocated: *):[31000-31001] of framework 8977e508-0698-4c40-9317-cdc3cc8982f4-0000 on agent 8977e508-0698-4c40-9317-cdc3cc8982f4-S1 at slave(1)@192.168.10.14:5051 (192.168.10.14)
app master中的第二个作业提交(保持挂起并在超时后停止而不进行任何处理的作业)
2019-10-24 06:39:25,220 INFO ....StandaloneDispatcher - Received JobGraph submission 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,220 INFO ....StandaloneDispatcher - Submitting job 4888c165c1a9eae4dd69cd533c66a5c1 (Flink Java Job at Thu Oct 24 06:39:24 UTC 2019).
2019-10-24 06:39:25,221 INFO ....AkkaRpcService - Starting RPC endpoint for ....JobMaster at akka://flink/user/jobmanager_1 .
2019-10-24 06:39:25,225 INFO ....JobMaster - Initializing job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,228 INFO ....JobMaster - Using restart strategy NoRestartStrategy for Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,230 INFO ....ExecutionGraph - Job recovers via failover strategy: New Pipelined Region Failover
2019-10-24 06:39:25,231 INFO ....JobMaster - Running initialization on master for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:39:25,232 INFO ....JobMaster - Successfully ran initialization on master in 1 ms.
2019-10-24 06:39:25,234 INFO ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Start building failover regions.
2019-10-24 06:39:25,234 INFO ....executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Created 1 failover regions.
2019-10-24 06:39:25,235 INFO ....jobmaster.JobManagerRunner - JobManager runner for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1.
2019-10-24 06:39:25,236 INFO ....JobMaster - Starting execution of job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) under job master id 00000000000000000000000000000000.
2019-10-24 06:39:25,236 INFO ....ExecutionGraph - Job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019 (4888c165c1a9eae4dd69cd533c66a5c1) switched from state CREATED to RUNNING.
2019-10-24 06:39:25,236 INFO ....ExecutionGraph - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) (....api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from CREATED to SCHEDULED.
2019-10-24 06:39:25,237 INFO ....SlotPoolImpl - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}]
2019-10-24 06:39:25,238 INFO ....JobMaster - Connecting to ResourceManager akka.tcp://flink@192.168.10.11:6123/user/resourcemanager(00000000000000000000000000000000)
2019-10-24 06:39:25,239 INFO ....JobMaster - Resolved ResourceManager address, beginning registration
2019-10-24 06:39:25,239 INFO ....JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2019-10-24 06:39:25,240 INFO ....MesosResourceManager - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:39:25,245 INFO ....JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2019-10-24 06:39:25,246 INFO ....SlotPoolImpl - Requesting new slot [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} from resource manager.
2019-10-24 06:39:25,246 INFO ....MesosResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, managedMemoryInMB=-1} for job 4888c165c1a9eae4dd69cd533c66a5c1 with allocation id 04a6f120eed690c9c7cdc8c7c0f67ed7.
在这个阶段,原木会停留几分钟。在Mesos中没有任何活动。
在我看来,问题显然出在flink应用程序主/资源管理器上,因为mesos中再次出现了问题。此时mesos主日志中没有任何活动(mesos控制台没有显示任何内容,没有提交或执行新任务,两个代理都没有执行任何操作)。
几分钟后发生超时,作业在失败状态下停止。
超时后第二个作业失败
2019-10-24 06:44:25,245 INFO ....ExecutionGraph - DataSource (at $line115.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:24) org.apache.flink.api.java.io.CollectionInputFormat)) (1/1) (d7d19ea639a4adf9246f905f1765b338) switched from SCHEDULED to FAILED.
.... NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:460)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
(stack traces goes on)
2019-10-24 06:44:25,261 INFO ....executiongraph...AdaptedRestartPipelinedRegionStrategyNG - Fail to pass the restart strategy validation in region failover. Fallback to fail global.
(...)
(a few more stack traces)
(...)
2019-10-24 06:44:25,296 INFO ....ExecutionGraph - Discarding the results produced by task execution d7d19ea639a4adf9246f905f1765b338.
2019-10-24 06:44:25,296 INFO ....SlotPoolImpl - Pending slot request [SlotRequestId{ddfc7c3f479469c5e6c72e505299cf1d}] timed out.
2019-10-24 06:44:25,301 INFO ....dispatcher.StandaloneDispatcher - Job 4888c165c1a9eae4dd69cd533c66a5c1 reached globally terminal state FAILED.
2019-10-24 06:44:25,340 INFO ....history.FsJobArchivist - Job 4888c165c1a9eae4dd69cd533c66a5c1 has been archived at file:/var/lib/flink/completed_jobs/4888c165c1a9eae4dd69cd533c66a5c1.
2019-10-24 06:44:25,341 INFO ....JobMaster - Stopping the JobMaster for job Flink Java Job at Thu Oct 24 06:39:24 UTC 2019(4888c165c1a9eae4dd69cd533c66a5c1).
2019-10-24 06:44:25,344 INFO ....SlotPoolImpl - Suspending SlotPool.
2019-10-24 06:44:25,344 INFO ....JobMaster - Close ResourceManager connection 9c3fc959e219c30fba3395bb7be9823c: JobManager is shutting down..
2019-10-24 06:44:25,344 INFO ....SlotPoolImpl - Stopping SlotPool.
2019-10-24 06:44:25,344 INFO ....jobmaster.JobManagerRunner - JobManagerRunner already shutdown.
2019-10-24 06:44:25,344 INFO ....mesos....MesosResourceManager - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@192.168.10.11:6123/user/jobmanager_1 for job 4888c165c1a9eae4dd69cd533c66a5c1 from the resource manager.
为了记录在案,我提交的工作是一个简单的批HelloWorld从http://diegoreico.com/environments/runningflinkgclusterwithzeppelin/
更新时间:2019-10-24-下午1:00
提供以下具有调试级别的日志提取:
flink应用程序主启动:http://www.niceideas.ch/flink/1_flink_app_master_startup.txt
第一个作业执行(成功):https://www.niceideas.ch/flink/2_first_job_execution_successful.txt
任务管理器空闲超时终止:https://www.niceideas.ch/flink/3_task_manager_idle_kill.txt
第二个作业执行(失败):https://www.niceideas.ch/flink/4_second_job_execution_failed.txt
更新时间:2019-10-24-下午5:30
任务管理器(flink worker)的日志可在以下位置获得:https://www.niceideas.ch/flink/5_task_manger_logs.txt
暂无答案!
目前还没有任何答案,快来回答吧!