我们正在kubernetes和azure上运行一个5节点的flink集群(每个8gbram,总共40个插槽)。我们正在运行四个作业,所有作业都使用来自kafka的数据(每个作业在不同的消费群体上)。几天前,随着数据负载的增加,我们将producer移到5个kafka分区上生成数据,并将作业并行性移到5个。从那时起,我们的一个任务管理器时不时(平均每小时)出现以下异常:
NFO|N||-|||Flink-4jc| 2019-01-22 16:00:32,032 Task:917 - org=[] - Map (2/5) (949a8349e7bdcf3fe3b8f992f52d249c) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.partition.PartitionNotFoundException: Partition 86656e59799eb529f24bac704ea06790@b1955e1a072e3b2f9e1f969fea509841 not found.
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.failPartitionRequest(RemoteInputChannel.java:273)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.retriggerSubpartitionRequest(RemoteInputChannel.java:182)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.retriggerPartitionRequest(SingleInputGate.java:400)
at org.apache.flink.runtime.taskmanager.Task.onPartitionStateUpdate(Task.java:1293)
at org.apache.flink.runtime.taskmanager.Task.lambda$triggerPartitionProducerStateCheck$1(Task.java:1150)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
例外情况发生在不同的任务和不同的作业上。我读过以下帖子:http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/partitionnotfoundexception-when-running-in-yarn-session-td16081.html 这给了我一些可能导致异常的提示,但在我的例子中,我仍然无法找出导致异常的原因(增加超时和网络缓冲区大小没有帮助,我不明白为什么jar文件的大小很重要)
有人能告诉我如何调查发生了什么,我应该打开什么日志,要更改什么配置等等吗?如果需要其他细节,我很乐意提供。
谢谢!
1条答案
按热度按时间4uqofj5v1#
也许你的网卡已经满了