Zookeeper/Kafka brokers do not attempt to restart after session timeout

jk9hmnmh posted on 2022-12-16 in Apache

I have 3 broker containers and 1 Zookeeper container running in a Docker stack. Both the Zookeeper container and the broker containers keep stopping after a few days (less than a week) of running in an idle state.
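A minimal sketch of this kind of stack with the Bitnami images is below, just for context; the service names, image tags and the ALLOW_* convenience flags are placeholders, not my exact compose file:

version: "3.8"
services:
  zookeeper:
    image: bitnami/zookeeper:3.8
    environment:
      # convenience flag for local/dev setups
      - ALLOW_ANONYMOUS_LOGIN=yes
  kafka1:
    image: bitnami/kafka:3.2
    environment:
      - KAFKA_CFG_BROKER_ID=1
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
  # kafka2 and kafka3 are identical except for KAFKA_CFG_BROKER_ID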

  • This is one of the broker logs where I see an error, but I cannot identify how to handle it:
[2022-07-13 02:22:30,109] INFO [UnifiedLog partition=messages-processed-0, dir=/bitnami/kafka/data] Truncating to 6 has no effect as the largest offset in the log is 5 (kafka.log.UnifiedLog)
[2022-07-13 02:23:33,766] INFO [Controller id=1] Newly added brokers: , deleted brokers: 2, bounced brokers: , all live brokers: 1,3 (kafka.controller.KafkaController)
[2022-07-13 02:23:33,766] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)
[2022-07-13 02:23:34,226] INFO [Controller id=1] Broker failure callback for 2 (kafka.controller.KafkaController)
[2022-07-13 02:23:34,227] INFO [Controller id=1 epoch=3] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
[2022-07-13 02:23:36,414] ERROR [Controller id=1 epoch=3] Controller 1 epoch 3 failed to change state for partition __consumer_offsets-30 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-30 under strategy OfflinePartitionLeaderElectionStrategy(false)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:424)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:421)
    at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:332)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:238)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
    at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:74)
    at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:59)
    at kafka.controller.KafkaController.onReplicasBecomeOffline(KafkaController.scala:627)
    at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:597)
    at kafka.controller.KafkaController.processBrokerChange(KafkaController.scala:1621)
    at kafka.controller.KafkaController.process(KafkaController.scala:2495)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:52)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:130)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:133)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
  • This is part of the Zookeeper log from around the same time:
2022-07-13 02:14:45,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760000, timeout of 18000ms exceeded
2022-07-13 02:15:13,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760007, timeout of 18000ms exceeded
2022-07-13 02:15:29,832 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760006 for client /172.18.0.5:55356, probably expired
2022-07-13 02:15:42,419 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760000 for client /172.18.0.4:59474, probably expired
2022-07-13 02:15:52,350 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760007 for client /172.18.0.6:34406, probably expired
2022-07-13 02:16:49,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000b, timeout of 18000ms exceeded
2022-07-13 02:17:12,434 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000b for client /172.18.0.5:56264, probably expired
2022-07-13 02:19:17,067 [myid:] - WARN  [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:60150, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-13 02:23:29,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000e, timeout of 18000ms exceeded
2022-07-13 02:24:05,059 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000e for client /172.18.0.5:32886, probably expired
2022-07-13 03:48:55,209 [myid:] - WARN  [NIOWorkerThread-2:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:33508, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

I've seen that maxSessionTimeout is set to 40000ms in my zoo.cfg (Zookeeper side), while a timeout of 18000ms is set in server.properties on the broker side. Should I increase one of those?
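For context, this is roughly how the two settings relate across the two files; the values restate what I already have, the tickTime line is just the Zookeeper default, and comments sit on their own lines because both files are parsed as Java properties:

# zoo.cfg (Zookeeper side)
tickTime=2000
# upper bound Zookeeper will grant to any client session
maxSessionTimeout=40000

# server.properties (broker side)
# session timeout the broker requests from Zookeeper; must stay <= maxSessionTimeout above
zookeeper.session.timeout.ms=18000
zookeeper.connection.timeout.ms=18000

If I understand the relationship correctly, raising zookeeper.session.timeout.ms (say to 30000) would still fit under the current maxSessionTimeout=40000, so only the broker side would need a change.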

**kafka-topics --describe** output for one of the topics that went down: https://prnt.sc/r3hU5wv3jK-h
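(The screenshot was taken with commands along these lines, run inside a broker container; the bootstrap address is a placeholder:)

kafka-topics.sh --describe --topic messages-processed --bootstrap-server localhost:9092
kafka-topics.sh --describe --topic __consumer_offsets --bootstrap-server localhost:9092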

Image used for the broker container


46scxncf1#

In the screenshot you uploaded, broker 1 seems to have crashed on its own, which may be unrelated to the logs you've shown.
The logs you did show ("Failed to elect leader") occur on the other brokers because __consumer_offsets has ReplicationFactor: 1. That is why you see Leader: none, Replicas: 1. In a healthy Kafka cluster, partition leaders should never be none.
This can be fixed on an existing Kafka cluster using kafka-reassign-partitions, but since you are using Docker, the best way to fix it is to first wipe all data, stop the containers, then add the missing KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3 environment variable and restart all containers.
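A minimal sketch of that change for one broker service, assuming the Bitnami naming convention (the service name, the other variables shown, and the extra transaction-state line are my own additions, not taken from your compose file):

services:
  kafka1:
    image: bitnami/kafka:3.2
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
      # maps to offsets.topic.replication.factor in server.properties
      - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3
      # optional companion setting for the transaction-state topic
      - KAFKA_CFG_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=3

The kafka-reassign-partitions route would instead need a JSON file listing every __consumer_offsets partition with three replicas, which is more work than recreating the topic on an idle cluster.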
Alternatively, as mentioned in the comments, don't run 3 brokers on a single host. If you are on one host, the timeouts shouldn't matter much anyway, since all network requests are local.


pxq42qpu2#

You can try the following.

docker volume rm sentry-kafka
docker volume rm sentry-zookeeper
docker volume rm sentry_onpremise_sentry-kafka-log
docker volume rm sentry_onpremise_sentry-zookeeper-log

./install.sh  # to create Kafka partitions
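Once the install script has recreated the topics, a quick sanity check could look like this; I'm assuming here that the Kafka service in the Compose file is called kafka and listens on kafka:9092, so adjust to your setup:

docker-compose exec kafka kafka-topics --describe --topic __consumer_offsets --bootstrap-server kafka:9092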
