apache-kafka: Kafka broker spends a long time on index recovery and eventually shuts down

Posted by nc1teljy on 2022-11-01 in Apache

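I have a 3-broker, no-replica Kafka setup on Azure K8S, using the cp-kafka 5.0.1 helm chart (which uses the 5.0.1 image).
At some point (unfortunately I don't have the logs), one of the Kafka brokers crashed, and when it restarted it went into an endless, painful restart loop. It seems to try to recover some corrupted log entries, takes a looong time, and then hangs up with a SIGTERM. To make things worse, I can no longer consume from or produce to the affected topics in their entirety. Logs are attached below, along with a monitoring screenshot showing Kafka slowly working through the log files and filling the disk cache.
Right now I have log.retention.bytes set to 180 GiB, and I would like to keep it that way without Kafka getting into this endless loop. Suspecting this might be an issue with the old version, I searched the Kafka JIRA for relevant keywords ("still starting up", "SIGTERM", "corrupted index file") but found nothing.
So I can't count on a newer version to fix this, and I don't want to rely on a small retention size either, since the same thing could still happen with a large number of corrupted logs.
So my question is: is there a way to do any or all of the following?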

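  • Prevent the SIGTERM from happening, so that Kafka can recover fully?
  • Allow consumption/production to resume on the unaffected partitions (it seems that only 4 of the 30 partitions have corrupted entries)?
  • Otherwise stop this madness from happening?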

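(If not, I will resort to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) turning on replicas and hoping that helps; (d) improving logging to find out what causes the crash in the first place.)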

Logs

Log loading completed, but cleanup + flush was interrupted:

[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)

Log loading interrupted:

[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949}, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
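
To see how long a startup attempt ran before it was interrupted, the timestamps in excerpts like the ones above can be compared. The following is a minimal Python sketch, not part of the original post; the helper names `parse_ts` and `seconds_until_sigterm` are hypothetical, and the sample lines are taken from the first excerpt.

```python
import re
from datetime import datetime

# Leading timestamp of a broker log line; the closing bracket is treated as optional.
TS = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]?")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS.match(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")


def seconds_until_sigterm(lines):
    """Seconds from the 'Loading logs.' line to the first SIGTERM line, or None."""
    start = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # stack-trace lines and "..." carry no timestamp
        if start is None and "Loading logs." in line:
            start = ts
        elif start is not None and "due to signal SIGTERM" in line:
            return (ts - start).total_seconds()
    return None


run1 = [
    "[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)",
    "[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)",
    "[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)",
]
print(seconds_until_sigterm(run1))
```

For the first excerpt this gives about 2211 seconds (roughly 37 minutes) between "Loading logs." and the SIGTERM.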

Monitoring

bfhwhh0e #1

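I found your question while looking for a solution to a similar problem.
I wonder whether you ever solved this?
In the meantime: who is sending the SIGTERM? It is probably Kubernetes or another orchestrator; you can modify the readiness probe to allow more attempts before it kills the container.
Also make sure your Xmx setting is smaller than the resources allocated to the pod/container. Otherwise Kubernetes will kill the pod (if Kubernetes is what is in play here).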
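
If the SIGTERM does come from Kubernetes, it is typically the kubelet restarting a container whose liveness (or startup) probe has failed `failureThreshold` times in a row; a failing readiness probe on its own only takes the pod out of Service endpoints. As a rough sketch of how much slack such a probe would need, the Python below assumes the allowed time is approximately `initialDelaySeconds + failureThreshold * periodSeconds`. The function name, the 30-second period and the 1.5 safety factor are illustrative assumptions; the 2210438 ms figure is the "Logs loading complete" time from the question's log. Which probe settings a given chart exposes has to be checked against that chart.

```python
import math


def min_failure_threshold(startup_seconds, period_seconds=10,
                          initial_delay_seconds=0, safety_factor=1.5):
    """Smallest failureThreshold that tolerates a failing probe for
    startup_seconds * safety_factor before the container is restarted.

    Assumes allowed time ~= initialDelaySeconds + failureThreshold * periodSeconds.
    """
    budget = startup_seconds * safety_factor - initial_delay_seconds
    return max(1, math.ceil(budget / period_seconds))


# Log recovery time reported by the broker in the question: 2210438 ms.
recovery_seconds = 2210438 / 1000

threshold = min_failure_threshold(recovery_seconds, period_seconds=30)

# Field names follow the Kubernetes Probe spec; the values are illustrative.
probe = {"periodSeconds": 30, "failureThreshold": threshold}
print(probe)
```

With these assumed values the probe would need to tolerate 111 consecutive failures, i.e. about 55 minutes, before a restart is triggered.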

kyxcudwk #2

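I ran into the same problem and solved it by increasing two values in the Kafka config (the server.properties file):
zookeeper.connection.timeout.ms
zookeeper.session.timeout.ms
I raised both of them to 18000. Setting both to the same value seems pointless (at least according to https://kafka.apache.org/documentation/#zookeeper.connection.timeout.ms). But in any case, it solved the problem for me.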
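
For reference, the Kafka documentation says that zookeeper.connection.timeout.ms falls back to the value of zookeeper.session.timeout.ms when it is not set, which is why setting both to the same number looks redundant. A small Python sketch of applying the change to the text of a server.properties file follows; the helper `set_properties` and the sample file contents (including the 6000 value) are hypothetical, and only the two keys and the value 18000 come from the answer above.

```python
def set_properties(text, updates):
    """Return properties text with the given keys set.

    Existing assignments are replaced in place; missing keys are appended.
    Comment lines and unrelated keys are left untouched.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.split("=", 1)[0].strip()
        if stripped and not stripped.startswith("#") and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# Hypothetical sample contents, for illustration only.
original = "broker.id=2\nzookeeper.session.timeout.ms=6000\n"

updated = set_properties(original, {
    "zookeeper.connection.timeout.ms": 18000,
    "zookeeper.session.timeout.ms": 18000,
})
print(updated)
```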

8nuwlpux #3

I ran into a similar problem when using the bitnami kafka chart.

[2022-10-25 11:07:49,596] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2022-10-25 11:07:49,605] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
[2022-10-25 11:07:49,609] ERROR [KafkaServer id=0] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:705)
    at kafka.Kafka$.$anonfun$main$3(Kafka.scala:100)
    at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
    at java.base/java.lang.Thread.run(Thread.java:829)
[2022-10-25 11:07:49,611] ERROR Halting Kafka. (kafka.Kafka$)

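I increased livenessProbe.initialDelaySeconds and it now works fine. The liveness probe was failing because the Kafka broker was still loading the existing topic snapshots.
But I don't understand why this shows up as a SIGTERM signal!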
