Kafka 1.1.1-cp1 (Edit 4: I eventually filed a bug with Kafka about this: https://issues.apache.org/jira/browse/kafka-7447)
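I have three brokers, with min.insync.replicas = 2 on all topics and offsets.commit.required.acks = -1.
As you would expect, when I stop a broker it stops leading the partitions it was leading, and everything carries on normally (consumers consume, producers produce).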
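Why a single stopped broker is tolerated follows from simple arithmetic on the settings above. The sketch below is illustrative only: the function names are hypothetical, and the replication factor of 3 is taken from offsets.topic.replication.factor in the server.properties shown further down.

```python
def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can be offline while acks=-1 (all) writes are still accepted."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


def can_accept_write(isr_size: int, min_insync_replicas: int) -> bool:
    """A write with acks=-1 is accepted only if the ISR is at least min.insync.replicas."""
    return isr_size >= min_insync_replicas


# Three replicas with min.insync.replicas=2: one broker may be down.
failures_ok = tolerated_broker_failures(3, 2)
```

With one broker stopped the ISR has two members, so `can_accept_write(2, 2)` holds; an ISR of one (as seen later in the logs) would reject acks=-1 writes.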
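The problem starts when I bring the broker back. It seems to cause chaos in the cluster, and some __consumer_offsets partitions are immediately truncated to 0.
Below is a chronological selection of logs for one affected __consumer_offsets partition (one that was originally led by the broker that crashed). The same sequence shows up in the logs of all three brokers.
Essentially, the broker I bounced does not seem to understand who the new leader is, truncates to 0, and then convinces the other replicas to truncate to 0 as well.
prod-kafka-2: (just started)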
[2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
prod-kafka-3: (sees replica 1 come back)
[2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition)
prod-kafka-2:
[2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous Leader Epoch was: 77 (kafka.cluster.Partition)
[2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling loading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:24:03,288] INFO [GroupMetadataManager brokerId=1] Finished loading offsets and group metadata from __consumer_offsets-29 in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
prod-kafka-3: struggles to agree with prod-kafka-2. Kicks it out of the ISR, then fights with ZooKeeper. Perhaps 2 and 3 both think they are the leader?
[2018-09-17 09:24:15,372] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:15,377] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
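prod-kafka-2: rudely kicks the other two replicas out of the ISR list, even though 2 is the one we just restarted and is therefore the one most likely to be behind (remember, it has already decided to truncate the topic to 0!)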
[2018-09-17 09:24:16,481] INFO [Partition __consumer_offsets-29 broker=1] Shrinking ISR from 0,2,1 to 1 (kafka.cluster.Partition)
prod-kafka-3: still fighting with ZooKeeper. Eventually loses.
[2018-09-17 09:24:20,374] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:20,378] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:25,347] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:25,350] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:30,359] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:30,362] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:35,365] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:35,368] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:40,352] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:40,354] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:45,422] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:45,425] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:50,345] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:50,348] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:24:55,444] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:24:55,449] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:00,340] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:00,343] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:05,374] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:05,377] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:10,342] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:10,344] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:15,348] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:15,351] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:20,338] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:20,340] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:25,338] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:25,340] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:30,382] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:30,387] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:35,341] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:35,344] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:40,460] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:40,465] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
[2018-09-17 09:25:45,335] INFO [Partition __consumer_offsets-29 broker=2] Shrinking ISR from 0,2,1 to 0,2 (kafka.cluster.Partition)
[2018-09-17 09:25:45,338] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
prod-kafka-1: suddenly gets confused and also re-initializes to 0, because prod-kafka-2 has apparently become the leader.
[2018-09-17 09:25:48,807] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
prod-kafka-3: finally decides that prod-kafka-2 is in charge, and truncates accordingly.
[2018-09-17 09:25:48,806] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:48,807] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(1,prod-kafka-2.c.i-lastfm-prod.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:48,809] INFO [GroupMetadataManager brokerId=2] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:48,810] INFO [GroupMetadataManager brokerId=2] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:48,950] WARN [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:48,951] INFO [Log partition=__consumer_offsets-29, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
prod-kafka-1: leadership inauguration confirmed.
[2018-09-17 09:25:50,207] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
prod-kafka-2: now that it has announced its dominance via ZooKeeper, prod-kafka-3 is added to the ISR list.
[2018-09-17 09:25:50,210] INFO [Partition __consumer_offsets-29 broker=1] Expanding ISR from 1 to 1,2 (kafka.cluster.Partition)
prod-kafka-1: still struggling to accept reality, but eventually it also truncates to 0.
[2018-09-17 09:25:51,430] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:52,615] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:53,637] INFO [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Remote broker is not the leader for partition __consumer_offsets-29, which could indicate that the partition is being moved (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:54,150] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions __consumer_offsets-29 (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:54,151] INFO [ReplicaFetcherManager on broker 0] Added fetcher for partitions List([__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(1,prod-kafka-2.c.i-lastfm-prod.internal,9092)] ) (kafka.server.ReplicaFetcherManager)
[2018-09-17 09:25:54,151] INFO [GroupMetadataManager brokerId=0] Scheduling unloading of offsets and group metadata from __consumer_offsets-29 (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:54,153] INFO [GroupMetadataManager brokerId=0] Finished unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached groups. (kafka.coordinator.group.GroupMetadataManager)
[2018-09-17 09:25:54,261] WARN [ReplicaFetcher replicaId=0, leaderId=1, fetcherId=0] Based on follower's leader epoch, leader replied with an unknown offset in __consumer_offsets-29. The initial fetch offset 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
[2018-09-17 09:25:54,261] INFO [Log partition=__consumer_offsets-29, dir=/var/lib/kafka/data] Truncating to 0 has no effect as the largest offset in the log is -1 (kafka.log.Log)
prod-kafka-2: completes its coup over the consumer offsets, which are now all 0.
[2018-09-17 09:25:56,244] INFO [Partition __consumer_offsets-29 broker=1] Expanding ISR from 1,2 to 1,2,0 (kafka.cluster.Partition)
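The ISR changes scattered through the excerpts above are easier to follow as a single timeline. The following is a minimal sketch of how they could be extracted, assuming the log line layout shown in this post; the `IsrChange` type and `parse_isr_changes` function are names made up for illustration, and the sample lines are copied from the logs above.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches the "Expanding ISR" / "Shrinking ISR" lines emitted by kafka.cluster.Partition.
ISR_CHANGE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \w+ "
    r"\[Partition (?P<partition>\S+) broker=(?P<broker>\d+)\] "
    r"(?P<action>Expanding|Shrinking) ISR from (?P<old>[\d,]+) to (?P<new>[\d,]+) "
)


class IsrChange(NamedTuple):
    timestamp: str
    broker: int          # broker.id of the broker that logged the change
    partition: str
    action: str          # "Expanding" or "Shrinking"
    old_isr: List[int]
    new_isr: List[int]


def parse_isr_changes(lines: Iterable[str]) -> List[IsrChange]:
    """Return every ISR change found in the given log lines, in input order."""
    changes = []
    for line in lines:
        m = ISR_CHANGE.match(line)
        if m is None:
            continue  # not an ISR change line (e.g. the zkVersion warnings)
        changes.append(IsrChange(
            m["ts"], int(m["broker"]), m["partition"], m["action"],
            [int(b) for b in m["old"].split(",")],
            [int(b) for b in m["new"].split(",")],
        ))
    return changes


SAMPLE = [
    "[2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition)",
    "[2018-09-17 09:24:15,377] INFO [Partition __consumer_offsets-29 broker=2] Cached zkVersion [1582] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)",
    "[2018-09-17 09:24:16,481] INFO [Partition __consumer_offsets-29 broker=1] Shrinking ISR from 0,2,1 to 1 (kafka.cluster.Partition)",
    "[2018-09-17 09:25:50,210] INFO [Partition __consumer_offsets-29 broker=1] Expanding ISR from 1 to 1,2 (kafka.cluster.Partition)",
]

for change in parse_isr_changes(SAMPLE):
    print(change.timestamp, f"broker={change.broker}", change.action, change.old_isr, "->", change.new_isr)
```

Sorting the combined output from all three brokers by timestamp makes the moment broker 1 shrinks the ISR to itself stand out.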
Edit:
As requested, here is the Kafka server.properties file:
broker.id=1
default.replication.factor=3
auto.create.topics.enable=false
min.insync.replicas=2
num.network.threads=12
num.io.threads=16
num.replica.fetchers=6
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/var/lib/kafka/data
num.partitions=1
num.recovery.threads.per.data.dir=4
offsets.retention.minutes=10080
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
log.flush.interval.messages=20000
log.flush.interval.ms=10000
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=60000
zookeeper.connect=prod-kafka-1:2181,prod-kafka-2:2181,prod-kafka-3:2181
zookeeper.connection.timeout.ms=6000
confluent.support.metrics.enable=false
confluent.support.customer.id=anonymous
group.initial.rebalance.delay.ms=3000
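Note that this file does not set auto.leader.rebalance.enable, so the broker default applies. Below is a rough sketch of checking the effective value of a setting in a properties file. The parser is deliberately simple (no line continuations or escapes), the helper names are hypothetical, and the default of `true` is taken from the Kafka broker configuration documentation, so verify it for your version.

```python
from typing import Dict, Optional

# Assumed broker default (per the Kafka broker configuration docs); verify for your version.
ASSUMED_DEFAULTS = {
    "auto.leader.rebalance.enable": "true",
}


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and '#' or '!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_setting(props: Dict[str, str], key: str) -> Optional[str]:
    """Explicit value if present, otherwise the assumed default (or None if unknown)."""
    return props.get(key, ASSUMED_DEFAULTS.get(key))


# A few lines from the server.properties above.
example = parse_properties("""
broker.id=1
min.insync.replicas=2
offsets.topic.replication.factor=3
""")
print(effective_setting(example, "auto.leader.rebalance.enable"))
```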
And here is the zookeeper.properties file:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=prod-kafka-1:2888:3888
server.2=prod-kafka-2:2888:3888
server.3=prod-kafka-3:2888:3888
autopurge.purgeInterval=12
autopurge.snapRetainCount=6
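Edit 2: Upgrading to Kafka 2.0.0 did not seem to fix the problem.
Could it be that my incoming rate is too high, and that I need to throttle producers when I know a crashed server is about to come back? Does that sound right...?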
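Edit 3: Setting auto.leader.rebalance.enable=false fixed the problem, but now I have to rebalance manually. However, rebalancing manually once all partitions are caught up does not seem to cause any problems.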
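The manual workaround amounts to one rule: move leadership back only after every replica has rejoined the ISR. The sketch below shows that check under stated assumptions. `PartitionState` is a hypothetical structure you would fill in yourself (for example from the Leader, Replicas and Isr columns of `kafka-topics.sh --describe`), and the replica assignment in the example is made up, since the post does not show it.

```python
from typing import Dict, List, NamedTuple


class PartitionState(NamedTuple):
    replicas: List[int]  # assigned replicas; the first entry is the preferred leader
    isr: List[int]       # current in-sync replicas
    leader: int          # current leader broker.id


def fully_in_sync(p: PartitionState) -> bool:
    """True when every assigned replica is in the ISR."""
    return set(p.isr) == set(p.replicas)


def safe_to_rebalance(partitions: Dict[str, PartitionState]) -> bool:
    """Only trigger a preferred-leader election when nothing is still catching up."""
    return all(fully_in_sync(p) for p in partitions.values())


def needs_election(partitions: Dict[str, PartitionState]) -> List[str]:
    """In-sync partitions whose leader is not the preferred (first) replica."""
    return sorted(
        name for name, p in partitions.items()
        if fully_in_sync(p) and p.leader != p.replicas[0]
    )


# Hypothetical assignment [1, 0, 2] for the partition discussed above.
catching_up = {"__consumer_offsets-29": PartitionState([1, 0, 2], [0, 2], 2)}
caught_up = {"__consumer_offsets-29": PartitionState([1, 0, 2], [0, 2, 1], 2)}

print(safe_to_rebalance(catching_up), safe_to_rebalance(caught_up))
```

Once `safe_to_rebalance` holds, the election itself can be triggered with the preferred replica election tool that ships with these Kafka versions.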
1 answer
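Some progress has been made since this question was asked. As noted in the comments, JIRA KAFKA-7447 was created to track the issue. Although that ticket has not been resolved, several people have mentioned that it is related to KAFKA-8896, which has now been fixed (and those who ran into the original problem no longer see it after moving to a version that includes that fix).
So the problem is fixed, and using Kafka 2.2.2 or later (or any version that carries the patch) should ensure you do not run into it.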