Ⅰ. Issue Description
When seata-server uses etcd3 for service registration, an IOException triggered during keep-alive by a network problem causes the keep-alive loop to exit immediately instead of retrying, so seata-server cannot automatically restore its registration after the network recovers.
Ⅱ. Describe what happened
The keep-alive loop did not retry after the exception, so the node's key no longer exists in etcd, and keep-alive cannot recover automatically even after the network is restored:
2021-04-27 02:44:09.251 ERROR --- [ registry-etcd3_1_1_2] i.s.d.r.etcd3.EtcdRegistryServiceImpl : EtcdLifeKeeper==> java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.process(EtcdRegistryServiceImpl.java:342)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.call(EtcdRegistryServiceImpl.java:366)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.call(EtcdRegistryServiceImpl.java:322)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:533)
at io.etcd.jetcd.Util.lambda$toCompletableFutureWithRetry$2(Util.java:140)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:530)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
... 3 common frames omitted
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted
Ⅲ. Describe what you expected to happen
After the exception occurs, keep-alive should retry automatically, and it should resume on its own once the network recovers.
Ⅳ. How to reproduce it (as minimally and precisely as possible)
- Start seata-server with etcd3 as the service discovery mechanism
- Trigger a network problem so that seata-server cannot reach etcd
- Restore the network after a while; seata-server still does not resume keep-alive, reproducing the issue
Ⅴ. Anything else we need to know?
io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl.EtcdLifeKeeper#process
The problem is caused by the `throw Exception` here; swallowing this exception should yield the expected behavior.
Update: in actual testing, swallowing the exception did not solve the problem. Still investigating; I will follow up once there is a new conclusion.
Update 2: further investigation shows that after the disconnection exceeds the lease TTL, etcd deletes the key registered by seata-server. So even if the exception is swallowed, subsequent keep-alives remain ineffective because the key is already gone; the node needs to be re-registered rather than merely kept alive.
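The failure mode described above can be sketched with a minimal, self-contained simulation (plain Java, no real etcd client; `LeaseExpirySimulation` and its map-backed "keyspace" are hypothetical stand-ins for the etcd calls): once the lease expires and etcd deletes the key, refreshing the lease alone can never bring the key back, so recovery must re-register (re-put) the key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simulation of the conclusion above: after a lease expires, etcd deletes
// the registered key, so a keep-alive renewal alone can no longer restore
// it -- the node must be re-registered (re-put), not just kept alive.
public class LeaseExpirySimulation {
    // key -> leaseId; a stand-in for the etcd keyspace
    static final Map<String, Long> store = new ConcurrentHashMap<>();

    // PUT with lease: (re-)registers the node
    static void register(String key, long leaseId) {
        store.put(key, leaseId);
    }

    // keep-alive succeeds only while the key still exists
    static boolean keepAlive(String key) {
        return store.containsKey(key);
    }

    // etcd's lease expiry: the key is removed from the keyspace
    static void expireLease(String key) {
        store.remove(key);
    }

    // Recovery matching the issue's conclusion: if keep-alive fails
    // because the key is gone, re-register instead of retrying keep-alive.
    static void keepAliveOrReRegister(String key, long newLeaseId) {
        if (!keepAlive(key)) {
            register(key, newLeaseId);
        }
    }

    public static void main(String[] args) {
        register("registry-seata-default", 1L);
        expireLease("registry-seata-default");      // outage longer than the TTL
        System.out.println(keepAlive("registry-seata-default"));  // false: keep-alive alone is useless now
        keepAliveOrReRegister("registry-seata-default", 2L);
        System.out.println(keepAlive("registry-seata-default"));  // true: key restored by re-registration
    }
}
```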
Ⅵ. Environment:
- JDK version: 1.8.0_201
- OS: CentOS 7
- Others:
5 answers
Answer 1:
What is needed here is a compensation mechanism, just like services registering with zk when it serves as the registry.
You could check whether the etcd client component supports something like the zkclient-side CuratorFramework component (see the screenshot above for the implementation): notifying a listener after reconnection, so the node can automatically re-register once the network recovers.
Answer 2:
If no such mechanism exists, a scheduled task can check the liveness of the service node: if the registered key has disappeared, re-register it periodically as compensation. That is also a viable solution. If you are interested, would you like to take on this bug fix? We can discuss any questions right here in the issue.
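The scheduled-compensation idea above can be sketched as follows (self-contained Java using only the standard library; `RegistryCompensator`, `checkAndReRegister`, and the map standing in for the etcd KV store are all hypothetical names, not Seata's actual implementation): a periodic task checks whether the node's key is still present and re-registers it if it has disappeared.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the periodic compensation proposed above: a scheduled task
// checks whether the node's key is still registered and, if etcd has
// deleted it (e.g. the lease expired during a network outage), puts it back.
public class RegistryCompensator {
    private final Map<String, String> registry;  // stand-in for the etcd KV store
    private final String key;
    private final String value;

    public RegistryCompensator(Map<String, String> registry, String key, String value) {
        this.registry = registry;
        this.key = key;
        this.value = value;
    }

    // One compensation round: re-register only when the key is missing.
    public boolean checkAndReRegister() {
        if (!registry.containsKey(key)) {
            registry.put(key, value);
            return true;   // compensated
        }
        return false;      // key still alive, nothing to do
    }

    // Run the round at a fixed delay (e.g. a fraction of the lease TTL).
    public ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(this::checkAndReRegister,
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }

    public static void main(String[] args) {
        Map<String, String> etcd = new ConcurrentHashMap<>();
        RegistryCompensator c =
                new RegistryCompensator(etcd, "registry-seata-default", "10.0.0.1:8091");
        System.out.println(c.checkAndReRegister()); // true: key was missing, now registered
        etcd.remove("registry-seata-default");      // simulate lease expiry deleting the key
        System.out.println(c.checkAndReRegister()); // true: compensated again
        System.out.println(c.checkAndReRegister()); // false: key present, nothing to do
    }
}
```

The check interval should be shorter than the lease TTL so a deleted key is restored before clients notice the node as gone.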
Answer 3:
@xingfudeshi
Answer 4:
I took a look; there doesn't seem to be a way to listen for connection re-establishment.
I have changed it to use scheduled compensation instead, and it passes self-testing in my local environment.
I expect to submit the code within the next couple of days.
@a364176773
Answer 5:
> I took a look; there doesn't seem to be a way to listen for connection re-establishment.
> I have changed it to use scheduled compensation instead, and it passes self-testing in my local environment.
> I expect to submit the code within the next couple of days.
> @a364176773
Thanks for your contribution; looking forward to your PR.