Seata: Etcd3 service-registration keep-alive does not retry after an exception, so the service cannot recover automatically after the network comes back

2eafrhcq  posted 2 months ago in  Etcd

Ⅰ. Issue Description

seata-server registers itself with etcd3. When the keep-alive hits an IOException caused by a network problem, it exits the keep-alive loop instead of retrying, so Seata cannot restore the registration automatically once the network problem is resolved.

Ⅱ. Describe what happened

Because there is no retry after the keep-alive fails, the node's key no longer exists in etcd, and the keep-alive does not resume automatically even after the network recovers.
2021-04-27 02:44:09.251 ERROR --- [ registry-etcd3_1_1_2] i.s.d.r.etcd3.EtcdRegistryServiceImpl : EtcdLifeKeeper==> java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.process(EtcdRegistryServiceImpl.java:342)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.call(EtcdRegistryServiceImpl.java:366)
at io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl$EtcdLifeKeeper.call(EtcdRegistryServiceImpl.java:322)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:533)
at io.etcd.jetcd.Util.lambda$toCompletableFutureWithRetry$2(Util.java:140)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 common frames omitted
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:530)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:482)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.etcd.jetcd.ClientConnectionManager$AuthTokenInterceptor$1$1.onClose(ClientConnectionManager.java:302)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
... 3 common frames omitted
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1128)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 common frames omitted


Ⅲ. Describe what you expected to happen

After the exception occurs, the keep-alive should retry automatically and resume once the network recovers.

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. Start seata-server with etcd3 as the service registry.
  2. Introduce a network problem so that seata-server cannot reach etcd.
  3. Restore the network after a while; seata-server still does not resume the keep-alive, and the issue is reproduced.

Ⅴ. Anything else we need to know?

io.seata.discovery.registry.etcd3.EtcdRegistryServiceImpl.EtcdLifeKeeper#process
The `throw Exception` here causes the problem; swallowing the exception should yield the expected behavior.
Update: in practice, swallowing the exception did not fix the problem. Still investigating; I will follow up when there are new findings.
Update 2: further investigation shows that after the connection has been down past the lease timeout, etcd deletes the key registered by seata-server. So even with the exception swallowed, the subsequent keep-alive calls are useless because the key is already gone. The node needs to be re-registered, not merely kept alive.
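The fix implied by Update 2 above is that each keeper tick must check whether the key still exists and fall back to a fresh registration when it does not, and must never let an exception kill the keeper thread. A minimal sketch of that logic follows; `RegistryOps` is a hypothetical interface standing in for the real jetcd lease/put calls, not the actual Seata API:

```java
// Hypothetical interface abstracting the etcd client calls used by the keeper.
interface RegistryOps {
    boolean keyExists();                // is our registration key still in etcd?
    void register();                    // grant a new lease and put the key again
    void keepAlive() throws Exception;  // renew the current lease
}

/** One iteration of a fault-tolerant life keeper: re-register when the key
 *  has expired server-side, and swallow transient errors so the keeper
 *  thread survives network problems (exiting on exception is the bug). */
class LifeKeeperStep {
    static void run(RegistryOps ops) {
        try {
            if (ops.keyExists()) {
                ops.keepAlive();   // normal path: renew the lease
            } else {
                ops.register();    // key gone (lease expired) -> re-register
            }
        } catch (Exception e) {
            // Deliberately not rethrown: the next tick retries, so a
            // transient IOException no longer terminates the keeper.
        }
    }
}
```

The key design point is that keep-alive and registration are distinct recovery actions: renewing a lease whose key was already deleted does nothing useful, which is why swallowing the exception alone was not enough.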

Ⅵ. Environment:

  • JDK version: 1.8.0_201
  • OS: CentOS 7
  • Others:

67up9zun1#

What's needed here is a compensation mechanism, the same as when services register with ZooKeeper as the registry.

You could check whether the etcd client library supports something like ZooKeeper's CuratorFramework component (see the implementation in the screenshot above), i.e. notifying a listener after reconnection, so the node can automatically re-register once the network recovers.


kiayqfof2#

If no such listener exists, a scheduled task that checks whether the service node is still alive would also work: if the registered key has disappeared, re-register it periodically as compensation. If you're interested, would you like to take on this bug fix? We can discuss any questions right here in the issue.
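The periodic compensation suggested above can be sketched with a standard `ScheduledExecutorService`; `checkAndReRegister` is a hypothetical callback standing in for the real "check key, re-register if missing" logic against etcd:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a scheduled compensation task: every few seconds, run a
 *  check that re-registers the node if its key has disappeared.
 *  Failures are swallowed so that one failed check (e.g. during a
 *  network partition) does not stop future checks. */
class CompensationScheduler {
    private final ScheduledExecutorService pool =
            Executors.newSingleThreadScheduledExecutor();

    void start(Runnable checkAndReRegister, long periodSeconds) {
        pool.scheduleWithFixedDelay(() -> {
            try {
                checkAndReRegister.run();
            } catch (Exception e) {
                // swallow: the next tick retries the check
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        pool.shutdownNow();
    }
}
```

`scheduleWithFixedDelay` (rather than `scheduleAtFixedRate`) keeps a fixed gap after each check completes, so slow checks during a partition do not pile up.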


vxf3dgd44#

I took a look; there doesn't seem to be a way to listen for connection re-establishment.
I've implemented the scheduled-compensation approach instead, and it passes self-testing in my local environment.
I expect to submit the code within the next couple of days.
@a364176773


t30tvxxf5#

I took a look; there doesn't seem to be a way to listen for connection re-establishment.
I've implemented the scheduled-compensation approach instead, and it passes self-testing in my local environment.
I expect to submit the code within the next couple of days.
@a364176773

Thanks for participating; looking forward to your PR.
