Akka cluster does not send a MemberLeft event

Asked by dxxyhpgq on 2023-08-05

I use the MemberLeft event to clean up leftover data. I saw that a node (IP: 192.168.12.212) left and a new node (IP: 192.168.12.250) started, but I cannot find any MemberLeft event in the logs; I can only find MemberJoined events. The node (IP: 192.168.12.212) may have had a JVM problem.
The logs are as follows:

[INFO] 2023-07-11 08:13:23,195 [ClusterSystem-akka.actor.default-dispatcher-20] a.m.c.b.i.BootstrapCoordinator - Looking up [Lookup(guandata-server,None,Some(tcp))]
[INFO] 2023-07-11 08:13:23,195 [ClusterSystem-akka.actor.default-dispatcher-20] a.d.k.KubernetesApiServiceDiscovery - Querying for pods with label selector: [app=guandata-server]. Namespace: [default]. Port: [None]
[INFO] 2023-07-11 08:13:23,330 [ClusterSystem-akka.actor.default-dispatcher-3] a.m.c.b.i.BootstrapCoordinator - Located service members based on: [Lookup(guandata-server,None,Some(tcp))]: [ResolvedTarget(192-168-64-250.default.pod.cluster.local,None,Some(/192.168.64.250)), ResolvedTarget(192-168-66-128.default.pod.cluster.local,None,Some(/192.168.66.128)), ResolvedTarget(192-168-65-229.default.pod.cluster.local,None,Some(/192.168.65.229))], filtered to [192-168-64-250.default.pod.cluster.local:0, 192-168-66-128.default.pod.cluster.local:0, 192-168-65-229.default.pod.cluster.local:0]
[INFO] 2023-07-11 08:13:23,351 [ClusterSystem-akka.actor.default-dispatcher-26] a.m.c.b.i.BootstrapCoordinator - Located service members based on: [Lookup(guandata-server,None,Some(tcp))]: [ResolvedTarget(192-168-64-250.default.pod.cluster.local,None,Some(/192.168.64.250)), ResolvedTarget(192-168-66-128.default.pod.cluster.local,None,Some(/192.168.66.128)), ResolvedTarget(192-168-65-229.default.pod.cluster.local,None,Some(/192.168.65.229))], filtered to [192-168-64-250.default.pod.cluster.local:0, 192-168-66-128.default.pod.cluster.local:0, 192-168-65-229.default.pod.cluster.local:0]
[INFO] 2023-07-11 08:13:23,445 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Contact point [akka://ClusterSystem@192.168.66.128:25520] returned [2] seed-nodes [akka://ClusterSystem@192.168.65.229:25520, akka://ClusterSystem@192.168.66.128:25520]
[INFO] 2023-07-11 08:13:23,461 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Contact point [akka://ClusterSystem@192.168.65.229:25520] returned [2] seed-nodes [akka://ClusterSystem@192.168.65.229:25520, akka://ClusterSystem@192.168.66.128:25520]
[INFO] 2023-07-11 08:13:23,465 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.i.BootstrapCoordinator - Joining [akka://ClusterSystem@192.168.64.250:25520] to existing cluster [akka://ClusterSystem@192.168.65.229:25520, akka://ClusterSystem@192.168.66.128:25520]
[INFO] 2023-07-11 08:13:23,633 [ClusterSystem-akka.actor.default-dispatcher-36] a.m.c.b.c.HttpClusterBootstrapRoutes - Bootstrap request from 192.168.64.250:49804: Contact Point returning 0 seed-nodes []
[INFO] 2023-07-11 08:13:23,736 [ClusterSystem-akka.actor.default-dispatcher-36] a.a.LocalActorRef - Message [akka.management.cluster.bootstrap.contactpoint.HttpBootstrapJsonProtocol$SeedNodes] from Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] to Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ClusterSystem/system/bootstrapCoordinator/contactPointProbe-192-168-64-250.default.pod.cluster.local-8558#-1278944557] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[INFO] 2023-07-11 08:13:23,984 [ClusterSystem-akka.actor.default-dispatcher-20] a.c.Cluster - Cluster Node [akka://ClusterSystem@192.168.64.250:25520] - Received InitJoinAck message from [Actor[akka://ClusterSystem@192.168.66.128:25520/system/cluster/core/daemon#-1974909478]] to [akka://ClusterSystem@192.168.64.250:25520]
[INFO] 2023-07-11 08:13:24,094 [ClusterSystem-akka.actor.default-dispatcher-3] a.c.Cluster - Cluster Node [akka://ClusterSystem@192.168.64.250:25520] - Welcome from [akka://ClusterSystem@192.168.66.128:25520]
[INFO] 2023-07-11 08:13:24,110 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - ===> message received: MemberJoined(Member(akka://ClusterSystem@192.168.64.250:25520, Joining))
[INFO] 2023-07-11 08:13:24,115 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://ClusterSystem@192.168.64.250:25520, Joining), Member(akka://ClusterSystem@192.168.65.229:25520, Up), Member(akka://ClusterSystem@192.168.66.128:25520, Up))
[Cluster State] unreachable: List()

[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.65.229 up cluster
[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://ClusterSystem@192.168.64.250:25520, Joining), Member(akka://ClusterSystem@192.168.65.229:25520, Up), Member(akka://ClusterSystem@192.168.66.128:25520, Up))
[Cluster State] unreachable: List()

[INFO] 2023-07-11 08:13:24,124 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.66.128 up cluster
[INFO] 2023-07-11 08:13:24,125 [ClusterSystem-akka.actor.default-dispatcher-36] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List()
[Cluster State] members: List(Member(akka://ClusterSystem@192.168.64.250:25520, Joining), Member(akka://ClusterSystem@192.168.65.229:25520, Up), Member(akka://ClusterSystem@192.168.66.128:25520, Up))
[Cluster State] unreachable: List()

[INFO] 2023-07-11 08:13:24,988 [main] c.g.f.i.GuandataFileSystemFactory - file:/// fileSystem has been created ......
[INFO] 2023-07-11 08:13:25,070 [ClusterSystem-akka.actor.default-dispatcher-3] s.c.MemberEventMonitorActor$ - [MemberUp] ===> 192.168.64.250 up cluster
[INFO] 2023-07-11 08:13:25,071 [ClusterSystem-akka.actor.default-dispatcher-20] a.c.s.SplitBrainResolver - This node is now the leader responsible for taking SBR decisions among the reachable nodes (more leaders may exist).
[INFO] 2023-07-11 08:13:25,070 [ClusterSystem-akka.actor.default-dispatcher-3] s.c.MemberEventMonitorActor$ -
[Cluster State] leaders: List(akka://ClusterSystem@192.168.65.229:25520)
[Cluster State] members: List(Member(akka://ClusterSystem@192.168.64.250:25520, Up), Member(akka://ClusterSystem@192.168.65.229:25520, Up), Member(akka://ClusterSystem@192.168.66.128:25520, Up))
[Cluster State] unreachable: List()

I want to know why the Akka cluster does not send the MemberLeft event. Which event can I use to determine a node's state?

Answer (by c86crjj0):

MemberLeft only happens when a node goes from Up to Leaving, which is the "graceful exit". If a node does not leave gracefully (for example, the JVM running it crashes, a network problem cuts the connection, or the node is too busy with its workload, GC pauses, CPU oversubscription, and so on, to send heartbeats), then it takes a different path through the member lifecycle, via Down.
The MemberRemoved event is probably what you are looking for, especially if you plan to run the cleanup from other nodes of the cluster, regardless of whether the removal was graceful.
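As a rough sketch of what that subscription could look like (classic Akka APIs in Scala; the actor name and log messages below are hypothetical, not the asker's MemberEventMonitorActor), subscribing to both MemberLeft and MemberRemoved lets you observe graceful leaves while keying the actual cleanup off MemberRemoved:

    import akka.actor.{Actor, ActorLogging, Props}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent._

    // Hypothetical monitor: subscribes to member lifecycle events so that
    // cleanup can key off MemberRemoved rather than MemberLeft alone.
    class MemberRemovedMonitor extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        // InitialStateAsEvents replays the current cluster state as events on subscribe
        cluster.subscribe(self, InitialStateAsEvents, classOf[MemberLeft], classOf[MemberRemoved])

      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive: Receive = {
        case MemberLeft(member) =>
          // Only seen for a graceful Up -> Leaving transition
          log.info("Member left gracefully: {}", member.address)
        case MemberRemoved(member, previousStatus) =>
          // Fired for both graceful (previousStatus = Exiting) and downed (previousStatus = Down) nodes
          log.info("Member removed: {} (previous status: {})", member.address, previousStatus)
          // run cleanup for member.address here
      }
    }

    object MemberRemovedMonitor {
      def props: Props = Props(new MemberRemovedMonitor)
    }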
Note that if the entire cluster is brought down you may not receive a MemberRemoved event, and in the non-graceful case there is no intrinsic guarantee that the affected node has actually stopped (ideally the node has the default run-coordinated-shutdown-when-down = on, but technically there is no guarantee this happens within any bounded time: consider what a long GC pause or a suspend/resume would do). So, first, in the scenario where every node in the cluster crashes, manual cleanup may be required (or, if the cleanup is not strictly required for correct operation, you can live without it); and second, if the cleanup would cause problems while a downed node has not actually stopped yet, it may be a good idea to delay the actual cleanup.
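If you do delay cleanup after a non-graceful removal, one possible shape (a sketch only; the delay, timer key, and cleanUp body are assumptions, not anything from the question) is to clean up immediately when previousStatus is Exiting and schedule it later when the node was downed:

    import scala.concurrent.duration._
    import akka.actor.{Actor, ActorLogging, Timers}
    import akka.cluster.{Cluster, Member, MemberStatus}
    import akka.cluster.ClusterEvent._

    // Sketch: clean up right away after a graceful exit, but wait a while after a
    // Down-based removal in case the removed JVM has not actually stopped yet.
    class DelayedCleanupActor extends Actor with ActorLogging with Timers {
      private case class CleanUp(member: Member)

      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, InitialStateAsEvents, classOf[MemberRemoved])

      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive: Receive = {
        case MemberRemoved(member, MemberStatus.Exiting) =>
          cleanUp(member) // graceful leave: coordinated shutdown has completed
        case MemberRemoved(member, _) =>
          // Non-graceful (downed): give the node time to really terminate first
          timers.startSingleTimer(member.uniqueAddress, CleanUp(member), 2.minutes)
        case CleanUp(member) =>
          cleanUp(member)
      }

      private def cleanUp(member: Member): Unit =
        log.info("Cleaning up data left behind by {}", member.address)
    }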
