While reading ZooKeeper's recipe for locks, I got confused. It seems that this recipe for distributed locks cannot guarantee that "at any snapshot in time no two clients think they hold the same lock". But since ZooKeeper is so widely adopted, if there were such a mistake in the reference documentation, someone should have pointed it out long ago, so what did I misunderstand?
Quoting the recipe for distributed locks:
Locks
Fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock. These can be implemented using ZooKeeper. As with priority queues, first define a lock node.
- Call create( ) with a pathname of "locknode/guid-lock-" and the sequence and ephemeral flags set.
- Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect).
- If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
- The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number.
- if exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
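The decision in steps 2–4 can be sketched as a pure function over the child list returned by getChildren( ). This is only an illustration of the ordering logic, not real client code; the `guid-lock-N` names and the helper names are made up here (real ZooKeeper sequence suffixes are zero-padded ten-digit numbers):

```python
def parse_seq(name):
    """Extract the sequence-number suffix ZooKeeper appended to the znode name."""
    return int(name.rsplit("-", 1)[1])

def lock_step(my_node, children):
    """Decide the next action after getChildren() (steps 2-4 of the recipe).

    Returns ("acquired", None) if my_node has the lowest sequence number,
    otherwise ("watch", predecessor) naming the next-lowest znode on which
    the client should call exists() with the watch flag set.
    """
    ordered = sorted(children, key=parse_seq)
    my_index = ordered.index(my_node)
    if my_index == 0:
        return ("acquired", None)
    return ("watch", ordered[my_index - 1])
```

For example, `lock_step("guid-lock-1", ["guid-lock-0", "guid-lock-1"])` returns `("watch", "guid-lock-0")`: Client2 does not get the lock and instead watches Client1's znode, exactly the situation in the scenario below. Watching only the immediate predecessor, rather than the whole directory, is what avoids the herd effect the recipe mentions.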
Consider the following case:
- Client1 successfully acquired the lock (in step 3), with ZooKeeper node "locknode/guid-lock-0";
- Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and is now watching "locknode/guid-lock-0";
- Later, for some reason (say, network congestion), Client1 fails to send a heartbeat message to the ZooKeeper cluster on time, but Client1 is still working away, mistakenly assuming that it still holds the lock.
- But, ZooKeeper may think Client1's session is timed out, and then
- delete "locknode/guid-lock-0",
- send a notification to Client2 (or maybe send the notification first?),
- but can not send a "session timeout" notification to Client1 in time (say, due to network congestion).
- Client2 gets the notification, goes to step 2, gets the only node "locknode/guid-lock-1", which it created itself; thus, Client2 assumes it holds the lock.
- But at the same time, Client1 assumes it holds the lock.
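The interleaving above can be replayed as a toy timeline (purely illustrative, no real ZooKeeper involved): the server's view and Client1's local belief diverge during the window before the expiry notification arrives.

```python
# Toy replay of the scenario. The server's state is the set of live lock
# znodes; each client's belief about holding the lock is purely local.
server_nodes = {"guid-lock-0", "guid-lock-1"}
client1_believes_held = True   # acquired the lock in step 3
client2_believes_held = False  # watching guid-lock-0

# Session timeout fires on the server side: the ephemeral znode is
# deleted and Client2's watch is triggered.
server_nodes.discard("guid-lock-0")
client2_believes_held = min(server_nodes) == "guid-lock-1"  # Client2 re-runs step 2

# The "session expired" event has not reached Client1 yet (network
# congestion), so its local belief is unchanged: at this instant both
# clients think they hold the lock.
assert client1_believes_held and client2_believes_held
```

The point of the replay is that Client1's belief is updated only by messages it receives, so there is necessarily a window in which its belief lags the server's state.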
Is this a valid scenario?
3 Answers

#1 (vdzxcuhz)
The scenario you describe can arise: Client1 thinks it holds the lock, but in fact its session has timed out and Client2 has acquired the lock.

The ZooKeeper client library will notify Client1 that its connection has been dropped (though the client does not learn that the session has expired until it reconnects to a server), so the client can write code that assumes its lock is lost once it has been disconnected for too long. But the thread that is using the lock needs to periodically check that the lock is still valid, which is itself racy.
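The defensive check this answer suggests can be sketched as a purely local timer: give up on the lock well before the server-side session timeout would expire. All names and the safety margin below are illustrative, not part of any ZooKeeper API, and as the answer notes this narrows the race window without eliminating it, because the decision is still local:

```python
def probably_still_holds_lock(disconnected_since, now, session_timeout, margin=0.5):
    """Conservatively decide whether this client may still hold the lock.

    disconnected_since: wall-clock time the client last heard from the
    ensemble (None while connected); now: current time; session_timeout:
    the negotiated session timeout. All times in seconds. Treats the lock
    as lost once the disconnection approaches a fraction (margin) of the
    session timeout, i.e. well before the server would expire the session.
    """
    if disconnected_since is None:
        return True
    return (now - disconnected_since) < session_timeout * margin
```

With a 30 s session timeout and the default margin, a client disconnected for 10 s still assumes it holds the lock, while one disconnected for 20 s assumes it has lost it, even though the server may not yet have expired the session.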
#2 (w6lpcovy)
From the ZooKeeper documentation:

So I don't think the problem you describe can arise. In my opinion, there might be a risk of a hung lock if something happens to the client that created it, but the scenario you describe should not occur.
#3 (nr9pn0ug)
From the Packt book ZooKeeper Essentials:
If there was a partial failure in the creation of znode due to connection loss, it's possible that the client won't be able to correctly determine whether it successfully created the child znode. To resolve such a situation, the client can store its session ID in the znode data field or even as a part of the znode name itself. As a client retains the same session ID after a reconnect, it can easily determine whether the child znode was created by it by looking at the session ID.
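The book's suggestion can be sketched as embedding the session ID in the proposed znode name, then scanning the children after a reconnect to see whether the earlier create actually succeeded. The `<session_id>-lock-<seq>` naming scheme below is an illustrative stand-in for the recipe's "guid-lock-" prefix:

```python
def find_my_lock_node(children, session_id):
    """After a reconnect, find the child znode this session created, if any.

    Assumes lock znodes were created with names like
    "<session_id>-lock-<seq>". Because a client keeps the same session ID
    across a reconnect, it can recognize its own earlier create even if
    the create() response was lost to a connection failure. Returns the
    matching child name, or None if the create never took effect.
    """
    prefix = f"{session_id}-lock-"
    mine = [child for child in children if child.startswith(prefix)]
    return mine[0] if mine else None
```

If `find_my_lock_node` returns None, the client knows the partial create failed and can safely retry step 1 without risking a duplicate, orphaned lock znode.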