Distributed Erlang - how does the "prevent overlapping partitions" algorithm work?

cgfeq70w  posted on 2022-12-08  in  Erlang

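As of OTP 25, global will by default prevent overlapping partitions caused by network issues by actively disconnecting from nodes that report that they have lost connections to other nodes. This causes fully connected partitions to form instead of leaving the network in a state with overlapping partitions.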

Fully connected          After fault           Scenario 1            Scenario 2     
     A                        A                     A                     A
   /   \                     / \                   /                       \
  /     \                   /   \                 /                         \
 /       \                 /     \               /                           \
B---------C               B       C             B       C             B       C
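
The default mentioned above is governed by the kernel application parameter prevent_overlapping_partitions. As a minimal sketch, assuming a node that is started with a sys.config file, the setting can be made explicit as shown here. The value is spelled out for illustration only; true is already the default from OTP 25 on.

```erlang
%% sys.config (sketch): make the OTP 25 default explicit.
%% Setting the value to false would switch the feature off.
[
 {kernel, [
   {prevent_overlapping_partitions, true}
 ]}
].
```

The same parameter can also be passed on the command line, for example erl -kernel prevent_overlapping_partitions false.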





%% ----------------------------------------------------------------
%% Prevent Overlapping Partitions Algorithm
%% ========================================
%% 1. When a node loses connection to another node it sends a
%%    {lost_connection, LostConnNode, OtherNode} message to all
%%    other nodes that it knows of.
%% 2. When a lost_connection message is received the receiver
%%    first checks if it has seen this message before. If so, it
%%    just ignores it. If it has not seen it before, it sends the
%%    message to all nodes it knows of. This is done to ensure
%%    that all connected nodes will receive this message. It then
%%    sends a {remove_connection, LostConnRecvNode} message (where
%%    LostConnRecvNode is its own node name) to OtherNode and
%%    clears all information about OtherNode so OtherNode won't be
%%    part of ReceiverNode's cluster anymore. When this information
%%    has been cleared, no lost_connection will be triggered when
%%    a nodedown message for the connection to OtherNode is
%%    received.
%% 3. When a {remove_connection, LostConnRecvNode} message is
%%    received, the receiver node takes down the connection to
%%    LostConnRecvNode and clears its information about
%%    LostConnRecvNode so it is not part of its cluster anymore.
%%    Both nodes will receive a nodedown message due to the
%%    connection being closed, but neither of them will send
%%    lost_connection messages since they have cleared information
%%    about the other node.
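
The three steps in the comment above can be traced with a small pure simulation. This is only a sketch and not the code in global.erl: the module name pop_sim, the functions run/2 and fully_connected/1, and the map-of-lists representation are all hypothetical. It also makes two simplifying assumptions: only one side of the broken link reports the loss, and a node ignores a forwarded lost_connection message that names itself as OtherNode.

```erlang
-module(pop_sim).
-export([run/2, fully_connected/1]).

%% Hypothetical helper: every node knows every other node.
fully_connected(Nodes) ->
    maps:from_list([{N, lists:delete(N, Nodes)} || N <- Nodes]).

%% Simulate Detector losing its connection to Lost.
%% Known maps each node to the list of nodes it considers part of its cluster.
run({Detector, Lost}, Known0) ->
    %% The link itself is gone, so both ends forget each other.
    Known1 = forget(Detector, Lost, forget(Lost, Detector, Known0)),
    Msg = {lost_connection, Detector, Lost},
    %% Step 1: the detector informs every node it still knows of.
    Queue = [{To, Msg} || To <- maps:get(Detector, Known1)],
    %% The detector counts as having seen its own message.
    loop(Queue, Known1, [{Detector, Msg}]).

%% Messages are delivered one at a time, in queue order.
loop([], Known, _Seen) ->
    Known;
loop([{To, {lost_connection, _From, Other} = Msg} | Rest], Known, Seen) ->
    %% Assumption of this sketch: OtherNode ignores messages about itself.
    Ignore = lists:member({To, Msg}, Seen) orelse To =:= Other,
    case Ignore of
        true ->
            loop(Rest, Known, Seen);
        false ->
            %% Step 2: forward to all known nodes, tell OtherNode to drop
            %% us, then clear our own information about OtherNode.
            Forward = [{P, Msg} || P <- maps:get(To, Known)],
            Remove = [{Other, {remove_connection, To}}],
            loop(Rest ++ Forward ++ Remove,
                 forget(To, Other, Known),
                 [{To, Msg} | Seen])
    end;
loop([{To, {remove_connection, From}} | Rest], Known, Seen) ->
    %% Step 3: the receiver drops the connection and forgets the sender.
    loop(Rest, forget(To, From, Known), Seen).

%% Remove Peer from the list of nodes that Node knows of.
forget(Node, Peer, Known) ->
    maps:update_with(Node, fun(Ps) -> lists:delete(Peer, Ps) end, Known).
```

Calling pop_sim:run({b, c}, pop_sim:fully_connected([a, b, c])) yields a map in which a and b still know each other and c knows nobody, which matches Scenario 1 in the diagram. Swapping the tuple to {c, b} yields the Scenario 2 shape, where a stays with c and b is left alone. Under the assumptions of this sketch, which scenario occurs depends on whose lost_connection report is acted on.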
