Hi there,
I am trying to understand how PXC nodes monitor each other and how they handle network delays, unresponsive nodes, and dead nodes. I have read about the evs parameters and would appreciate it if someone could confirm whether my understanding of the following workflow is correct:
Delays
evs.inactive_check_period: This should define how often a node checks its peers. If a node detects a delayed response during this check, does it immediately try to add the peer to a “delayed list”?
Relaying Messages
If a node is unreachable (after peer_timeout), the cluster should enable message relaying, that is, sending messages via other nodes.
Does the node wait for the evs.delayed_margin time before formally adding the problematic node to the delayed list?
Suspect and Dead
evs.suspect_timeout: My understanding is that when this timeout is reached and all nodes vote that the node is inactive, the node is pronounced dead.
evs.inactive_timeout: I assume this should be the “hard limit”. Unlike suspect_timeout, which requires consensus/voting, does inactive_timeout allow a node to mark a peer as DEAD locally, without waiting for full consensus, if the peer simply doesn’t respond at all?
Recovery
evs.delayed_keep_period: If a node that was marked as delayed/dead becomes active again, does the cluster wait for this specific period before removing it from the delayed list?
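To compare the timers mentioned above side by side, they can be read out of the wsrep_provider_options string. Below is a minimal Python sketch, assuming the string uses the usual "key = value;" layout with ISO 8601 durations such as PT15S. The SAMPLE_OPTIONS values and the helper names parse_duration and timing_options are illustrative assumptions, not output from a real cluster.

```python
import re

# Illustrative sample only. In practice this string would come from
# SHOW VARIABLES LIKE 'wsrep_provider_options'; the values are assumptions.
SAMPLE_OPTIONS = (
    "evs.delayed_keep_period = PT30S; evs.delayed_margin = PT1S; "
    "evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; "
    "evs.suspect_timeout = PT5S; gmcast.peer_timeout = PT3S"
)

# Matches simple time-only durations: optional hours, minutes, seconds.
DURATION_RE = re.compile(
    r"PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?"
)


def parse_duration(value):
    """Convert a duration such as PT15S or PT1M30S to seconds."""
    match = DURATION_RE.fullmatch(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def timing_options(options):
    """Return every duration-valued option in the string, in seconds."""
    result = {}
    for item in options.split(";"):
        if "=" not in item:
            continue
        key, value = (part.strip() for part in item.split("=", 1))
        if value.startswith("PT"):
            result[key] = parse_duration(value)
    return result


timers = timing_options(SAMPLE_OPTIONS)
for name in sorted(timers):
    print(f"{name}: {timers[name]}s")
```

This only lists the configured values in seconds; it does not say anything about how the nodes actually use them, which is what the questions above are about.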
Feel free to point me to any related study material about handling network latency.