Network latency, delays, and related factors

Hi there,

I am trying to understand how PXC nodes monitor each other and how they handle network delays, unresponsive nodes, and dead nodes. I have read about the evs parameters and would appreciate it if someone could confirm whether my understanding of the following workflow is correct:

Delays

evs.inactive_check_period: As I understand it, this defines how often a node checks its peers. If a node detects a delayed response during this check, does it immediately try to add the peer to a “delayed list”?

Relaying Messages

If a node is unreachable (after gmcast.peer_timeout expires), the cluster should enable message relaying, i.e. sending messages via other nodes.

Does the node wait for the evs.delayed_margin period before formally adding the problematic node to the delayed list?

Suspect and Dead

evs.suspect_timeout: My understanding is that when all nodes vote on a node’s inactivity and this timeout is reached, the node is pronounced dead.

evs.inactive_timeout: I assume this should be a “hard limit”. Unlike suspect_timeout, which requires consensus/voting, does inactive_timeout allow a node to mark a peer as DEAD locally, without waiting for full consensus, if the peer simply doesn’t respond at all?

Recovery

evs.delayed_keep_period: If a node that was marked as delayed/dead becomes active again, does the cluster wait for this specific period before removing it from the delayed list?

Feel free to point me to any related study material about handling network latency.

Hi @SQLCaesar,
Codership, the company behind the Galera library that PXC uses for communication, was recently purchased by MariaDB, and all of their documentation was moved to MariaDB’s website. Have a look at all of the ‘wsrep’ parameters and their docs: wsrep_provider_options | Galera Cluster | MariaDB Documentation
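If it helps while reading those docs, here is a minimal sketch for pulling the timing options from your question out of a wsrep_provider_options string so you can compare them side by side. It assumes the usual `key = value; key = value` layout with ISO 8601 style durations such as PT5S. The helper names and the SAMPLE string are hypothetical, and the values in SAMPLE are illustrative only, not read from a real cluster. On a real node you would paste in the output of SHOW VARIABLES LIKE 'wsrep_provider_options' instead.

```python
import re

# Matches durations like PT0.5S, PT15S, PT1M or PT1M30S (hours/minutes/seconds only).
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_provider_options(raw):
    """Split a 'key = value; key = value' string into a dict of strings."""
    options = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue  # skip empty trailing pieces
        key, value = item.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def duration_seconds(text):
    """Convert a PT...H...M...S duration into seconds as a float."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"not a supported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def evs_timings(raw):
    """Return every evs.* option whose value looks like a duration, in seconds."""
    return {
        key: duration_seconds(value)
        for key, value in parse_provider_options(raw).items()
        if key.startswith("evs.") and value.startswith("PT")
    }


# Illustrative values only (hypothetical sample, not taken from a real node).
SAMPLE = (
    "evs.delayed_keep_period = PT30S; evs.delayed_margin = PT1S; "
    "evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; "
    "evs.suspect_timeout = PT5S; gmcast.peer_timeout = PT3S"
)

if __name__ == "__main__":
    # Print shortest to longest to see how the timeouts are ordered.
    for name, secs in sorted(evs_timings(SAMPLE).items(), key=lambda kv: kv[1]):
        print(f"{name:30s} {secs:g}s")
```

Sorting by duration makes it easier to see which timer would fire first on your own settings. The sketch only reads the values and says nothing about how Galera acts on them, so the parameter descriptions in the documentation remain the authority on that.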