PXC connection timeout in multi-DC environment

Hi

I have a 5-node PXC cluster: 2 nodes in DC1, 3 nodes in DC2. I use GTM to manage an active-passive cluster setup.
Issue: Initially the app was pointed to Node1 in DC1, which timed out as seen from the other nodes. But Node1 showed wsrep_cluster_size 5 and wsrep_cluster_status Primary. Since the Percona service was not down, the app connections did not fail over to Node2 in DC1; instead the app failed with the error below:

mysql saveOrUpdateAuditTrailEntry: failed with Communications link failure. The last packet successfully received from the server was 2,100,003 milliseconds ago. The last packet sent successfully to the server was 2,100,003 milliseconds ago

Steps I took to resolve it:

  1. I stopped Node1, expecting apps to be routed to Node2, but only reads were happening, not writes, and the app was experiencing slowness.

  2. I then restarted Node2, but the service went down.

  3. I could not route the application to DC2, because other (non-MySQL) applications were not ready for the failover.

  4. To resolve this, I stopped services on all DC2 machines and bootstrapped Node2 in DC1, while all the other nodes in the cluster were down.

  5. Node2 came up, applications were able to read and write to the DB, and there was no application outage.

  6. At this point the cluster was running with one node; I later joined Node1 in DC1 with a clean SST.

  7. Now both nodes in DC1 were up. I then joined Node1 (DC2) with wsrep_cluster_address="gcomm://Node2(DC1)". It has been almost 3 hours and the SST streaming is still not complete.

  8. After step 7 completes, I need to join Node2 and Node3 (DC2) to the cluster.

With the above steps, when a single node is unresponsive the entire cluster is affected and recovery takes a long time. Can you please let me know an optimized way to recover, considering my case?
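For reference, the bootstrap recovery in steps 4-7 was done roughly along these lines (this is a sketch assuming PXC 8.0 on systemd; unit names and paths may differ per install):

```shell
# Step 4: stop mysqld on every node, then pick the most advanced one
cat /var/lib/mysql/grastate.dat      # compare seqno across nodes;
                                     # set safe_to_bootstrap: 1 on the winner

# Bootstrap Node2 (DC1) as a new one-node Primary Component
systemctl start mysql@bootstrap.service

# Steps 6-8: start the remaining nodes one at a time; each joins
# via IST/SST from the running cluster
systemctl start mysql
```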

@shirisha

PXC cluster, 2 nodes in DC1, 3 nodes in DC2.

This kind of topology is dangerous: the main DC (DC1), which the application interacts with, has only 2 data nodes. If DC2 becomes unavailable or goes completely down, the main cluster can be affected by the loss of the majority of members (voting/quorum loss).

Moreover, cross-DC latency affecting a single member can slow down the entire PXC cluster if the network is unstable, because every transaction's write-set must be delivered to all nodes at commit time.

If you are experiencing network instability between the DC1 and DC2 nodes, you can try tuning the following WAN-related options.

evs.send_window
evs.user_send_window
gmcast.segment

https://docs.percona.com/percona-xtradb-cluster/8.0/wsrep-provider-index.html#evsuser_send_window
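For illustration, a possible my.cnf starting point (the window sizes below are only example values to validate in your own environment; gmcast.segment tells Galera which nodes share a network segment, so it should differ between the DCs):

```ini
# nodes in DC1 (example values, not prescriptive)
wsrep_provider_options = "evs.send_window=512; evs.user_send_window=512; gmcast.segment=0"

# nodes in DC2
wsrep_provider_options = "evs.send_window=512; evs.user_send_window=512; gmcast.segment=1"
```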

If you need a separate DR cluster where some replication lag is acceptable, kept mostly for backup/recovery purposes, you can bridge the two clusters via async replication. That way, workload spikes or flow-control scenarios on the DR side wouldn't impact the main cluster.

Regarding your issue, did you capture any details such as SHOW FULL PROCESSLIST or SHOW ENGINE INNODB STATUS around the issue period? Can you share the database logs for a quick look?

51_error.log (1.3 MB)

I will tune evs.send_window and evs.user_send_window. But in my case, can you please check the logs and confirm whether the problem is due to network instability (an initial check by the network team reported no packet loss or latency) or due to a bottleneck in the cluster setup or functionality?

Attached is the error log of the Node2 (DC1) machine.
I don't have logs on any machine in the cluster for the period 2026-02-23T18:08 to 2026-02-23T19:25, after which I started the recovery steps.
The only exception is the message below on Node2 (DC1), and that connection was re-established within a minute:
2026-02-23T18:08:13.136091Z 0 [Note] [MY-000000] [Galera] (50340747-ad46, 'ssl://0.0.0.0:4567') connection to peer d5566c98-ae1a with addr ssl://x.x.x.165:4567 timed out, no messages seen in PT3S, socket stats: rtt: 58826 rttvar: 4 rto: 2072000 lost: 1 last_data_recv: 3441 cwnd: 1 last_queued_since: 500007347 last_delivered_since: 3441238066 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 segment: 1 messages: 0 (gmcast.peer_timeout)

Two things I want to confirm:

  1. In case the root cause was a false eviction of the node (since the cluster size still showed 5 on all nodes) because gmcast.peer_timeout = PT3S is too low: why didn't the remaining nodes continue processing queries, instead of the app showing errors?
  2. In case a node is truly unresponsive, with 4 out of 5 nodes still up, the cluster should have no issues with reads/writes, correct? Or is it entering split-brain and halting all reads and writes?

pxc5_errorlog_feb23.docx (16.1 KB)

@shirisha

Thanks for sharing the logs.

The message pattern "timed out, no messages seen in PT3S" denotes a temporary network issue. I don't see any further timeout messages, or any between the other data nodes.

Node2(dc1) - Error.log

2026-02-23T18:08:13.136091Z 0 [Note] [MY-000000] [Galera] (50340747-ad46, 'ssl://0.0.0.0:4567') connection to peer d5566c98-ae1a with addr ssl://23.49.4.165:4567 timed out, no messages seen in PT3S, socket stats: rtt: 58826 rttvar: 4 rto: 2072000 lost: 1 last_data_recv: 3441 cwnd: 1 last_queued_since: 500007347 last_delivered_since: 3441238066 send_queue_length: 0 send_queue_bytes: 0 segment: 0 messages: 0 segment: 1 messages: 0 (gmcast.peer_timeout)


2026-02-23T19:25:18.201968Z 0 [Note] [MY-000000] [Galera] forgetting bebb5f4a-9462 (ssl://23.54.31.179:4567)

Node1(dc2) - Error.log

2026-02-23T19:25:17.143461Z 0 [Note] [MY-000000] [Galera] (d5566c98-ae1a, 'ssl://0.0.0.0:4567') connection to peer bebb5f4a-9462 with addr ssl://23.54.31.179:4567 timed out, no messages seen in PT3S, socket stats: rtt: 58958 rttvar: 56 rto: 259000 lost: 0 last_data_recv: 3129 cwnd: 808 last_queued_since: 13706 last_delivered_since: 3129838673 send_queue_length: 1 send_queue_bytes: 656 segment: 0 messages: 1 segment: 1 messages: 0 (gmcast.peer_timeout)

2026-02-23T19:25:18.172635Z 0 [Note] [MY-000000] [Galera] forgetting bebb5f4a-9462 (ssl://23.54.31.179:4567)

But all members were having issues connecting to 23.54.31.179. I believe this is DC1 [Node1]?

2026-02-23T19:25:18.201968Z 0 [Note] [MY-000000] [Galera] forgetting bebb5f4a-9462 (ssl://23.54.31.179:4567)

  1. In case the root cause was a false eviction of the node (since the cluster size still showed 5 on all nodes) because gmcast.peer_timeout = PT3S is too low: why didn't the remaining nodes continue processing queries, instead of the app showing errors?

From the comment below, it appears reads were served fine and the issue happened with writes only. The thing is, even when all nodes are connected and in sync, the slowest member, or any stuck transaction, can impact the performance of the whole cluster, and in some situations trigger flow control or halt writes completely. The WAN can delay this even more. It is possible you hit such a scenario.

I stopped Node1, expecting apps to be routed to Node2, but only reads were happening, not writes, and the app was experiencing slowness.

  2. In case a node is truly unresponsive, with 4 out of 5 nodes still up, the cluster should have no issues with reads/writes, correct? Or is it entering split-brain and halting all reads and writes?

Yes, even if one member is down or having issues, the other nodes should still serve read operations. I believe the problem was that the impacted node was not properly evicted, or caused some abnormal behaviour.

Without detailed historical information about the workload/processlist and flow-control stats, it would be a bit harder to pinpoint the exact culprit.

Are the OS resources (disk, IO) fine around the issue period on all member nodes?

Below are the changes made to address the issue (these parameters were previously at their default values):

gmcast.peer_timeout=PT90S;

evs.keepalive_period=PT2S;

evs.suspect_timeout=PT60S;

evs.inactive_timeout=PT90S;

gcs.fc_limit=256;

gcache.size=1G
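All of the above are Galera provider options, so they go through a single wsrep_provider_options string in my.cnf; for clarity, combined they would look like:

```ini
wsrep_provider_options = "gmcast.peer_timeout=PT90S; evs.keepalive_period=PT2S; evs.suspect_timeout=PT60S; evs.inactive_timeout=PT90S; gcs.fc_limit=256; gcache.size=1G"
```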

While debugging the issue, I noticed that the wsrep_flow_control_interval_low and wsrep_flow_control_interval_high values are the same. I did not set them to 572; I'm not sure whether changing the other values affected this.

  1. What caused these low and high values to change?
  2. I understand having both values the same is a concern; what is the recommended production value?

mysql> show status like '%flow%';
+-------------------------------------+--------------+
| Variable_name                       | Value        |
+-------------------------------------+--------------+
| Ssl_session_cache_overflows         | 0            |
| Table_open_cache_overflows          | 0            |
| Table_open_cache_triggers_overflows | 0            |
| wsrep_flow_control_paused_ns        | 0            |
| wsrep_flow_control_paused           | 0            |
| wsrep_flow_control_sent             | 0            |
| wsrep_flow_control_recv             | 0            |
| wsrep_flow_control_active           | false        |
| wsrep_flow_control_requested        | false        |
| wsrep_flow_control_interval         | [ 572, 572 ] |
| wsrep_flow_control_interval_low     | 572          |
| wsrep_flow_control_interval_high    | 572          |
| wsrep_flow_control_status           | OFF          |
+-------------------------------------+--------------+

51error.log (16.2 KB)

23229.error.log (16.0 KB)

229error.log (21.2 KB)

179error.log (52.1 KB)

165error.log (842.0 KB)

I have attached the latest logs from all 5 nodes in the PXC cluster. Can you please check and let me know why the cluster is temporarily seen in NON-PRIMARY state with node partitioning? Please also let me know which parameters I can fine-tune in my cluster to better tolerate network glitches and latency.

@shirisha

Below were the changes made to address the issue (earlier these parameters were using default values)

gmcast.peer_timeout=PT90S;
evs.keepalive_period=PT2S;
evs.suspect_timeout=PT60S;
evs.inactive_timeout=PT90S;

Thanks for the update.

These values seem much higher than the defaults. Basically, the ordering below should be respected when changing any of these values. Increasing the check/suspect timeouts might help where the network is slow, but it also delays eviction of a genuinely problematic node. Is the issue resolved after the above changes?

evs.keepalive_period     <=    evs.inactive_check_period
evs.inactive_check_period    <=    evs.suspect_timeout
evs.suspect_timeout     <=    evs.inactive_timeout
evs.inactive_timeout     <=    evs.consensus_timeout
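For instance, a set that respects this ordering (illustrative values only, not a recommendation for your workload):

```ini
wsrep_provider_options = "evs.keepalive_period=PT2S; evs.inactive_check_period=PT4S; evs.suspect_timeout=PT30S; evs.inactive_timeout=PT60S; evs.consensus_timeout=PT60S"
```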

Moreover, we suggested checking the following parameters if WAN network instability is noticed.

If you are experiencing network instability between the DC1 and DC2 nodes, you can try tuning the following WAN-related options.

evs.send_window
evs.user_send_window
gmcast.segment

| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_active | false |

I don't see any signs of flow control being emitted in the above stats.

| wsrep_flow_control_interval | [ 572, 572 ] |
| wsrep_flow_control_interval_low | 572 |
| wsrep_flow_control_interval_high | 572 |

The above status values adjust dynamically based on the FC settings. If you notice a large wsrep_local_recv_queue, you may increase the FC limit (gcs.fc_limit) further.

E.g,
set global wsrep_provider_options="gcs.fc_limit=700; gcs.fc_master_slave=YES; gcs.fc_factor=1.0";

Note - gcs.fc_master_slave was deprecated as of Galera 4.10 in favour of gcs.fc_single_primary. So, if running PXC 8.0 (Galera 4), you should use gcs.fc_single_primary instead of gcs.fc_master_slave.

Also, when gcs.fc_master_slave/gcs.fc_single_primary is disabled, the queue limit is automatically increased based on the number of nodes.
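To judge whether that limit is actually being approached, you can sample the receive-queue status on each node, for example:

```sql
-- run periodically on every node; a persistently large queue marks
-- the node that cannot apply write-sets as fast as it receives them
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
-- wsrep_local_recv_queue_avg well above 0 over time indicates the
-- node is lagging and may eventually trigger flow control
```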

Are you writing to all nodes, or only to a single node at a time?

I see that within a very short span the node goes into NON-PRIMARY state, or gets partitioned; however, I am not sure what triggered it. Please check around the mentioned time slot for any surge in workload, OS resource usage, or network fluctuations. Your OS and network teams might be better able to help with that.

2026-03-03T04:37:49.804064Z 0 [Note] [MY-000000] [Galera] evs::proto(294239f8-9cd8, GATHER, view_id(REG,1ccf0386-8991,9)) install timer expired
evs::proto(evs::proto(294239f8-9cd8, GATHER, view_id(REG,1ccf0386-8991,9)), GATHER) {
current_view=Current view of cluster as seen by this node
view (view_id(REG,1ccf0386-8991,9)
memb {
	1ccf0386-8991,0
	294239f8-9cd8,0
	6ca59fc2-b846,0
	bb7df359-94ef,0
	fe15739e-ab4d,0
	}
joined {
	}
left {
	}
partitioned {
	}
),

2026-03-03T04:37:49.806602Z 0 [Note] [MY-000000] [Galera] no install message received
2026-03-03T04:37:49.806881Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,1ccf0386-8991,9)
memb {
	1ccf0386-8991,1
	294239f8-9cd8,1
	}
joined {
	}
left {
	}
partitioned {
	6ca59fc2-b846,0
	bb7df359-94ef,0
	fe15739e-ab4d,0
	}
)

It would be better to test any such changes in a non-production environment first, and apply them in production only once you get the desired result.