I have a 5-node PXC cluster (2 nodes in DC1, 3 nodes in DC2). HAProxy is configured to point to one primary write node in the local DC, and the backend config lists the other nodes (local DC first, then remote DC).
Issue scenario: DC2 was active, DC1 passive. Apps were connected to Node1 in DC2. Max_used_connections reached the max_connections limit and all writes/reads stalled. The spike in connection count happened within 10 minutes, almost doubling the normal connection count.
I then stopped the Percona service on Node1 in DC2, expecting connections to be routed to Node2 in DC2 since quorum (3 out of the remaining 4 nodes) was satisfied, but this didn't help; writes still were not happening.
I then stopped the Percona service on two nodes in the passive DC (DC1), which brought the cluster back to normal.
At this point the first node in DC2 (active) was stopped, along with two nodes in DC1 (passive).
I understand the PXC considerations for synchronous replication. Below are a few questions that will help me investigate my case, on which I need your insight.
Reason for connection spike: Assuming the earlier connections to the DB were stalled and the apps didn't receive any response, so the clients started reconnecting, which doubled the connection count and blocked all reads/writes: how can I check this from the DB? Is there a way I can check from the DB or Grafana where I can see if my connections to the DB were stalled, left idle, or waiting?
I have log_error_verbosity=3, but I don't have any information on whether the previous connections were stalled or what caused the connection spike. I saw only "Too many connections" recorded in the error log.
Quorum was not satisfied, hence writes were blocked? When one node is down in the cluster, shouldn't the remaining 4 nodes have a majority and continue cluster functionality? Quorum should be maintained at the cluster level, not the DC level, correct?
Will having an arbitrator node solve the issue? 2 nodes in DC1, 2 nodes in DC2, 1 node in DC3: will this architecture help? When one node in the primary site is down, we again end up with 4 nodes; will that help satisfy quorum?
Write consistency was missing? Is it that the remaining 4 nodes in the cluster were not aware of the write-consistent state of the first node, which was stopped due to high connections, and hence the cluster blocked all reads and writes until a manual bootstrap of the node?
I only stopped the service on the passive nodes; I did not bootstrap any node to recover the cluster.
I have seen that a graceful shutdown of one node does not impact reads/writes in my cluster. But when a sudden event (e.g., a connection spike) or weird behavior puts a machine in a bad state, my cluster is not able to recover even with 4 healthy nodes. Is this expected behavior? Do I need any tuning?
Do you recommend running PXC in production with an active-active-active setup over WAN (2 nodes DC1, 3 nodes DC2, 2 nodes DC3)?
Reason for connection spike: Assuming the earlier connections to the DB were stalled and the apps didn't receive any response, so the clients started reconnecting, which doubled the connection count and blocked all reads/writes: how can I check this from the DB? Is there a way I can check from the DB or Grafana where I can see if my connections to the DB were stalled, left idle, or waiting?
I have log_error_verbosity=3, but I don't have any information on whether the previous connections were stalled or what caused the connection spike. I saw only "Too many connections" recorded in the error log.
For current monitoring or validation, you can check using the commands below.
mysql> show full processlist;
or
mysql> select * from information_schema.processlist;
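To distinguish idle connections from stalled ones, it can also help to aggregate the processlist by COMMAND and STATE (a sketch against information_schema; a COMMAND of 'Sleep' is an idle connection holding a slot, while a large TIME together with a non-empty STATE usually indicates a waiting or stalled query):

```sql
-- Summarize what all current sessions are doing.
SELECT COMMAND, STATE,
       COUNT(*)  AS sessions,
       MAX(TIME) AS max_seconds
FROM information_schema.PROCESSLIST
GROUP BY COMMAND, STATE
ORDER BY sessions DESC;
```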
For more granular information (transactions, semaphores, redo log, etc.), you can check the InnoDB status.
mysql> show engine innodb status\G
For historical insight into connections, it’s better to have some monitoring enabled, like Grafana or PMM.
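For a quick headroom check against the connection limit (and the high-water mark since the last restart), you can compare the counters below; if Threads_connected approaches max_connections while Threads_running is also high, queries are likely backing up rather than sitting idle:

```sql
-- Configured limit vs. current and peak usage.
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Threads_connected', 'Threads_running', 'Max_used_connections');
```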
Quorum was not satisfied, hence writes were blocked? When one node is down in the cluster, shouldn't the remaining 4 nodes have a majority and continue cluster functionality? Quorum should be maintained at the cluster level, not the DC level, correct?
Will having an arbitrator node solve the issue? 2 nodes in DC1, 2 nodes in DC2, 1 node in DC3: will this architecture help? When one node in the primary site is down, we again end up with 4 nodes; will that help satisfy quorum?
This also depends on how you configured your DR. Is the other side (DR) connected to the DC via asynchronous replication, or is it part of the Galera replication group directly?
Considering 3 nodes in a DC, if one node is down, the primary component should ideally still be available, as the majority of nodes (2 members) remain; however, there are other scenarios, like flow control, which can stall writes on the entire cluster when triggered. Please note that PXC/Galera uses virtually synchronous replication, so any node that is struggling or already has performance problems can degrade or impact the whole cluster's performance.
You can monitor, based on the status variables, which node triggered flow control [wsrep_flow_control_sent] and which other nodes received/were impacted by FC [wsrep_flow_control_recv].
node1 [localhost:29334] {root} ((none)) > show global status like 'wsrep_flow_control_%';
+----------------------------------+--------------+
| Variable_name | Value |
+----------------------------------+--------------+
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_active | false |
| wsrep_flow_control_requested | false |
| wsrep_flow_control_interval | [ 173, 173 ] |
| wsrep_flow_control_interval_low | 173 |
| wsrep_flow_control_interval_high | 173 |
| wsrep_flow_control_status | OFF |
+----------------------------------+--------------+
10 rows in set (0.00 sec)
You can read more about flow control and related settings here -
Write consistency was missing? Is it that the remaining 4 nodes in the cluster were not aware of the write-consistent state of the first node, which was stopped due to high connections, and hence the cluster blocked all reads and writes until a manual bootstrap of the node?
I only stopped the service on the passive nodes; I did not bootstrap any node to recover the cluster.
Well, to know exactly what happened, you could share the MySQL error logs from all 5 nodes for the period during which the issue occurred. It was likely due to heavy connections or writes, some flow-control activity that impacted the whole cluster, or a performance issue on the specific node.
I have seen that a graceful shutdown of one node does not impact reads/writes in my cluster. But when a sudden event (e.g., a connection spike) or weird behavior puts a machine in a bad state, my cluster is not able to recover even with 4 healthy nodes. Is this expected behavior? Do I need any tuning?
Sudden write bursts can cause FC issues if they exceed the receive queue limit. Moreover, if max_connections is exhausted on the database, then even when one node is down, the others might not be able to clear the connections, or the limit may already have been reached there as well.
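If flow control turns out to be the trigger, the receive-queue threshold can be adjusted at runtime through wsrep_provider_options (the value below is purely illustrative; the right limit depends on your workload and hardware):

```sql
-- Raise the receive-queue length at which a node emits flow control.
SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=500';
-- Confirm the active provider options:
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
```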
What exactly were those connections doing? Were they mostly writing, or also reading?
Any unoptimized/slow query can also impact performance. There could also be contention (semaphores) or locking overhead.
To check real-time locking/waiting details, you can use the queries below.
SELECT
waiting_trx_id,
waiting_pid,
waiting_query,
blocking_trx_id,
blocking_pid,
blocking_query
FROM sys.innodb_lock_waits;
SELECT THREAD_ID FROM performance_schema.threads WHERE PROCESSLIST_ID = <blocking_pid>;
SELECT THREAD_ID, SQL_TEXT FROM performance_schema.events_statements_current
WHERE THREAD_ID = <THREAD_ID>;
Do you recommend running PXC in production with an active-active-active setup over WAN (2 nodes DC1, 3 nodes DC2, 2 nodes DC3)?
We don't encourage running multi-writer setups on PXC nodes; at any given time, only one node should be taking writes.
Writing to multiple nodes can cause data inconsistencies and certification conflicts/deadlocks.
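One way to enforce a single writer is at the proxy layer, as in your HAProxy setup: only one server is active in the write backend and the rest are marked backup, so failover promotes one node at a time. A minimal sketch (IPs and hostnames are placeholders; the port-9200 health check assumes the usual clustercheck script served via xinetd):

```
backend pxc_writes
    mode tcp
    option httpchk
    # node1 takes all writes; backups are promoted one at a time on failure
    server node1 10.0.1.11:3306 check port 9200
    server node2 10.0.1.12:3306 check port 9200 backup
    server node3 10.0.1.13:3306 check port 9200 backup
```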