Galera/PXC Cluster Permanent Stall: Failed DDL Voting Storm + Pending TOI Deadlock

Description:

A burst of ~50 failed ALTER TABLE … DROP PARTITION statements (Error 1507 - “Error in list of
partitions to DROP”) issued sequentially across ~50 databases by a single application thread caused all three nodes of a PXC 8.0.42-33.1 cluster to enter consecutive Galera voting rounds.

Combined with concurrent client DML load, this overwhelmed the receiver appliers. A subsequent legitimate TOI DDL (ALTER TABLE … REORGANIZE PARTITION) on the writer then got stuck indefinitely in wsrep: preparing for TO isolation, creating a flow-control deadlock that did not self-recover.

The cluster remained in this stalled state for 26+ hours until manual recovery via pod deletion.

During the stall, all client writes were blocked, MySQL on the writer became completely unresponsive (it could not even answer SELECT 1), and the receivers were frozen at the same seqno (wsrep_last_committed) for the entire duration.

The same failed DDL pattern on standalone MySQL is harmless: the application catches Error 1507 and proceeds. On Galera, each failed DDL triggered cluster-wide voting before the application's error handler even received the error, so the cluster paid the cost regardless of how the application handled it.

Steps to Reproduce:

The bug pattern (Error 1507 + voting on all nodes) reproduces in 1 second on a 3-node cluster:

  1. Create N databases, each with a partitioned table containing a catchall partition:
CREATE DATABASE repro_db_001;
USE repro_db_001;
CREATE TABLE t (
id BIGINT NOT NULL AUTO_INCREMENT,
ts DATETIME NOT NULL,
PRIMARY KEY (id, ts)
) PARTITION BY RANGE (TO_DAYS(ts)) (
PARTITION p_min      VALUES LESS THAN (TO_DAYS('2025-01-01')),
PARTITION p_2025_01  VALUES LESS THAN (TO_DAYS('2025-02-01')),
PARTITION p_2025_02  VALUES LESS THAN (TO_DAYS('2025-03-01')),
PARTITION old_max    VALUES LESS THAN MAXVALUE
);
-- Repeat for repro_db_002 … repro_db_050

  2. From a single MySQL session against the writer, issue the buggy pattern across all N databases sequentially:
USE repro_db_001;
ALTER TABLE t REORGANIZE PARTITION old_max INTO (
PARTITION p_2025_03 VALUES LESS THAN (TO_DAYS('2025-04-01')),
PARTITION new_max   VALUES LESS THAN MAXVALUE
);
ALTER TABLE t DROP PARTITION old_max;   -- FAILS with Error 1507 (old_max already gone)

USE repro_db_002;
ALTER TABLE t REORGANIZE PARTITION old_max INTO (…);
ALTER TABLE t DROP PARTITION old_max;   -- FAILS

-- … repeat for all 50 databases
  3. Observe the mysqld error log on any node — each failed DROP triggers a 3-node voting round.
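
Each failed DROP also consumes a global seqno even though the statement returns an error to the client. A quick way to confirm this during the repro (a standard status check, shown for completeness) is to watch the committed seqno keep advancing on every node through the failed statements:

-- run on any node before and after step 2; the counter keeps climbing
-- through the DROPs that fail with Error 1507
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';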

Conditions required to escalate from voting storm into full deadlock (we observed all three in
production but could only reliably reproduce conditions 1 and 2 in a controlled test):

  • Condition 1: The failed-DDL pattern above (~50 in rapid succession).
  • Condition 2: Tables with substantial data (so each REORGANIZE on the receiver is non-trivial work).
    Production tables had GBs of data; empty test tables let receivers keep up trivially (see the seeding sketch below).
  • Condition 3: Concurrent client DML load competing for receiver applier resources.

With all three present, the deadlock is reliably triggered by the next legitimate TOI DDL issued after
the storm.
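
To approximate condition 2 in a controlled test, the repro tables need enough rows that each REORGANIZE costs the receivers real apply time. A rough seeding sketch (the row counts and the self-join doubling trick are arbitrary choices of ours, not part of the original workload):

USE repro_db_001;
-- seed a handful of rows inside the defined ranges, then keep doubling
INSERT INTO t (ts) VALUES ('2025-01-15'), ('2025-02-15'), ('2025-03-15');
-- repeat the next statement ~20 times to reach a few million rows
INSERT INTO t (ts) SELECT ts FROM t;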

Version:

  • Server: 8.0.42-33.1 Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e
  • WSREP: 26.1.4.3
  • Operator: Percona Operator for PXC (Kubernetes deployment)
  • OS: Linux x86_64

Relevant Galera provider options:

repl.commit_order = 3 (strict commit ordering)
cert.optimistic_pa = no (conservative parallel applying)
gcs.fc_limit = 100
gcs.fc_factor = 1.0
gcache.size = 4G
gcache.recover = yes
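
(These are all passed to the Galera library through wsrep_provider_options; the live values on a node can be confirmed with the standard variable lookup below.)

SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';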

Server config:

innodb_buffer_pool_size = 8G
max_connections = 1000
wsrep_slave_threads = 8
wsrep_sst_method = xtrabackup-v2

Logs:

Failed DDL voting events (50 in 6.8 seconds)

2026-04-23T10:31:08.108514Z 15 [Warning] [MY-000000] [WSREP] Event 1 Query apply failed: 1, seqno 1453156
2026-04-23T10:31:08.262998Z 11 [Warning] [MY-000000] [WSREP] Event 1 Query apply failed: 1, seqno 1453159
2026-04-23T10:31:08.377533Z 15 [Warning] [MY-000000] [WSREP] Event 1 Query apply failed: 1, seqno 1453162
… (continuing every ~130ms)
2026-04-23T10:31:14.890197Z 2 [Warning] [MY-000000] [WSREP] Event 1 Query apply failed: 1, seqno 1453343

Single voting round (representative example)

2026-04-23T10:31:13.110814Z 15 [Warning] [MY-000000] [WSREP] Event 1 Query apply failed: 1, seqno 1453302
2026-04-23T10:31:13.112502Z 0 [Note] [MY-000000] [Galera] Member 0(node-2) initiates vote on :1453302,94ebcd795f69ea70: Error in list of partitions to DROP, Error_code: 1507;
2026-04-23T10:31:13.112627Z 0 [Note] [MY-000000] [Galera] Recomputed vote based on error codes: 1507. New vote 885fc42092da4791 will be used for further steps. Old Vote: 94ebcd795f69ea70
2026-04-23T10:31:13.112650Z 0 [Note] [MY-000000] [Galera] Member 2(node-1) initiates vote on :1453302,94ebcd795f69ea70: Error in list of partitions to DROP, Error_code: 1507;
2026-04-23T10:31:13.112745Z 0 [Note] [MY-000000] [Galera] Votes over :1453302:
   885fc42092da4791: 2/3
Winner: 885fc42092da4791

Pending TOI DDL stuck for 27,517 seconds (from INFORMATION_SCHEMA.PROCESSLIST)

Id        Time     State                              Info
2264721   27517s   wsrep: preparing for TO isolation  ALTER TABLE other_table REORGANIZE PARTITION om_max INTO (…)
2465436   24350s   wsrep: preparing for TO isolation  TRUNCATE TABLE foo /* … */
2465437   24350s   wsrep: preparing for TO isolation  TRUNCATE TABLE foo /* … */
2465438   24350s   wsrep: preparing for TO isolation  TRUNCATE TABLE foo /* … */
… (348 client connections in 'wsrep: replicating and certifying write set' state)
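
For reference, an excerpt like the one above can be pulled with a query along these lines (a sketch; the exact columns and trimming are our choice):

SELECT id, time, state, LEFT(info, 80) AS info
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE state LIKE 'wsrep%'
ORDER BY time DESC;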


SHOW STATUS LIKE 'wsrep_%' during the stall

┌────────────────────────────┬────────────────┬──────────────────┬──────────────────┐
│          Variable          │ writer (pxc-0) │ receiver (pxc-1) │ receiver (pxc-2) │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_cluster_status       │ Primary        │ Primary          │ Primary          │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_cluster_size         │ 3              │ 3                │ 3                │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_local_state_comment  │ Synced         │ Synced           │ Synced           │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_last_committed       │ 1453533        │ 1453350          │ 1453350          │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_replicated           │ 1,453,505      │ 0                │ 0                │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_received             │ 27,302         │ 1,480,747        │ 1,480,741        │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_flow_control_paused  │ 0.33 → 0.62    │ 0.33 → 0.62      │ 0.33 → 0.62      │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_flow_control_sent    │ 0              │ 57               │ 31               │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_flow_control_recv    │ 88             │ 88               │ 88               │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_local_recv_queue_avg │ 0.003          │ 0.26             │ 0.81             │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_apply_window         │ 2.01           │ 1.58             │ 1.65             │
├────────────────────────────┼────────────────┼──────────────────┼──────────────────┤
│ wsrep_cert_deps_distance   │ 1.99           │ 1.99             │ 1.99             │
└────────────────────────────┴────────────────┴──────────────────┴──────────────────┘

wsrep_flow_control_paused is a cumulative fraction since the last metric reset, so its climb from 0.33 to 0.62 over ~26 hours is consistent with the cluster being effectively 100% paused from the moment the stall began. The receivers were stuck only 7 seqnos past the last failed DROP (seqno 1453343 → frozen at 1453350) for the entire 26-hour duration.
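
Because the headline counter is cumulative, an interval measurement makes the "currently 100% paused" condition explicit. A sketch using the nanosecond counter (sample twice, diff, divide by the wall-clock interval):

-- sample on any node, wait a fixed interval, sample again;
-- (second_ns - first_ns) / (interval_seconds * 1e9) ≈ 1.0 while fully paused
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused_ns';
-- ... wait e.g. 60 seconds ...
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused_ns';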

Expected Result:

The cluster should not stall permanently. A temporary stall under flow control may be acceptable, but the cluster should heal on its own within a bounded time.

Actual Result:

  • Each failed DDL triggered a 3-node voting round consuming a global seqno and ~100-150ms of cluster
    wall time
  • 50 voting rounds in 6.8 seconds overwhelmed receiver applier throughput
  • A subsequent legitimate TOI DDL got stuck in wsrep: preparing for TO isolation indefinitely (27,517+
    seconds observed)
  • All client writes on the writer accumulated in wsrep: replicating and certifying write set state
    behind the pending TOI (repl.commit_order = 3 enforces commit ordering)
  • Receivers sent flow control to writer; writer could not release pending TOI until receivers caught up;
    receivers had no new writesets to process because writer was blocked → permanent deadlock
  • After ~26 hours, MySQL on the writer became completely unresponsive (could not answer SELECT 1)
  • No self-recovery occurred. Manual recovery required deleting all 3 PXC pods to force a bootstrap from
    the most-advanced node’s PVC

Additional Information:

Why this matters

The same code path on standalone MySQL is completely benign: the application catches Error 1507 and continues, with zero impact on other connections or future operations. The pattern of “issue a DDL, catch the error if it fails, move on” is common in MySQL-targeted application code.

On Galera/PXC, the same code path can stall the entire cluster — but only when the failed DDL pattern coincides with sufficient receiver load (table data + concurrent DMLs). This makes the bug latent: it can lurk in production for months and only surface when conditions converge. In our case, the application code had been running this pattern across our cluster for 3+ months before the deadlock manifested.

Application-side workaround we deployed

Check partition existence before issuing the DROP DDL, so the failed DDL never enters Galera
replication:

Before

db.execute("ALTER TABLE t DROP PARTITION old_max")  # may fail with 1507, app catches it

After

if partition_exists(db, "t", "old_max"):
    db.execute("ALTER TABLE t DROP PARTITION old_max")
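
For reference, the check behind the partition_exists() helper boils down to a single information_schema lookup (a sketch; the table and partition names are the ones used above):

SELECT COUNT(*) > 0 AS partition_is_present
FROM information_schema.PARTITIONS
WHERE table_schema = DATABASE()
  AND table_name = 't'
  AND partition_name = 'old_max';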

This eliminates the voting events entirely. We also audited all “catch DDL error and continue” patterns in our codebase, since these are silent on standalone MySQL but trigger cluster-wide consensus on Galera.

Recovery procedure that worked

  1. Delete all 3 PXC pods (operator-managed cluster)
  2. Operator ran wsrep_recover on each PVC to determine local seqno via InnoDB redo log replay
  3. Operator bootstrapped from the most-advanced node (the pre-stall writer, seqno 1453533 — vs receivers at 1453350)
  4. Other two nodes joined via SST (xtrabackup-v2, ~25 GiB each at ~180 MiB/s)
  5. Total recovery time: ~20 minutes, no committed data lost (committed-but-not-applied writesets on the writer were preserved because we bootstrapped from its PVC)

Questions

  1. Is there anything we are doing wrong, or something we can do to avoid such issues in the future?
  2. Are there existing bug reports tracking this scenario? Happy to share additional logs (full
    mysqld.log, gcache state, anything else useful).