Hi all,
We’re seeing a sustained pattern of InnoDB B-tree and page-invariant assertion failures across two independent 3-node Percona XtraDB Cluster deployments running the same build, and would appreciate any insight — particularly whether the signatures map to a known Jira ticket and whether a newer 8.4 patch (8.4.7-7 or 8.4.8-8) is likely to address them.
Originally we suspected a single bug on the wsrep applier path, but a careful re-read of the live error logs shows multiple distinct signatures across applier, client-SQL, and background purge paths. We’re posting in case anyone recognises the family.
Environment
- Percona XtraDB Cluster: 8.4.6-6.1 (Release rel6, Revision 9ca703c)
- WSREP version: 26.1.4.3
- Galera provider: 4.23 (cb05b32), `libgalera_smm.so` BuildID[sha1]: 74d46472b3f75773c0f80c776fe8c9a2c5bc589a (identical across both clusters)
- OS: Debian 12, kernel 6.1.0-31-amd64 / 6.1.0-40-amd64
- Topology: Two independent 3-node clusters (“BI” and “Cozi”), each multi-master. Versions confirmed consistent across all six nodes.
- Workload: Moderate OLTP. Live samples per writer: ~270 Galera write sets/sec, ~850–1,200 client Questions/sec; ~1.5k Questions/sec summed across nodes.
Key relevant settings (verified at runtime on all six nodes):
cert.optimistic_pa = YES
wsrep_slave_threads = 24
innodb_adaptive_hash_index = ON
Symptom summary
Live error logs (no rotated history available) show 23 explicit [MY-013183] assertion failures plus 2 non-assertion "mysqld got signal" blocks. The 23 assertion failures are distributed as follows:
| Cluster / node | Assertions | Applier path | Client SQL | Purge / bg | Other |
|---|---|---|---|---|---|
| BI 172.35.0.161 | 3 | 2 | 0 | 1 | |
| BI 172.35.0.162 | 4 | 1 | 2 | 0 | 1 (semaphore-wait hang) |
| BI 172.35.0.163 | 7 | 4 | 2 | 1 | |
| Cozi 172.31.0.101 | 0 | 0 | 0 | 0 | |
| Cozi 172.31.0.102 | 7 | 6 | 1 | 0 | |
| Cozi 172.31.0.103 | 2 | 1 | 1 | 0 | |
| Totals | 23 | 14 | 6 | 2 | 1 |
Plus 2 non-assertion signal blocks on BI: one Galera/client commit abort, one shutdown SIGSEGV.
Note: one Cozi node (172.31.0.101) has seen zero crashes in the current log window despite identical build and workload class. Per-node asymmetry is consistent with race-based bugs rather than data corruption on shared content (cluster replicates everything, so a shared on-disk corruption would not be node-local).
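For anyone who wants to produce the same kind of tally from their own logs, here is a minimal Python sketch that counts assertion sites per source file and line. It assumes the assertion header looks like the excerpts in this post (`[MY-013183] ... Assertion failure: <file>:<line>`); the function name `count_assertion_sites` is ours, not part of any Percona tooling.

```python
import re
from collections import Counter

# Matches the assertion header as it appears in the excerpts in this post, e.g.
#   [MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:
# Assumption: other builds may format this line differently.
ASSERT_RE = re.compile(r"\[MY-013183\].*?Assertion failure: ([\w.]+):(\d+)")


def count_assertion_sites(log_text: str) -> Counter:
    """Count MY-013183 assertion failures per (source file, line number)."""
    sites = Counter()
    for line in log_text.splitlines():
        match = ASSERT_RE.search(line)
        if match:
            sites[(match.group(1), int(match.group(2)))] += 1
    return sites
```

Running it over each node's error log text and comparing the resulting counters is a quick way to see whether one assertion site dominates or whether the failures are spread across several.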
Representative stacks (sanitised)
(a) Applier path — BI 172.35.0.163, 2026-05-09
[MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:
btr_page_get_prev(get_block->frame, mtr) == page_get_page_no(page)
#4 btr_cur_latch_leaves
#5 btr_cur_search_to_nth_level
#7 row_ins_clust_index_entry_low
#8 row_ins_clust_index_entry
#9 row_ins_step
#11 row_insert_for_mysql
#12 ha_innobase::write_row
#13 handler::ha_write_row
#14 Write_rows_log_event::write_row
#15 Write_rows_log_event::do_exec_row
#17 Rows_log_event::do_apply_event
#19 wsrep_apply_events
#21 Wsrep_applier_service::apply_write_set
#31 start_wsrep_THD
(b) Client SQL path — Cozi 172.31.0.102, 2026-05-01
[MY-013183] [InnoDB] Assertion failure: btr0pcur.cc:383:
cur_page == prev_of_next
#4 btr_pcur_t::move_to_next_page
#5 btr_pcur_t::move_to_next
#6 row_search_mvcc
#7 ha_innobase::general_fetch
#8 handler::ha_index_next
#9 handler::read_range_next
#11 ha_innobase::read_range_next
#12 handler::multi_range_read_next
#14 IndexRangeScanIterator::Read
#15 FilterIterator::Read
#16-25 NestedLoopIterator::Read (x10)
#26 LimitOffsetIterator::Read
#27 Query_expression::ExecuteIteratorQuery
#29 Sql_cmd_dml::execute_inner
#31 mysql_execute_command
#33 wsrep_dispatch_sql_command
#34 dispatch_command
#36 threadpool_process_request
#37 worker_main
In addition there are smaller numbers of crashes in purge/background threads and one semaphore wait timeout / hang. Full stacks for all 23 events can be shared privately if helpful.
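The applier / client / purge split we use throughout is based on which entry-point frames appear in each stack. A rough Python sketch of that classification follows. The applier and client markers are taken from the two stacks above; the purge markers (`row_purge`, `trx_purge`, `srv_purge`) are an assumption on our part, since no purge stack is reproduced in this post, and the function names are purely illustrative.

```python
import re

# Frame lines in the sanitised stacks look like "#4 btr_cur_latch_leaves"
# or "#16-25 NestedLoopIterator::Read (x10)".
FRAME_RE = re.compile(r"^#\d+(?:-\d+)?\s+(\S+)")

# Marker substrings per path. Applier and client markers come from the
# stacks quoted above; purge markers are an assumption.
APPLIER_MARKERS = ("wsrep_apply_events", "Wsrep_applier_service::")
PURGE_MARKERS = ("row_purge", "trx_purge", "srv_purge")
CLIENT_MARKERS = ("dispatch_command", "threadpool_process_request")


def parse_frames(stack_text):
    """Extract function names from '#N name' stack lines."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(match.group(1))
    return frames


def classify_path(frames):
    """Return 'applier', 'purge', 'client', or 'other' for a list of frames."""
    def has(markers):
        return any(marker in frame for frame in frames for marker in markers)

    if has(APPLIER_MARKERS):
        return "applier"
    if has(PURGE_MARKERS):
        return "purge"
    if has(CLIENT_MARKERS):
        return "client"
    return "other"
```

Note that the client stack contains `wsrep_dispatch_sql_command`, so matching on a bare `wsrep` prefix would misclassify it as applier; that is why the markers are specific frame names.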
What we’ve considered (and ruled in / out)
- Single optimistic-parallel-apply bug? This was our initial hypothesis given the applier-path majority and our settings (`cert.optimistic_pa=YES`, `wsrep_slave_threads=24`, AHI on). But it cannot directly explain the 6 client-SQL-path crashes or the purge-path crashes, which never enter the apply pipeline. So at most, optimistic PA is a contributor to the applier subset, not the whole picture.
- Latent on-disk B-tree corruption from applier crash, biting later SELECTs? Weakly supported. On Cozi 172.31.0.102, the 2026-05-01 client SELECT crash was preceded by applier crashes on the same node, but ~20 days earlier, not the hours-to-days window you’d expect if it were the same corrupted page being re-read. And the Cozi 172.31.0.103 client crash on 2026-03-31 was followed (not preceded) by a same-node applier/IST crash about 22 minutes later. So we don’t have a clean “applier crash → latent corruption → client crash” timeline.
- PXC-3729 (fixed 8.0.25-15.1): same family (parallel applier conflict) but fix is years old and should be in our build.
- PXC-4173, PXC-4498 (fixed 8.4.4-4): parallel-replication stalls, not crashes — unlikely match.
- Bug #118705 / #38310595 (fixed 8.4.8-8): “concurrency flaw in InnoDB … `que_eval_sql` interface”. This is the only candidate newer-version fix we’ve identified.
We searched bugs.mysql.com and Percona Jira for the specific assertion lines (btr0cur.cc:298, btr0pcur.cc:383) under both applier and client-SQL paths and didn’t find an exact match — happy to be pointed at one we missed.
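The timeline check we applied to the latent-corruption hypothesis above can be sketched in Python as follows. This is only an illustration of the reasoning: the `Crash` record and `client_crashes_after_applier` are hypothetical names, and the choice of window (we had hours to a few days in mind) is a judgment call rather than anything InnoDB defines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Crash:
    node: str        # node identifier, e.g. an IP address
    when: datetime   # crash timestamp from the error log
    path: str        # "applier", "client", "purge", or "other"


def client_crashes_after_applier(crashes, window):
    """Return (applier, client) pairs on the same node where the client
    crash happens after the applier crash and no later than `window`."""
    pairs = []
    for client in crashes:
        if client.path != "client":
            continue
        for applier in crashes:
            if applier.path != "applier" or applier.node != client.node:
                continue
            gap = client.when - applier.when
            if timedelta(0) < gap <= window:
                pairs.append((applier, client))
    return pairs
```

With a window of a few days, a client crash that follows an applier crash by roughly 20 days, or one that precedes the applier crash, produces no pair, which matches why we consider this hypothesis only weakly supported.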
Questions
- Are these one bug family or several? Can the community / Percona engineers tell from the signatures whether `btr0cur.cc:298` (in `btr_cur_latch_leaves` during INSERT) and `btr0pcur.cc:383` (in persistent-cursor `move_to_next_page` during SELECT range scan) plausibly share a root cause, or should we treat them as independent issues?
- Is there a known bug for either signature on 8.4.6? Any specific Jira ticket(s) we should track?
- Has either signature been fixed in 8.4.7-7 or 8.4.8-8? We are currently on 8.4.6-6.1 and would prefer to skip 8.4.7 and go straight to 8.4.8-8 if that is the version with the fix. Confirmation either way would help us scope the upgrade work.
- For the applier subset specifically, would the community recommend disabling `cert.optimistic_pa`, disabling AHI, or reducing `wsrep_slave_threads` as the lowest-risk mitigation while we plan an upgrade?
- Diagnostics: any `wsrep_debug` flags or InnoDB diagnostic settings you would recommend enabling so the next occurrence captures more useful detail?
Thanks in advance for any pointers.