PXC 8.4.6-6.1: recurring MY-013183 assertions (btr0cur.cc:298, btr0pcur.cc:383)

Hi all,

We’re seeing a sustained pattern of InnoDB B-tree and page-invariant assertion failures across two independent 3-node Percona XtraDB Cluster deployments running the same build, and would appreciate any insight — particularly whether the signatures map to a known Jira ticket and whether a newer 8.4 patch (8.4.7-7 or 8.4.8-8) is likely to address them.

Originally we suspected a single bug on the wsrep applier path, but a careful re-read of the live error logs shows multiple distinct signatures across applier, client-SQL, and background purge paths. We’re posting in case anyone recognises the family.

Environment

  • Percona XtraDB Cluster: 8.4.6-6.1 (Release rel6, Revision 9ca703c)
  • WSREP version: 26.1.4.3
  • Galera provider: 4.23 (cb05b32) — libgalera_smm.so
  • BuildID[sha1]: 74d46472b3f75773c0f80c776fe8c9a2c5bc589a (identical across both clusters)
  • OS: Debian 12, kernel 6.1.0-31-amd64 / 6.1.0-40-amd64
  • Topology: Two independent 3-node clusters (“BI” and “Cozi”), each multi-master. Versions confirmed consistent across all six nodes.
  • Workload: Moderate OLTP. Live samples per writer: ~270 Galera write sets/sec, ~850–1,200 client Questions/sec; ~1.5k Questions/sec summed across nodes.

Key relevant settings (verified at runtime on all six nodes):

cert.optimistic_pa         = YES
wsrep_slave_threads        = 24
innodb_adaptive_hash_index = ON
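
A note for anyone re-checking these values on their own nodes: cert.optimistic_pa is a Galera provider option, so it is reported inside the semicolon-separated wsrep_provider_options string rather than as a standalone server variable. A minimal Python sketch for pulling one key out of that string follows. The sample string is a shortened, hypothetical value, not output copied from our nodes.

```python
def parse_provider_options(raw: str) -> dict[str, str]:
    """Split a 'key = value; key = value' string into a dict."""
    options = {}
    for item in raw.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip empty fragments such as the one after a trailing ';'
            options[key.strip()] = value.strip()
    return options


# Hypothetical, shortened sample; a real value carries many more keys.
sample = "cert.log_conflicts = no; cert.optimistic_pa = yes; gcache.size = 128M; "

opts = parse_provider_options(sample)
print(opts["cert.optimistic_pa"].upper())  # YES
```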

Symptom summary

Live error logs (no rotated history available) show 23 explicit [MY-013183] assertion failures plus 2 non-assertion "mysqld got signal" blocks, distributed as follows:

Cluster / node      Assertions   Applier path   Client SQL   Purge / bg   Other
BI 172.35.0.161     3            2              0            1            0
BI 172.35.0.162     4            1              2            0            1 (sema hang)
BI 172.35.0.163     7            4              2            1            0
Cozi 172.31.0.101   0            0              0            0            0
Cozi 172.31.0.102   7            6              1            0            0
Cozi 172.31.0.103   2            1              1            0            0
Totals              23           14             6            2            1
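
As a sanity check on the table above, the per-node counts can be re-added in a few lines of Python. This is only a sketch of the arithmetic: it reads any empty cell as 0 and places the sema hang on 172.35.0.162 in the Other column, which is the reading under which the row and column totals agree.

```python
# (assertions, applier, client_sql, purge_bg, other) per node, from the table.
rows = {
    "BI 172.35.0.161":   (3, 2, 0, 1, 0),
    "BI 172.35.0.162":   (4, 1, 2, 0, 1),  # "other" is the sema hang
    "BI 172.35.0.163":   (7, 4, 2, 1, 0),
    "Cozi 172.31.0.101": (0, 0, 0, 0, 0),
    "Cozi 172.31.0.102": (7, 6, 1, 0, 0),
    "Cozi 172.31.0.103": (2, 1, 1, 0, 0),
}

# Each row's assertion count should equal the sum of its path columns.
for node, (assertions, *paths) in rows.items():
    assert assertions == sum(paths), node

# Column totals across all six nodes.
totals = tuple(sum(col) for col in zip(*rows.values()))
print(totals)  # (23, 14, 6, 2, 1)

# Share of assertions that fired on the applier path.
applier_share = totals[1] / totals[0]
print(f"{applier_share:.0%}")  # 61%
```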

Plus 2 non-assertion signal blocks on BI: one Galera/client commit abort, one shutdown SIGSEGV.

Note: one Cozi node (172.31.0.101) has seen zero crashes in the current log window despite identical build and workload class. This per-node asymmetry looks more like a timing-dependent race than corruption of shared data: the cluster replicates every write set to all nodes, so a problem in the replicated content itself should not stay confined to particular nodes.

Representative stacks (sanitised)

(a) Applier path — BI 172.35.0.163, 2026-05-09

[MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:
  btr_page_get_prev(get_block->frame, mtr) == page_get_page_no(page)

#4  btr_cur_latch_leaves
#5  btr_cur_search_to_nth_level
#7  row_ins_clust_index_entry_low
#8  row_ins_clust_index_entry
#9  row_ins_step
#11 row_insert_for_mysql
#12 ha_innobase::write_row
#13 handler::ha_write_row
#14 Write_rows_log_event::write_row
#15 Write_rows_log_event::do_exec_row
#17 Rows_log_event::do_apply_event
#19 wsrep_apply_events
#21 Wsrep_applier_service::apply_write_set
#31 start_wsrep_THD

(b) Client SQL path — Cozi 172.31.0.102, 2026-05-01

[MY-013183] [InnoDB] Assertion failure: btr0pcur.cc:383:
  cur_page == prev_of_next

#4  btr_pcur_t::move_to_next_page
#5  btr_pcur_t::move_to_next
#6  row_search_mvcc
#7  ha_innobase::general_fetch
#8  handler::ha_index_next
#9  handler::read_range_next
#11 ha_innobase::read_range_next
#12 handler::multi_range_read_next
#14 IndexRangeScanIterator::Read
#15 FilterIterator::Read
#16-25 NestedLoopIterator::Read (x10)
#26 LimitOffsetIterator::Read
#27 Query_expression::ExecuteIteratorQuery
#29 Sql_cmd_dml::execute_inner
#31 mysql_execute_command
#33 wsrep_dispatch_sql_command
#34 dispatch_command
#36 threadpool_process_request
#37 worker_main

In addition there are smaller numbers of crashes in purge/background threads and one semaphore wait timeout / hang. Full stacks for all 23 events can be shared privately if helpful.

What we’ve considered (and ruled in / out)

  • Single optimistic-parallel-apply bug? This was our initial hypothesis given the applier-path majority and our settings (cert.optimistic_pa=YES, wsrep_slave_threads=24, AHI on). But it cannot directly explain the 6 client-SQL-path crashes or the purge-path crashes, which never enter the apply pipeline. So at most, optimistic PA is a contributor to the applier subset, not the whole picture.
  • Latent on-disk B-tree corruption from applier crash, biting later SELECTs? Weakly supported. On Cozi 172.31.0.102, the 2026-05-01 client SELECT crash was preceded by applier crashes on the same node, but ~20 days earlier — not the hours-to-days window you’d expect if it were the same corrupted page being re-read. And the Cozi 172.31.0.103 client crash on 2026-03-31 was followed (not preceded) by a same-node applier/IST crash about 22 minutes later. So we don’t have a clean “applier crash → latent corruption → client crash” timeline.
  • PXC-3729 (fixed 8.0.25-15.1): same family (parallel applier conflict) but fix is years old and should be in our build.
  • PXC-4173, PXC-4498 (fixed 8.4.4-4): parallel-replication stalls, not crashes — unlikely match.
  • Bug #118705 / #38310595 (fixed 8.4.8-8): “concurrency flaw in InnoDB … que_eval_sql interface” — only candidate newer-version fix we’ve identified.
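
The timeline test behind the latent-corruption bullet above can be written down explicitly. The Python sketch below asks whether a client crash was preceded by a same-node applier crash within a chosen window. The dates come from this post, but the times of day, the exact date standing in for "~20 days earlier", and the 3-day window used for "hours-to-days" are all placeholders.

```python
from datetime import datetime, timedelta


def applier_crash_precedes(client_crash, applier_crashes, window):
    """True if any applier crash happened before the client crash, within `window`."""
    return any(timedelta(0) < client_crash - t <= window for t in applier_crashes)


WINDOW = timedelta(days=3)  # assumed reading of "hours-to-days"

# Cozi 172.31.0.102: client crash on 2026-05-01, applier crash ~20 days earlier.
node_102 = applier_crash_precedes(
    datetime(2026, 5, 1, 12, 0),     # time of day is a placeholder
    [datetime(2026, 4, 11, 12, 0)],  # placeholder for "~20 days earlier"
    WINDOW,
)

# Cozi 172.31.0.103: client crash on 2026-03-31, applier/IST crash 22 minutes later.
node_103 = applier_crash_precedes(
    datetime(2026, 3, 31, 12, 0),     # time of day is a placeholder
    [datetime(2026, 3, 31, 12, 22)],  # after the client crash, so it cannot precede it
    WINDOW,
)

print(node_102, node_103)  # False False
```

Neither node fits the pattern under these assumptions, which matches the conclusion in the bullet.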

We searched bugs.mysql.com and Percona Jira for the specific assertion lines (btr0cur.cc:298, btr0pcur.cc:383) under both applier and client-SQL paths and didn’t find an exact match — happy to be pointed at one we missed.

Questions

  1. Are these one bug family or several? Can the community / Percona engineers tell from the signatures whether btr0cur.cc:298 (in btr_cur_latch_leaves during INSERT) and btr0pcur.cc:383 (in persistent-cursor move_to_next_page during SELECT range scan) plausibly share a root cause, or should we treat them as independent issues?
  2. Is there a known bug for either signature on 8.4.6? Any specific Jira ticket(s) we should track?
  3. Has either signature been fixed in 8.4.7-7 or 8.4.8-8? We are currently on 8.4.6-6.1 and would prefer to skip 8.4.7 and go straight to 8.4.8-8 if that is the version with the fix. Confirmation either way would help us scope the upgrade work.
  4. For the applier subset specifically, would the community recommend disabling cert.optimistic_pa, disabling AHI, or reducing wsrep_slave_threads as the lowest-risk mitigation while we plan an upgrade?
  5. Diagnostics: any wsrep_debug flags or InnoDB diagnostic settings you would recommend enabling so the next occurrence captures more useful detail?

Thanks in advance for any pointers.