PXC 8.4.6-6.1: recurring MY-013183 assertions (btr0cur.cc:298, btr0pcur.cc:383)

Hi all,

We’re seeing a sustained pattern of InnoDB B-tree and page-invariant assertion failures across two independent 3-node Percona XtraDB Cluster deployments running the same build, and would appreciate any insight — particularly whether the signatures map to a known Jira ticket and whether a newer 8.4 patch (8.4.7-7 or 8.4.8-8) is likely to address them.

Originally we suspected a single bug on the wsrep applier path, but a careful re-read of the live error logs shows multiple distinct signatures across applier, client-SQL, and background purge paths. We’re posting in case anyone recognises the family.

Environment

  • Percona XtraDB Cluster: 8.4.6-6.1 (Release rel6, Revision 9ca703c)
  • WSREP version: 26.1.4.3
  • Galera provider: 4.23 (cb05b32) — libgalera_smm.so
  • BuildID[sha1]: 74d46472b3f75773c0f80c776fe8c9a2c5bc589a (identical across both clusters)
  • OS: Debian 12, kernel 6.1.0-31-amd64 / 6.1.0-40-amd64
  • Topology: Two independent 3-node clusters (“BI” and “Cozi”), each multi-master. Versions confirmed consistent across all six nodes.
  • Workload: Moderate OLTP. Live samples per writer: ~270 Galera write sets/sec, ~850–1,200 client Questions/sec; ~1.5k Questions/sec summed across nodes.

Key relevant settings (verified at runtime on all six nodes):

cert.optimistic_pa         = YES
wsrep_slave_threads        = 24
innodb_adaptive_hash_index = ON
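
Since cert.optimistic_pa is a Galera provider option, it is read from the wsrep_provider_options string rather than from a server variable of its own. A minimal sketch of pulling it out of that string, assuming the usual "key = value; key = value" layout (the sample string below is a shortened, hypothetical value, not output from our nodes):

```python
def parse_provider_options(raw: str) -> dict[str, str]:
    """Split a 'key = value; key = value' string into a dict."""
    options = {}
    for item in raw.split(";"):
        if "=" not in item:
            continue  # skip the empty fragment after the trailing ';'
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


# Hypothetical, shortened sample of a wsrep_provider_options value.
sample = "cert.log_conflicts = NO; cert.optimistic_pa = YES; gcache.size = 128M; "
opts = parse_provider_options(sample)
print(opts["cert.optimistic_pa"])
```

Running the same check against the value from each node is one way to confirm the setting is consistent across the cluster.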

Symptom summary

Live error logs (no rotated history available) show 23 explicit [MY-013183] assertion failures plus 2 non-assertion "mysqld got signal" blocks, distributed as follows:

Cluster / node      Assertions  Applier path  Client SQL  Purge / bg  Other
BI 172.35.0.161     3           2             0           1           -
BI 172.35.0.162     4           1             2           -           1 sema hang
BI 172.35.0.163     7           4             2           1           -
Cozi 172.31.0.101   0           0             0           0           -
Cozi 172.31.0.102   7           6             1           0           -
Cozi 172.31.0.103   2           1             1           0           -
Totals              23          14            6           2           1
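
As a quick sanity check on the table, here is a small sketch that re-adds the per-node counts. It assumes the "1 sema hang" entry for 172.35.0.162 belongs in the Other column and treats empty cells as 0; with that reading, each row and the Totals line are consistent.

```python
# Per-node counts copied from the table:
# (events, applier path, client SQL, purge/bg, other)
rows = {
    "BI 172.35.0.161": (3, 2, 0, 1, 0),
    "BI 172.35.0.162": (4, 1, 2, 0, 1),  # assumption: sema hang counted as Other
    "BI 172.35.0.163": (7, 4, 2, 1, 0),
    "Cozi 172.31.0.101": (0, 0, 0, 0, 0),
    "Cozi 172.31.0.102": (7, 6, 1, 0, 0),
    "Cozi 172.31.0.103": (2, 1, 1, 0, 0),
}

# Each row's event count should equal the sum of its path columns.
for node, (events, *paths) in rows.items():
    assert events == sum(paths), node

# Column-wise sums should reproduce the Totals line.
totals = tuple(sum(column) for column in zip(*rows.values()))
print(totals)  # (23, 14, 6, 2, 1)
```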

Plus 2 non-assertion signal blocks on BI: one Galera/client commit abort, one shutdown SIGSEGV.

Note: one Cozi node (172.31.0.101) has seen zero crashes in the current log window despite an identical build and workload class. This per-node asymmetry is consistent with race-based bugs rather than corruption of shared content (the cluster replicates everything, so corruption of shared content would not be node-local).

Representative stacks (sanitised)

(a) Applier path — BI 172.35.0.163, 2026-05-09

[MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:
  btr_page_get_prev(get_block->frame, mtr) == page_get_page_no(page)

#4  btr_cur_latch_leaves
#5  btr_cur_search_to_nth_level
#7  row_ins_clust_index_entry_low
#8  row_ins_clust_index_entry
#9  row_ins_step
#11 row_insert_for_mysql
#12 ha_innobase::write_row
#13 handler::ha_write_row
#14 Write_rows_log_event::write_row
#15 Write_rows_log_event::do_exec_row
#17 Rows_log_event::do_apply_event
#19 wsrep_apply_events
#21 Wsrep_applier_service::apply_write_set
#31 start_wsrep_THD

(b) Client SQL path — Cozi 172.31.0.102, 2026-05-01

[MY-013183] [InnoDB] Assertion failure: btr0pcur.cc:383:
  cur_page == prev_of_next

#4  btr_pcur_t::move_to_next_page
#5  btr_pcur_t::move_to_next
#6  row_search_mvcc
#7  ha_innobase::general_fetch
#8  handler::ha_index_next
#9  handler::read_range_next
#11 ha_innobase::read_range_next
#12 handler::multi_range_read_next
#14 IndexRangeScanIterator::Read
#15 FilterIterator::Read
#16-25 NestedLoopIterator::Read (x10)
#26 LimitOffsetIterator::Read
#27 Query_expression::ExecuteIteratorQuery
#29 Sql_cmd_dml::execute_inner
#31 mysql_execute_command
#33 wsrep_dispatch_sql_command
#34 dispatch_command
#36 threadpool_process_request
#37 worker_main

In addition there are smaller numbers of crashes in purge/background threads and one semaphore wait timeout / hang. Full stacks for all 23 events can be shared privately if helpful.
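
For anyone who wants to reproduce the per-path counts from their own logs, the classification can be sketched roughly as below. This is a heuristic under stated assumptions: it keys on the "[MY-013183] [InnoDB] Assertion failure: <file>:<line>" text and on the frame names wsrep_apply_events and dispatch_command seen in the stacks above; classify and tally are illustrative names, and real stack lines carry more detail than the sanitised frames used here.

```python
import re
from collections import Counter

# Matches e.g. "[MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:"
ASSERT_RE = re.compile(
    r"\[MY-013183\] \[InnoDB\] Assertion failure: (\S+?\.cc:\d+)"
)


def classify(stack_text: str) -> str:
    """Guess the code path from frame names present in the stack text."""
    if "wsrep_apply_events" in stack_text:
        return "applier"
    if "dispatch_command" in stack_text:
        return "client"
    return "other"


def tally(log_text: str) -> Counter:
    """Count (signature, path) pairs; each chunk runs to the next assertion."""
    counts = Counter()
    matches = list(ASSERT_RE.finditer(log_text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        chunk = log_text[match.start():end]
        counts[(match.group(1), classify(chunk))] += 1
    return counts


# Abbreviated versions of stacks (a) and (b) above.
sample_log = """\
[MY-013183] [InnoDB] Assertion failure: btr0cur.cc:298:
#4  btr_cur_latch_leaves
#19 wsrep_apply_events
[MY-013183] [InnoDB] Assertion failure: btr0pcur.cc:383:
#4  btr_pcur_t::move_to_next_page
#34 dispatch_command
"""
for (signature, path), n in sorted(tally(sample_log).items()):
    print(signature, path, n)
```

Anything that lands in "other" (purge or background threads, for example) still needs a manual look at the stack.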

What we’ve considered (and ruled in / out)

  • Single optimistic-parallel-apply bug? This was our initial hypothesis given the applier-path majority and our settings (cert.optimistic_pa=YES, wsrep_slave_threads=24, AHI on). But it cannot directly explain the 6 client-SQL-path crashes or the purge-path crashes, which never enter the apply pipeline. So at most, optimistic PA is a contributor to the applier subset, not the whole picture.
  • Latent on-disk B-tree corruption from applier crash, biting later SELECTs? Weakly supported. On Cozi 172.31.0.102, the 2026-05-01 client SELECT crash was preceded by applier crashes on the same node, but ~20 days earlier — not the hours-to-days window you’d expect if it were the same corrupted page being re-read. And the Cozi 172.31.0.103 client crash on 2026-03-31 was followed (not preceded) by a same-node applier/IST crash about 22 minutes later. So we don’t have a clean “applier crash → latent corruption → client crash” timeline.
  • PXC-3729 (fixed 8.0.25-15.1): same family (parallel applier conflict), but the fix is years old and should be in our build.
  • PXC-4173, PXC-4498 (fixed 8.4.4-4): parallel-replication stalls, not crashes, so an unlikely match.
  • Bug #118705 / #38310595 (fixed 8.4.8-8): “concurrency flaw in InnoDB … que_eval_sql interface”. This is the only candidate newer-version fix we’ve identified.

We searched bugs.mysql.com and Percona Jira for the specific assertion lines (btr0cur.cc:298, btr0pcur.cc:383) under both applier and client-SQL paths and didn’t find an exact match — happy to be pointed at one we missed.

Questions

  1. Are these one bug family or several? Can the community / Percona engineers tell from the signatures whether btr0cur.cc:298 (in btr_cur_latch_leaves during INSERT) and btr0pcur.cc:383 (in persistent-cursor move_to_next_page during SELECT range scan) plausibly share a root cause, or should we treat them as independent issues?
  2. Is there a known bug for either signature on 8.4.6? Any specific Jira ticket(s) we should track?
  3. Has either signature been fixed in 8.4.7-7 or 8.4.8-8? We are currently on 8.4.6-6.1 and would prefer to skip 8.4.7 and go straight to 8.4.8-8 if that is the version with the fix. Confirmation either way would help us scope the upgrade work.
  4. For the applier subset specifically, would the community recommend disabling cert.optimistic_pa, disabling AHI, or reducing wsrep_slave_threads as the lowest-risk mitigation while we plan an upgrade?
  5. Diagnostics: any wsrep_debug flags or InnoDB diagnostic settings you would recommend enabling so the next occurrence captures more useful detail?

Thanks in advance for any pointers.

@abhi_sati I checked the crash pattern against both our Jira and upstream, but didn’t find anything conclusive for the behaviour you are facing.

Whether to tweak those variables depends on some historical details, such as SHOW ENGINE INNODB STATUS output, the processlist, or any clear mutex/semaphore details.

The details below from the crashing instance would be useful. Can you please share them?

  1. PXC/MySQL error logs
  2. Kernel/os logs (dmesg -T)
  3. Output of “SHOW ENGINE INNODB STATUS\G”, or any query/workload details you have.
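
A rough sketch of how these could be gathered in one go on a crashing node. It assumes the mysql client can log in without prompting (for example via ~/.my.cnf) and that dmesg is readable by the current user; build_commands, collect and the output file names are illustrative only. The error log itself would still need to be copied separately from the path that SELECT @@log_error reports.

```python
import subprocess
from pathlib import Path


def build_commands() -> dict[str, list[str]]:
    """Map an output file name to the command that produces its content."""
    return {
        "dmesg.txt": ["dmesg", "-T"],
        "innodb_status.txt": ["mysql", "-e", "SHOW ENGINE INNODB STATUS\\G"],
        "processlist.txt": ["mysql", "-e", "SHOW FULL PROCESSLIST"],
        "log_error_path.txt": ["mysql", "-N", "-e", "SELECT @@log_error"],
    }


def collect(outdir: str) -> None:
    """Run each command and store stdout plus stderr under outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, argv in build_commands().items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
            text = result.stdout + result.stderr
        except OSError as exc:  # e.g. the binary is not installed
            text = f"failed to run {argv[0]}: {exc}\n"
        (out / filename).write_text(text)
```

Calling collect("pxc-diagnostics") would then leave one file per command in that directory, ready to attach.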

I also didn’t see any specific change in the latest versions [PXC 8.4.7, 8.4.8] that would let me directly recommend an upgrade; however, if the crash happens repeatedly, it’s worth testing the same configuration/workload on the latest version.

Maybe you can also share the configuration [my.cnf] from one of the crashing nodes.

Hi Anil,

I hope you’re well.

My apologies for the late reply. I was off on holiday and team members forgot to check my emails.

Please find attached some output from our setup. Hopefully it will help you diagnose this better.

Kind regards,
Abhi

(Attachment pxc-analysis-cozi-percona-2-20260702T094020Z.s20b9y.tar.gz is missing)