Nodes got aborted one by one when creating/dropping index

hung_le · July 11, 2025, 2:13pm

Hi all,

I have a 3-node cluster.

mysql> SHOW VARIABLES LIKE 'version%';
+-------------------------+---------------------------------------------------------------------------------------+
| Variable_name           | Value                                                                                 |
+-------------------------+---------------------------------------------------------------------------------------+
| version                 | 8.0.37-29.1                                                                           |
| version_comment         | Percona XtraDB Cluster (GPL), Release rel29, Revision d29a325, WSREP version 26.1.4.3 |
| version_compile_machine | x86_64                                                                                |
| version_compile_os      | Linux                                                                                 |
| version_compile_zlib    | 1.2.13                                                                                |
| version_suffix          | .1                                                                                    |
+-------------------------+---------------------------------------------------------------------------------------+
6 rows in set (0.01 sec)

In front of those 3 nodes, I have ProxySQL:

+--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname   | port | gtid_port | status  | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 10           | prod-db-01 | 3306 | 0         | SHUNNED | 10000  | 0           | 1000            | 0                   | 0       | 0              |         |
| 10           | prod-db-02 | 3306 | 0         | ONLINE  | 10000  | 0           | 1000            | 0                   | 0       | 0              |         |
| 10           | prod-db-03 | 3306 | 0         | SHUNNED | 9999   | 0           | 1000            | 0                   | 0       | 0              |         |
| 20           | prod-db-01 | 3306 | 0         | ONLINE  | 10000  | 0           | 1000            | 0                   | 0       | 0              |         |
| 20           | prod-db-03 | 3306 | 0         | ONLINE  | 9999   | 0           | 1000            | 0                   | 0       | 0              |         |
| 30           | prod-db-01 | 3306 | 0         | ONLINE  | 10000  | 0           | 1000            | 0                   | 0       | 0              |         |
| 30           | prod-db-03 | 3306 | 0         | ONLINE  | 9999   | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+------------+------+-----------+---------+--------+-------------+-----------------+---------------------+---------+----------------+---------+

I applied a read/write split with these query rules:

*************************** 1. row ***************************
rule_id: 100
active: 1
username: NULL
schemaname: NULL
flagIN: 0
client_addr: NULL
proxy_addr: NULL
proxy_port: NULL
digest: NULL
match_digest: NULL
match_pattern: ^SELECT .* FOR UPDATE
negate_match_pattern: 0
re_modifiers: CASELESS
flagOUT: NULL
replace_pattern: NULL
destination_hostgroup: 10
cache_ttl: NULL
cache_empty_result: NULL
cache_timeout: NULL
reconnect: NULL
timeout: NULL
retries: NULL
delay: NULL
next_query_flagIN: NULL
mirror_flagOUT: NULL
mirror_hostgroup: NULL
error_msg: NULL
OK_msg: NULL
sticky_conn: NULL
multiplex: NULL
gtid_from_hostgroup: NULL
log: NULL
apply: 1
attributes:
comment: NULL
*************************** 2. row ***************************
rule_id: 200
active: 1
username: NULL
schemaname: NULL
flagIN: 0
client_addr: NULL
proxy_addr: NULL
proxy_port: NULL
digest: NULL
match_digest: NULL
match_pattern: ^SELECT .*
negate_match_pattern: 0
re_modifiers: CASELESS
flagOUT: NULL
replace_pattern: NULL
destination_hostgroup: 30
cache_ttl: NULL
cache_empty_result: NULL
cache_timeout: NULL
reconnect: NULL
timeout: NULL
retries: NULL
delay: NULL
next_query_flagIN: NULL
mirror_flagOUT: NULL
mirror_hostgroup: NULL
error_msg: NULL
OK_msg: NULL
sticky_conn: NULL
multiplex: NULL
gtid_from_hostgroup: NULL
log: NULL
apply: 1
attributes:
comment: NULL
*************************** 3. row ***************************
rule_id: 300
active: 1
username: NULL
schemaname: NULL
flagIN: 0
client_addr: NULL
proxy_addr: NULL
proxy_port: NULL
digest: NULL
match_digest: NULL
match_pattern: .*
negate_match_pattern: 0
re_modifiers: CASELESS
flagOUT: NULL
replace_pattern: NULL
destination_hostgroup: 10
cache_ttl: NULL
cache_empty_result: NULL
cache_timeout: NULL
reconnect: NULL
timeout: NULL
retries: NULL
delay: NULL
next_query_flagIN: NULL
mirror_flagOUT: NULL
mirror_hostgroup: NULL
error_msg: NULL
OK_msg: NULL
sticky_conn: NULL
multiplex: NULL
gtid_from_hostgroup: NULL
log: NULL
apply: 1
attributes:
comment: NULL
3 rows in set (0.00 sec)
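
A side note on these rules: DDL such as CREATE INDEX does not match either ^SELECT pattern, so rule 300 would send it to hostgroup 10 whenever it passes through ProxySQL. Below is a minimal sketch for checking where statements actually landed. It assumes it is run on the ProxySQL admin interface and that the stats tables have not been reset since the incident. The LIKE patterns are only illustrative. If the DDL client connected straight to a node, nothing will show up here.

```sql
-- Run on the ProxySQL admin interface, not on a cluster node.

-- How often each query rule has matched so far.
SELECT rule_id, hits
FROM stats_mysql_query_rules
ORDER BY rule_id;

-- Which hostgroup received index-related DDL
-- (the digest_text patterns are illustrative assumptions).
SELECT hostgroup, digest_text, count_star
FROM stats_mysql_query_digest
WHERE digest_text LIKE 'CREATE INDEX%'
   OR digest_text LIKE 'DROP INDEX%'
   OR digest_text LIKE 'ALTER TABLE%'
ORDER BY count_star DESC;
```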

Before the crash, we ran a script (over SSH, directly on Node02), but it failed because of a missing index:

ERROR 1822 (HY000) at line 94: Failed to add the foreign key constraint. Missing index for constraint 'fk_ms_shortname' in the referenced table 'catalogitems'
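
For context, error 1822 means the referenced (parent) table has no index that starts with the referenced columns in the same order. A read-only sketch for listing the indexes that exist on catalogitems, assuming the schema that owns the table is the current default database:

```sql
-- List every index on the referenced table, one row per indexed column.
-- Assumes the schema containing catalogitems is selected (USE ...).
SELECT INDEX_NAME, SEQ_IN_INDEX, COLUMN_NAME, NON_UNIQUE
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'catalogitems'
ORDER BY INDEX_NAME, SEQ_IN_INDEX;
```

The foreign key can only be created if one of these indexes has the referenced column at SEQ_IN_INDEX = 1.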

Then, we tried to create and drop indexes through DBeaver while users were still querying and updating the database, and the cluster crashed (around 2025-07-10T12:45).
These are the log files for Node01, Node02, and Node03:
Node01 log
Node02 log
Node03 log

To bring the cluster back, I had to restart Node01 and Node03 (the nodes that aborted), after which SST started.
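
A minimal sketch of how a node's membership can be checked once SST finishes, using the standard wsrep status variables. The values in the comments are what a healthy 3-node cluster would normally report, not output taken from this incident.

```sql
-- Run on each node after it rejoins.
SHOW GLOBAL STATUS
WHERE Variable_name IN (
  'wsrep_cluster_size',         -- normally 3 when all nodes are members
  'wsrep_cluster_status',       -- normally Primary
  'wsrep_local_state_comment',  -- normally Synced once SST/IST is done
  'wsrep_ready'                 -- normally ON
);
```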

Please help me identify the root cause. I tried to reproduce the issue by running queries while creating and dropping indexes, but I could not get the nodes to abort again.

Thanks.

jrivera · July 12, 2025, 12:26am

You could be affected by this bug:

I would suggest upgrading to the fixed version (8.0.36-28) or better yet the latest release.

For DDL operations, we highly recommend pt-online-schema-change with the TOI method.

Alternatively, check this blog post for the RSU method:
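
As a rough illustration of the RSU approach (a sketch, not a tested procedure): wsrep_OSU_method can be switched per session, so the DDL is applied only on the local node and is not replicated. It then has to be repeated on every node, ideally with that node taken out of application traffic first. The index name idx_shortname and the column shortname are hypothetical placeholders.

```sql
-- Check the current schema-upgrade method (TOI is the default).
SHOW VARIABLES LIKE 'wsrep_OSU_method';

-- Switch this session to RSU: DDL runs locally and is not replicated.
SET SESSION wsrep_OSU_method = 'RSU';

-- Hypothetical index; replace the name and column with the real ones.
ALTER TABLE catalogitems ADD INDEX idx_shortname (shortname);

-- Switch back so later DDL in this session replicates again.
SET SESSION wsrep_OSU_method = 'TOI';
```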
