Percona XtraDB Cluster Ver 8.0.36-28.1 Cluster stalls

We are running into a frequent and problematic issue with our Percona XtraDB Cluster, which consists of 4 nodes and an arbitrator running as EC2 instances in AWS, split across two availability zones in the N. Virginia region. We are on version 8.0.36-28.1.

Every several days, one of the nodes stalls and causes the cluster to stop responding. Essentially, the cluster goes into full flow control while the receive queue on the problem node builds. In the MySQL process list we consistently see tasks stuck in the "freeing items" step, and those stuck items seem to be the oldest queries in the process list. The only way to recover seems to be to kill the impacted node. We are unable to kill those MySQL threads; we get an error like the one below: "ERROR 1095 (HY000): The query is in TOI/NBO and cannot be killed". Almost all other threads are stuck in "wsrep: replicating and certifying write set", and the InnoDB status from the previous incident suggests it is waiting on an exclusive lock.

The node usually has to do an SST after this occurs. We don't have a core dump from when this happens, nor a repeatable test case that triggers the error. At times we also appear to have issues with the donor node when doing an SST to recover the cluster. Is anyone else experiencing something like this? Is there any information we can collect that would help debug the issue?
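In case it helps others compare notes, this is roughly the snapshot we try to capture the moment flow control kicks in (these are all standard MySQL/Galera statements; the particular selection is just our habit):

```sql
-- Flow-control state and receive-queue depth on the suspect node
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';

-- Full thread list, including the ones stuck in 'freeing items'
SHOW FULL PROCESSLIST;

-- The SEMAPHORES section is the interesting part during these stalls
SHOW ENGINE INNODB STATUS\G
```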

| 11584386 | scrubbed | scrubbed_IP | scrubbed_dbname | Query | 249 | freeing items | INSERT INTO `scrubbed statement` | 0 | 0 |
mysql> kill 11584386;
ERROR 1095 (HY000): The query is in TOI/NBO and cannot be killed

Do you have any ALTER TABLE statements running? What other queries are running? What do your disks look like? CPU?

No ALTER TABLE statements were running; we run DDL-type schema changes off hours to mitigate the risk of cluster stalls. Some REVOKE/GRANT user-permission statements do run from time to time, though. The process list had 168 rows at the time, and most of them during the incident were of the "wsrep: replicating and certifying write set" type, related to INSERTs. There is an LVM cache pool set up for reads; it is configured as write-through for writes. The local instance storage also holds the binlogs and tmp-table storage areas. The nodes are r6id.2xlarge (8 vCPU, 64 GB), with an EBS GP3 volume set consisting of 7 disks for MySQL data. There is about 864 GB of databases on the clustered nodes.

We do leverage Grafana, of course. When all four nodes are taking load during the busy parts of the day, they are mostly around 15 to 35% CPU usage; sometimes one of the nodes might go to 50% to 60%.

We just had another event with the same "freeing items" signature. We were in solid flow control (no writes allowed) for about 9 minutes until I shut down MySQL on the offending node. I am curious what other kinds of information or logs might be useful for troubleshooting the issue and understanding whether there is some sort of software bug.

Some MySQL log info:

InnoDB: ###### Diagnostic info printed to the standard error stream

2024-05-24T15:00:36.832366Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 140346656147200 has waited at trx0undo.ic line 126 for 550 seconds the semaphore:

X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815

a writer (thread id 140305419962112) has reserved it in mode SX

number of readers 0, waiters flag 1, lock_word: 10000000

Last time read locked in file trx0undo.ic line 141

Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57

2024-05-24T15:00:36.832425Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 140306063247104 has waited at trx0types.h line 186 for 549 seconds the semaphore:

Mutex at 0x7fa4dc45e508, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 140346656147200

InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:

InnoDB: Pending preads 0, pwrites 0

SEMAPHORES

----------

OS WAIT ARRAY INFO: reservation count 481696

--Thread 140346656147200 has waited at trx0undo.ic line 126 for 565 seconds the semaphore:

X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815

a writer (thread id 140305419962112) has reserved it in mode SX

number of readers 0, waiters flag 1, lock_word: 10000000

Last time read locked in file trx0undo.ic line 141

Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57

--Thread 140306063247104 has waited at trx0types.h line 186 for 564 seconds the semaphore:

Mutex at 0x7fa4dc45e508, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 140346656147200

OS WAIT ARRAY INFO: signal count 1120161

RW-shared spins 0, rounds 0, OS waits 0

RW-excl spins 0, rounds 0, OS waits 0

RW-sx spins 0, rounds 0, OS waits 0

Spin rounds per wait: 0.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx

The final entry in the log before MySQL finally shut down is:

============================

InnoDB: ###### Diagnostic info printed to the standard error stream

2024-05-24T15:01:38.832854Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 140346656147200 has waited at trx0undo.ic line 126 for 612 seconds the semaphore:

X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815

a writer (thread id 140305419962112) has reserved it in mode SX

number of readers 0, waiters flag 1, lock_word: 10000000

Last time read locked in file trx0undo.ic line 141

Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57

2024-05-24T15:01:38.832901Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:

--Thread 140306063247104 has waited at trx0types.h line 186 for 611 seconds the semaphore:

You have UNDO contention. I would watch this server for disk latency (await in iostat), because InnoDB is unable to write/purge UNDO fast enough on this specific server. Since this only happens on this server, something is odd/wrong with this server's disk.

I suggest adding more undo tablespaces and ensuring you have undo truncation enabled. By adding 2 or 3 more UNDO files, you might mitigate whatever disk issue is troubling this server.

https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html#innodb-add-undo-tablespaces
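As a sketch, adding undo tablespaces and enabling truncation looks something like this (the tablespace names and the size threshold below are placeholders you would adjust for your environment):

```sql
-- Add two more undo tablespaces (names are examples; files must end in .ibu)
CREATE UNDO TABLESPACE undo_003 ADD DATAFILE 'undo_003.ibu';
CREATE UNDO TABLESPACE undo_004 ADD DATAFILE 'undo_004.ibu';

-- Enable automatic truncation of oversized undo tablespaces
SET GLOBAL innodb_undo_log_truncate = ON;

-- Truncate once an undo tablespace exceeds this size (1 GiB here)
SET GLOBAL innodb_max_undo_log_size = 1073741824;
```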

The issue happens randomly on different nodes; these are just the specifics of this event and one node. I would suspect that on Tuesday it will be a different node. We have been looking closely at CloudWatch disk metrics, but they do not seem saturated leading up to the event. We also watch via Grafana, since it captures finer-grained sampling, and sometimes we monitor iostat. We will certainly research the undo files topic and appreciate your recommendation. If you have any other thoughts, we would greatly appreciate them as well.

We have adjusted the undo tablespaces and also monitored disk I/O further. We also changed the 7 GP3 EBS volumes to 6000 IOPS and 250 MB/s of throughput, and are now monitoring various disk, InnoDB, and other Grafana metrics closely. Are there any other specific MySQL or Galera settings that might be worth studying in further detail, based on your in-depth experience troubleshooting these types of issues?
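For anyone following along, the specific counters we are now watching after these changes are along these lines (all standard wsrep status variables; the selection is ours):

```sql
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused',   -- fraction of time paused since FLUSH STATUS
   'wsrep_flow_control_sent',     -- flow-control messages this node sent
   'wsrep_flow_control_recv',     -- flow-control messages received
   'wsrep_local_recv_queue_avg',  -- average receive-queue depth
   'wsrep_cert_deps_distance');   -- potential parallelism of the workload
```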

Based on my experience with AWS, when you see stalls in MySQL it's usually disk related. Maybe you used up too many credits, or the disk needed to expand, and that small pause to expand the EBS volume caused a backup of writes which the system could not overcome.

I appreciate your help and thoughts. We have monitored the EBS volumes, which are GP3; GP3 is not a burstable volume type like GP2. We currently have these set at 6000 IOPS and 250 MB/s of throughput. We have been monitoring closely via Grafana and CloudWatch all last week and have not noticed high write latency, excessive IOPS, or throughput near the limits we raised them to. We are seeing that SST operations where the donor is in the same AZ seem to push the cluster into flow control; when we kill the SST, the flow control goes away. We throttled it to 100m via the [sst] rlimit option, which seems too aggressive. When the SST operation goes across AZs, it behaves fine.
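For context, the throttle we applied lives in my.cnf on the donor side, roughly like this (100m is the value mentioned above; the rate format follows the xtrabackup SST script's conventions):

```ini
[sst]
# Cap the SST transfer rate on the donor; 100m = 100 MB/s
rlimit=100m
```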

I do have a question on the setting gcs.max_throttle, which we have set well above the default: currently 0.75, previously 0.99. We are trying to understand this setting better, as it seems like it might be related to the SST overrun as well. Would this setting have any effect on controlling the pace of replication between the cluster nodes? The more we read, the more the setting confuses us, and it seems like it might be an impactful setting to look at.
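For reference, we set it through wsrep_provider_options in my.cnf, along these lines (0.75 is our current value; as we understand the Galera docs, the default is 0.25, and the setting is the fraction of the normal replication rate that the donor may be throttled down to during state transfer, so a higher value means less throttling):

```ini
[mysqld]
# gcs.max_throttle: how far replication may be throttled during state
# transfer, as a fraction of the normal rate (0.0 = stop replication).
wsrep_provider_options="gcs.max_throttle=0.75"
```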