We are running into a frequent and problematic issue with our Percona XtraDB Cluster, which consists of 4 nodes and an arbitrator running as EC2 instances in AWS, split across two availability zones in the N. Virginia region. We are running version 8.0.36-28.1.
Every several days, one of the nodes stalls and causes the cluster to stop responding. Essentially, the cluster goes into full flow control while the receive queue on the affected node builds. In the MySQL process list we consistently see tasks stuck in the "freeing items" state, and those appear to be the oldest queries in the list. The only way to recover seems to be to kill the impacted node. We are unable to kill those MySQL threads; we get an error like the one below: "ERROR 1095 (HY000): The query is in TOI/NBO and cannot be killed". Most other threads are stuck in "wsrep: replicating and certifying write set", and the InnoDB status from the previous incident looks like it is waiting on an exclusive lock.
The node usually has to do SST after this occurs. We don't have a core dump from when this is happening, or a repeatable test case to trigger the error. We also appear to have issues with the donor node at times when doing SST to recover the cluster. Is anyone else experiencing something like this? Is there any information we can collect that would help debug the issue?
| 11584386 | scrubbed | scrubbed_IP | scrubbed_dbname | Query | 249 | freeing items | INSERT INTO `scrubbed statement` | 0 | 0 |
mysql> kill 11584386
    -> ;
ERROR 1095 (HY000): The query is in TOI/NBO and cannot be killed
Do you have any ALTER table statements running? What other queries are running? What do your disks look like? CPU?
No ALTER TABLE statements were running; we run those kinds of DDL schema changes off hours to mitigate the risk of cluster stalls. There are some REVOKE/GRANT user-level permission statements that run from time to time, though. The process list had 168 rows in it at the time, and most of them at the time of the incident were in the "wsrep: replicating and certifying write set" state, related to INSERTs. There is an LVM cache pool set up for reads; it is configured as write-through for writes. The local instance storage also holds the binlogs and tmp table storage areas. The nodes are r6id.2xlarge, which is 8 vCPUs and 64 GB, and each node has an EBS GP3 volume set consisting of 7 disks for the MySQL data. There is about 864 GB of databases on the clustered nodes.
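For what it's worth, the state counts above came from a quick aggregation over the process list along these lines (just a sketch; INFORMATION_SCHEMA.PROCESSLIST would work equally well):

-- summarize how many threads are in each state
SELECT STATE, COUNT(*) AS threads
FROM performance_schema.processlist
GROUP BY STATE
ORDER BY threads DESC;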
We do leverage Grafana, of course. When all four nodes are taking load during the busy parts of the day, they are mostly around 15 to 35% CPU usage currently. Sometimes one of the nodes might go to 50% to 60%.
We just now had another event with the same "freeing items" signature. We were in solid flow control (no writes allowed) for about 9 minutes until I shut down MySQL on the offending node. I am curious what other kinds of information or logs might be useful for troubleshooting the issue and understanding whether there is some sort of software bug.
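So far, what we have been grabbing from the stuck node before restarting it is just the standard set below (shown as plain client commands in case there is something obvious we should add):

SHOW FULL PROCESSLIST;
SHOW ENGINE INNODB STATUS\G
SHOW GLOBAL STATUS LIKE 'wsrep%';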
Some MySQL log info:
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-05-24T15:00:36.832366Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 140346656147200 has waited at trx0undo.ic line 126 for 550 seconds the semaphore:
X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815
a writer (thread id 140305419962112) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file trx0undo.ic line 141
Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57
2024-05-24T15:00:36.832425Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 140306063247104 has waited at trx0types.h line 186 for 549 seconds the semaphore:
Mutex at 0x7fa4dc45e508, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 140346656147200
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 481696
--Thread 140346656147200 has waited at trx0undo.ic line 126 for 565 seconds the semaphore:
X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815
a writer (thread id 140305419962112) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file trx0undo.ic line 141
Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57
--Thread 140306063247104 has waited at trx0types.h line 186 for 564 seconds the semaphore:
Mutex at 0x7fa4dc45e508, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 140346656147200
OS WAIT ARRAY INFO: signal count 1120161
RW-shared spins 0, rounds 0, OS waits 0
RW-excl spins 0, rounds 0, OS waits 0
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 0.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx
The final entry in the log before MySQL finally shut down is:
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-05-24T15:01:38.832854Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 140346656147200 has waited at trx0undo.ic line 126 for 612 seconds the semaphore:
X-lock on RW-latch at 0x7f9c3851cdc8 created in file buf0buf.cc line 815
a writer (thread id 140305419962112) has reserved it in mode SX
number of readers 0, waiters flag 1, lock_word: 10000000
Last time read locked in file trx0undo.ic line 141
Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/fut0fut.ic line 57
2024-05-24T15:01:38.832901Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 140306063247104 has waited at trx0types.h line 186 for 611 seconds the semaphore:
You have UNDO contention. I would be watching this server for disk latency (await in iostat) because InnoDB is unable to write/purge UNDO fast enough on this specific server. Since this only happens on this server, something is odd/wrong with this server's disk.
I suggest adding more undo tablespaces and ensuring you have undo truncation enabled. By adding 2 or 3 more UNDO files, you might help with whatever disk issue is troubling this server.
https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html#innodb-add-undo-tablespaces
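A minimal sketch of what that might look like (the tablespace names, file names, and size threshold are placeholders to adapt to your environment):

-- list the current undo tablespaces and their state
SELECT TABLESPACE_NAME, FILE_NAME, STATE
FROM INFORMATION_SCHEMA.FILES
WHERE FILE_TYPE = 'UNDO LOG';

-- add a couple of extra undo tablespaces (names/paths are examples)
CREATE UNDO TABLESPACE undo_003 ADD DATAFILE 'undo_003.ibu';
CREATE UNDO TABLESPACE undo_004 ADD DATAFILE 'undo_004.ibu';

-- make sure automatic truncation is on and pick a reasonable size threshold
SET GLOBAL innodb_undo_log_truncate = ON;
SET GLOBAL innodb_max_undo_log_size = 2 * 1024 * 1024 * 1024;  -- 2 GB, adjust as needed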
The issue happens randomly on different nodes; these are just the specifics of this event and this one node. I would suspect that on Tuesday it will be a different node. We have been looking closely at CloudWatch disk metrics, but they do not seem saturated leading up to the event. We are also watching via Grafana since it captures finer-grained samples, and sometimes we monitor iostat. We will certainly research the undo files topic and appreciate your recommendation. If you have any other thoughts, we would greatly appreciate those as well.
We have adjusted the undo tablespaces and also monitored disk I/O further. We also changed the 7 GP3 EBS volumes to 6000 IOPS and 250 MB/s of throughput and are now monitoring various disk, InnoDB, and other Grafana metrics closely. Are there any other specific MySQL or Galera settings that might be worth studying in further detail based on your vast and in-depth experience troubleshooting these types of issues?
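For what it's worth, this is roughly the check we have been running to keep an eye on undo growth (a sketch; it just reads the standard INFORMATION_SCHEMA view):

-- current size of each undo tablespace in MB
SELECT NAME, STATE, ROUND(FILE_SIZE / 1024 / 1024) AS file_size_mb
FROM INFORMATION_SCHEMA.INNODB_TABLESPACES
WHERE SPACE_TYPE = 'Undo';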
Based on my experience with AWS, when you see stalls in MySQL, it's usually disk related. Maybe you used up too many credits, or the disk needed to expand and that small pause to expand the EBS volume caused a backup of writes which the system could not overcome.
I appreciate your help and thoughts. We have monitored the EBS volumes, which are GP3, and GP3 is not a burstable volume type like GP2. We currently have these set at 6000 IOPS and 250 MB/s of throughput. We have been monitoring closely via Grafana and CloudWatch all last week and have not noticed anything regarding high write latency, excessive IOPS, or throughput relative to the limits we raised them to. We are seeing that SST operations where the donor is in the same AZ seem to push the cluster into flow control; when we kill the SST, the flow control goes away. We throttled it to 100m via the [sst] rlimit option, which seems too aggressive. When the SST operation goes across AZs, it behaves fine.
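For context, the throttle is just the rlimit setting in the [sst] section of my.cnf on the donor (100M is the value we tried; it clearly needs tuning per environment):

[sst]
# cap the SST transfer rate; 100M was our test value
rlimit=100M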
I do have a question on the gcs.max_throttle setting, which we have set well above the default, currently at 0.75 (it was at 0.99). We are trying to understand this setting better, as it seems like it might be related to the SST overrun as well. Would this setting have any effect on controlling the pace of replication between the cluster nodes? The more we read, the more the setting confuses us, and it seems like it might be an impactful setting to look at.
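For reference, this is how it is applied on our nodes, via wsrep_provider_options in my.cnf (0.75 is our current value; our reading of the Galera documentation, which may be off, is that it bounds how much the replication rate can be throttled during state transfer rather than pacing normal replication):

[mysqld]
# other provider options omitted; default for gcs.max_throttle is 0.25
wsrep_provider_options="gcs.max_throttle=0.75"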
As an update: adding the undo logs and increasing the GP3 volume IOPS/throughput seems to have stabilized this particular issue quite a bit. We are still actively monitoring.
OK, we just had another event this morning after a month of stability with this issue. The event took place before most users logged in and began work, and the activity on the nodes was extremely low, less than 5% on all four of them. I still believe that 8.0.36 has a software bug in it related to these mutexes and undo log errors. We do have 5 undo logs, all of which can grow to 3 GB; at the time of the issue with "freeing items" and full-on flow control, all four nodes had only around 35 MB or so consumed in the undo logs. So clearly there were no disk latency issues or the like.
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-07-08T12:03:03.205970Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 126914862233152 has waited at buf0flu.cc line 1376 for 453 seconds the semaphore:
SX-lock on RW-latch at 0x73727a9ab908 created in file buf0buf.cc line 815
a writer (thread id 126887653004864) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.36-28/storage/innobase/include/trx0undo.ic line 126
2024-07-08T12:03:03.206027Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 126891728983616 has waited at trx0types.h line 186 for 361 seconds the semaphore:
Mutex at 0x7372d0461258, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 126937158035008
2024-07-08T12:03:03.206041Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 126893353825856 has waited at trx0types.h line 186 for 488 seconds the semaphore:
Mutex at 0x7372d0461258, Mutex UNDO_SPACE_RSEG created trx0rseg.cc:235, locked by 126937158035008
I do not mean to hijack your thread, but your issue feels like exactly what we are experiencing. This has happened only once so far, but the situation is very similar. Some logs from our instance:
2024-09-13T03:49:10.437056Z 5729634 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:10.414972Z 5758706 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:23.401202Z 5717458 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:10.437065Z 5733058 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:23.591376Z 5731994 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:25.686455Z 5723193 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:25.686444Z 5719559 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
2024-09-13T03:49:25.686463Z 5749364 [Warning] [MY-000000] [Galera] client_state: Unallowed state transition: 3 -> 4
................. SAME REPEATED ..........
2024-09-13T03:50:19.355000Z 0 [Warning] [MY-000000] [Server] Too many connections
2024-09-13T03:55:29.626616Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139757768062528 has waited at log0meb.cc line 1779 for 572 seconds the semaphore:
Mutex at 0x7f20ac326980, Mutex LOG_FILES created log0log.cc:643, locked by nobody
2024-09-13T04:01:26.280107Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139757539980864 has waited at log0meb.cc line 1779 for 816 seconds the semaphore:
Mutex at 0x7f20ac326980, Mutex LOG_FILES created log0log.cc:643, locked by 139761270818368
2024-09-13T04:07:01.785607Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139777834563136 has waited at log0meb.cc line 1779 for 1239 seconds the semaphore:
Mutex at 0x7f20ac326980, Mutex LOG_FILES created log0log.cc:643, locked by 139761270818368
2024-09-13T04:13:01.711747Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139757545506368 has waited at log0meb.cc line 1779 for 1616 seconds the semaphore:
Mutex at 0x7f20ac326980, Mutex LOG_FILES created log0log.cc:643, locked by 139761270818368
2024-09-13T04:19:09.941199Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
... That is the last line from the log ... The instance crashed with OOM ...
At the time of the incident the node had no significant load on it in terms of CPU, disk, or network (verified from CloudWatch metrics). So the "Too many connections" warning (which is also shown only once) was very weird. We also have separate monitoring that connects to the node and runs some queries; those scripts stopped responding after 03:44:02. It seems the node was completely unresponsive to external connections after 03:44.
Were you able to do a clean, safe shutdown and bring it back via IST? Curious.
Oh, I see you had an OOM error. We did not have that in our case.
We provided a crash dump of RAM and a ticket was created for our issue. [PXC-4498] - Percona JIRA
We were not able to cleanly shut it down; as you noticed, it OOMed after a while.
Then it had to do a full SST to get back. If we have to do an SST, we usually destroy the instances and bring in new ones in case it's a hardware issue on the AWS side (that has happened before).
Thanks for linking the ticket; we will keep track of it, as the cases may be related after all.