Cluster node stuck with long semaphore wait in line 789

Hi, today we’ve got another issue, on another cluster.

We have a temporary architecture for migrating from the old to the new infrastructure configuration, so we have a Percona XtraDB Cluster replicating data from another data center.

The topology is: another_dc_master -> [,, ]
GTID is disabled.

Replication got stuck, with these symptoms:

  1. constantly increasing replication lag; commits appeared to be applied, judging by the Exec_Master_Log_Pos counter, but far too slowly
  2. after a few minutes the mysql client could still connect, but could not execute any commands locally
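One way to confirm symptom 1 is to sample Exec_Master_Log_Pos (from SHOW REPLICA STATUS) twice and compute how fast the applier is actually advancing. A minimal sketch; the 10-second interval and the sample positions are made-up illustrations, and fetching the counter itself (via the mysql CLI or a connector) is left out:

```python
# Sketch: estimate applier progress from two Exec_Master_Log_Pos samples.
# The positions and the 10 s interval below are hypothetical examples;
# obtain real values from "SHOW REPLICA STATUS" on your node.

def apply_rate(pos_then: int, pos_now: int, seconds: float) -> float:
    """Bytes of binlog applied per second between two samples."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (pos_now - pos_then) / seconds

# Example: the counter moved from 1_000_000 to 1_000_512 in 10 seconds,
# i.e. ~51 bytes/s -- visually "moving", but effectively crawling.
print(apply_rate(1_000_000, 1_000_512, 10))  # -> 51.2
```

A near-zero rate while the counter still ticks matches the "applying, but likely too slow" observation above.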

The significant part of the log is below:

2022-02-21T16:22:11.640748Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 139623264241408 has waited at trx0rseg.ic line 50 for 241 seconds the semaphore:
X-lock on RW-latch at 0x7ef1b43a2690 created in file line 789
a writer (thread id 139623264831232) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.23-14/storage/innobase/include/trx0rseg.ic line 50
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

2022-02-21 16:22:20 0x7eef3d97d700 INNODB MONITOR OUTPUT
Per second averages calculated from the last 3 seconds
srv_master_thread loops: 2704167 srv_active, 0 srv_shutdown, 310 srv_idle
srv_master_thread log flush and writes: 0
OS WAIT ARRAY INFO: reservation count 7175967
--Thread 139564786370304 has waited at trx0types.h line 193 for 178 seconds the semaphore:
Mutex at 0x7efca9c1e378, Mutex UNDO_SPACE_RSEG created, lock var 1

The rest of the log consisted of the InnoDB engine status output, repeated over and over.
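When triaging repeated monitor dumps like the one above, it helps to pull out the longest semaphore waiters programmatically. A quick sketch, with the regex tailored to the "--Thread ... has waited at ... line ... for ... seconds" lines in this log (not a general-purpose parser):

```python
import re

# Matches lines such as:
# --Thread 139623264241408 has waited at trx0rseg.ic line 50 for 241 seconds ...
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for (\d+) seconds"
)

def longest_waits(log: str):
    """Return (seconds, thread_id, file, line) tuples, longest wait first."""
    waits = [(int(m[4]), m[1], m[2], int(m[3]))
             for m in WAIT_RE.finditer(log)]
    return sorted(waits, reverse=True)

# The two waiter lines taken from the log excerpt above:
sample = """\
--Thread 139623264241408 has waited at trx0rseg.ic line 50 for 241 seconds the semaphore:
--Thread 139564786370304 has waited at trx0types.h line 193 for 178 seconds the semaphore:
"""
print(longest_waits(sample))
```

Running this over all the repeated dumps makes it easy to see whether the same latch (here, the rollback-segment latch taken in trx0rseg.ic) is blocking everything each time.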


mysql  Ver 8.0.23-14.1 for Linux on x86_64 (Percona XtraDB Cluster (GPL), Release rel14, Revision d3b9a1d, WSREP version 26.4.3)

What could it be?

Thank you.


Additionally, I can share some more subjective observations:

We have several clusters with a similar configuration, but we observe this problem only in one specific case: on the cluster whose pxc1 node works as a REPLICA of an external MySQL master source and which has more than one node in the cluster (usually 3).

The problem is not observed after resetting the replicas on pxc1, and it is not observed when pxc1 runs in REPLICA mode but is standalone (only one active node in the cluster).
