InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server

We have a 3-node Percona XtraDB Cluster, all on Ubuntu 16.04:
mysql Ver 14.14 Distrib 5.7.19-17, for debian-linux-gnu (x86_64) using 6.3

Last night one node crashed and now refuses to rejoin the cluster. When running a wsrep-recover, the log says the following at the start:

2017-12-13T09:16:27.759954Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 1043375327479
2017-12-13T09:16:27.760085Z 0 [Note] InnoDB: Database was not shutdown normally!
2017-12-13T09:16:27.760090Z 0 [Note] InnoDB: Starting crash recovery.
2017-12-13T09:16:29.026740Z 0 [Note] InnoDB: Created parallel doublewrite buffer at /var/lib/mysql/xb_doublewrite, size 39321600 bytes
2017-12-13T09:16:29.854281Z 0 [Note] InnoDB: Transaction 3369902481 was in the XA prepared state.
2017-12-13T09:16:30.865089Z 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 0 row operations to undo
2017-12-13T09:16:30.865113Z 0 [Note] InnoDB: Trx id counter is 3369902848
2017-12-13T09:16:30.876541Z 0 [Note] InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
2017-12-13T09:16:32.311863Z 0 [Note] InnoDB: Apply batch completed
2017-12-13T09:16:32.311903Z 0 [Note] InnoDB: Last MySQL binlog file position 0 256986, file name binlog.000038
2017-12-13T09:25:40.681787Z 0 [Note] InnoDB: Starting in background the rollback of uncommitted transactions
2017-12-13T09:25:40.681825Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2017-12-13T09:25:40.681875Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2017-12-13T09:25:40.681849Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2017-12-13T09:25:40.681927Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2017-12-13T09:25:40.768967Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2017-12-13T09:25:40.774366Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2017-12-13T09:25:40.774388Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2017-12-13T09:25:40.774468Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 554744ms. The settings might not be optimal. (flushed=0, during the time.)
2017-12-13T09:25:40.790622Z 0 [Note] InnoDB: Waiting for purge to start
2017-12-13T09:29:41.799002Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139902404835072 has waited at row0row.cc line 1073 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4148f33d70 created in file buf0buf.cc line 1456

a writer (thread id 139902388049664) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: ffffffffdfffffff
Last time read locked in file row0row.cc line 1073
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/btr/btr0cur.cc line 1911
2017-12-13T09:29:41.799057Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139902388049664 has waited at btr0cur.cc line 1911 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f40c36f8db0 created in file buf0buf.cc line 1456
a writer (thread id 139902388049664) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffefffffff
Last time read locked in file row0row.cc line 1073
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/include/mtr0mtr.ic line 117
2017-12-13T09:29:41.799079Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139925455378240 has waited at fsp0fsp.cc line 1395 for 241.00 seconds the semaphore:
X-lock on RW-latch at 0x3f3f3f8 created in file fil0fil.cc line 1392
a writer (thread id 139902388049664) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file not yet reserved line 0
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/fsp/fsp0fsp.cc line 3373
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

There is a lot of debugging output after this… probably not that important, but I can share it if need be.

And finally, at the bottom, it says:
[ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.


END OF INNODB MONITOR OUTPUT

InnoDB: ###### Diagnostic info printed to the standard error stream
2017-12-13T09:41:33.810011Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2017-12-13 10:41:33 0x7f3d92a4e700 InnoDB: Assertion failure in thread 139902430013184 in file ut0ut.cc line 917
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
09:41:33 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=0
max_threads=501
thread_count=0
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 207511 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xed221b]
/usr/sbin/mysqld(handle_fatal_signal+0x499)[0x783ef9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f42ed7bf390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f42ecb78428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f42ecb7a02a]
/usr/sbin/mysqld[0x753da4]
/usr/sbin/mysqld(_ZN2ib5fatalD1Ev+0x145)[0x10b54c5]
/usr/sbin/mysqld(srv_error_monitor_thread+0xe61)[0x104e441]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f42ed7b56ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f42ecc4a3dd]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

It might be that it crashed during the night while it was running an ALTER TABLE.
But why it doesn’t want to rejoin the cluster is the bigger concern to me.

Any idea what is going on? Or how I can remedy it?

I know I can remove everything in /var/lib/mysql and do a full SST, but I’d like to avoid doing that…

Thanks for the help

Can you provide the error logs from the joiner and donor nodes so we can check further? Also, check whether any innobackup-* logs exist in the data directory of any of the nodes; they may give a hint as to what is stopping the node from rejoining the cluster.

mirfan
I haven’t got any innobackup-* logs that are of use I’m afraid. I think those logs are truncated after x time…

I got a second node that crashed this weekend, with a similar error. I’m down to 1 running node in my cluster. :frowning:

2017-12-15T23:25:36.869600Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139768016938752 has waited at dict0stats.cc line 3623 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x3d8d6a8 created in file dict0dict.cc line 1198
a writer (thread id 139768016938752) has reserved it in mode wait exclusive
number of readers 2, waiters flag 1, lock_word: fffffffffffffffe
Last time read locked in file row0purge.cc line 865
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/handler/ha_innodb.cc line 15330
2017-12-15T23:25:36.869736Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139733382022912 has waited at row0purge.cc line 865 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x3d8d6a8 created in file dict0dict.cc line 1198
a writer (thread id 139768016938752) has reserved it in mode wait exclusive
number of readers 2, waiters flag 1, lock_word: fffffffffffffffe
Last time read locked in file row0purge.cc line 865
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/handler/ha_innodb.cc line 15330
2017-12-15T23:25:36.869780Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 139733356844800 has waited at btr0cur.cc line 1911 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f1b002ce790 created in file buf0buf.cc line 1456
a writer (thread id 139733356844800) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffefffffff

I don’t understand what it is trying to tell me. It is waiting for a semaphore, but why? Is it waiting on a long-running query? On I/O?
I googled quite a bit and I’m finding bug reports on MySQL that mention a similar issue (MySQL Bugs: #73890: Error: semaphore wait has lasted > 600 seconds), but I’m not smart enough to understand what the logs are saying. (I’m guessing few people are, since that bug report is still open…)

Hmmm… the semaphore wait in the first comment is not in the same place as the semaphore wait in the last comment.

Looking at these lines in the logs:

Last time read locked in file row0purge.cc line 865
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/handler/ha_innodb.cc line 15330

The ha_innodb.cc lock is occurring in innodb_rename_table(), and it’s locking the data dictionary (dict_operation_lock).
The row0purge.cc code is also trying to obtain a lock on the data dictionary (dict_operation_lock).
The dict0stats.cc code is also trying to obtain dict_operation_lock (and is also trying to rename a table).

I’m not familiar with this code so I’m not sure what exactly is happening (maybe a deadlock??). Were you trying to rename a table when the crash occurred?
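If it helps to see the who-blocks-whom chain at a glance, the warnings are regular enough to parse mechanically. Here is a rough sketch (the regexes are written against the exact warning text quoted in this thread, not any official InnoDB format; `blocked_by` is just a hypothetical helper name):

```python
import re

# A "long semaphore wait" warning names the waiting thread, and a line or two
# later names the writer thread that currently holds (or has reserved) the latch.
WAITER = re.compile(r"Thread (\d+) has waited at (\S+) line (\d+)")
HOLDER = re.compile(r"a writer \(thread id (\d+)\) has reserved it")

def blocked_by(log):
    """Map each waiting thread id to (wait location, holder thread id)."""
    waits = {}
    current = None
    for line in log.splitlines():
        m = WAITER.search(line)
        if m:
            current = m.group(1)
            waits[current] = (m.group(2) + ":" + m.group(3), None)
            continue
        m = HOLDER.search(line)
        if m and current is not None:
            where, _ = waits[current]
            waits[current] = (where, m.group(1))
            current = None
    return waits

# Sample lines copied from the first crash report in this thread.
sample = """\
--Thread 139902404835072 has waited at row0row.cc line 1073 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7f4148f33d70 created in file buf0buf.cc line 1456
a writer (thread id 139902388049664) has reserved it in mode exclusive
--Thread 139902388049664 has waited at btr0cur.cc line 1911 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f40c36f8db0 created in file buf0buf.cc line 1456
a writer (thread id 139902388049664) has reserved it in mode wait exclusive
"""

for waiter, (where, holder) in blocked_by(sample).items():
    tag = " (itself: wait_ex)" if waiter == holder else ""
    print(waiter, "waits at", where, "blocked by", holder + tag)
```

Note that the second thread appears to "block itself": wait_ex means it has already reserved the latch exclusively and is waiting for the remaining reader to drain, so this looks like a stall rather than a classic two-way deadlock cycle.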

You may want to create a bug (with all the logs) on https://jira.percona.com so that this issue can be tracked.
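Also, before falling back to a full SST: the crash log itself links to the forcing-recovery documentation, and that route comes down to a single setting. A minimal my.cnf sketch (illustrative only; 1 is the most conservative level, and levels 4 and above can permanently corrupt data, so take a filesystem-level backup of the datadir first):

```ini
[mysqld]
# Illustrative: start InnoDB in a degraded recovery mode so the server can
# come up long enough to inspect or dump data. Remove this line afterwards.
innodb_force_recovery = 1
```

Keep in mind this only affects local InnoDB startup; whether the node can then rejoin the cluster via IST rather than SST is a separate question.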

Based on these lines:
2017-12-13T09:41:33.810011Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2017-12-13 10:41:33 0x7f3d92a4e700 InnoDB: Assertion failure in thread 139902430013184 in file ut0ut.cc line 917

it is possible that you are hitting the following upstream issue:

https://bugs.launchpad.net/percona-server/+bug/1653764