One node gets all queries stuck and collapse due to too many connections

jroctanio · November 9, 2017, 3:22am

Environment:
3 node percona cluster 5.7.19-29.22-3
Ubuntu 16.04 LTS

Every certain time, one of the nodes gets all queries stuck and collapse due to too many connections. It is not always the same node and it is not always the same queries.

The time between this events may vary, but we have found them between 6 hours and 10 hours.

The only solution for us is to kill the mysqld process and start it again because shutting down the server gracefully is not possible.

I attach relevant graphs during the event.

Anyone with the same problem or an idea of how this could be solved?

Kind Regards,
Jose

beautivile · November 9, 2017, 9:56am

You are not alone in this. There is another post ([url]https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/11056-too-many-connections-brings-down-cluster/page2[/url]) that may relate to your issues. We have ended up having to setup monitoring to alert as soon as connections go over 125% of normal so we can restart the process or kill the processes and resync.

jroctanio · November 10, 2017, 1:56am

At first we thought on a solution like yours but in the end we’ve had to replace pxc due to this problem as this is not acceptable in a production environment. We’ll keep an eye on these threads for an eventual solution of this bug.

Thank you for you answer beautivile .

krunalbauskar · November 13, 2017, 12:15am

I presume you are hitting following issue as symptoms are same

commit 35cee763032246f809a84b92cbe014dbfa081972
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Fri Oct 13 21:42:08 2017 +0530

PXC#877: PXC hangs on what looks like an internal deadlock
PXC protects wsrep_xxxx state variables through a native mutex
name LOCK_wsrep_thd.
This mutex should be held only while checking the state
and should be released immediately.
In the said bug, this mutex was being held during complete
cleanup. Other thread with conflicting lock, booted up
at same time and tried and invoked kill action on the said
thread. For invoking kill action it acquired transaction mutex
and LOCK_wsrep_thd in respective order. Said thread already
had LOCK_wsrep_thd and tried acquring transaction mutex
leading deadlock.

BTW the issue was introduced in 5.7.17 and present in 5.7.19. (Will be solved in 5.7.20).
If you are experiencing it on release before 5.7.17 or 5.6.x then this is not the issue.

krunalbauskar · November 13, 2017, 12:18am

jroctanio,

PXC will miss you. May be you can give it another try with 5.7.20.

Also, would love to hear any other issues you may face in future. Will try out best to resolve them.

szuta · November 28, 2017, 4:49pm

Hi krunalbauskar,
Do you have an idea under which circumstances this bug might occur? We ran into it in our production instance (5.7.19-29.22-3 on Ubuntu 16.04 LTS) with a 3 node cluster several times yesterday, with symptoms matching those of the original post. Our solution was to revert back to 5.7.16-27.19.

For reference, we’d see errors like these:

2017-11-28T01:45:45.840545Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140037937301248 has waited at lock0lock.cc line 4427 for 365.00 seconds the semaphore:
Mutex at 0x7f967c634068, Mutex LOCK_SYS created lock0lock.cc:451, lock var 1

2017-11-28T01:45:45.840563Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140038020478720 has waited at lock0lock.cc line 4427 for 365.00 seconds the semaphore:
Mutex at 0x7f967c634068, Mutex LOCK_SYS created lock0lock.cc:451, lock var 1

--Thread 140038339815168 has waited at ibuf0ibuf.cc line 2531 for 365.00 seconds the semaphore:
S-lock on RW-latch at 0x7f9663ac0870 created in file buf0buf.cc line 1456
a writer (thread id 140032994010880) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file ibuf0ibuf.cc line 4568
Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/ibuf/ibuf0ibuf.cc line 4568

I chose that excerpt because thread 140038339815168 was he main thread

Process ID=181443, Main thread ID=140038339815168, state: doing insert buffer merge
Number of rows inserted 39760179, updated 55926900, deleted 38526, read 801630968
693.77 inserts/s, 0.00 updates/s, 0.00 deletes/s, 702.83 reads/s

I initially hypothesized that an increased load on the system after running for about a week in production brought to light some concurrency issues. Specifically, after upgrading from 5.6 to 5.7 we removed the innodb_locks_unsafe_for_binlog setting. When the problem was occurring, pt-stalk showed information like this:

*************************** 811. row ***************************
trx_id: 38964706
trx_state: RUNNING
trx_started: 2017-11-27 22:41:45
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 2
trx_mysql_thread_id: 16964
trx_query: INSERT INTO mdl_sessions (state,sid,sessdata,userid,timemodified,timecreated,lastip,firstip) VALUES('0','xxxxxxx',NULL,'0','1511844105','1511844105','xxx.xxx.xxx.xxx','xxx.xxx.xxx.xxx')
trx_operation_state: NULL
trx_tables_in_use: 1
trx_tables_locked: 1
trx_lock_structs: 1
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 1
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
*************************** 1. row ***************************
lock_id: 38964255:7131:449:75
lock_trx_id: 38964255
lock_mode: X
lock_type: RECORD
lock_table: `moodle`.`mdl_sessions`
lock_index: PRIMARY
lock_space: 7131
lock_page: 449
lock_rec: 75
lock_data: 224357301

Since all of our writes (and most reads) go to one node, I wasn’t expecting the isolation level to cause problems. I still changed it to READ-COMMITTED via

transaction_isolation = READ-COMMITTED

Since I’ve made some changes before discovering this thread and reverting, our new configuration looks like this.

[sst]
inno-apply-opts = "--use-memory=40G"
inno-backup-opts = "--parallel=20 --stream=xbstream --compress --compress-threads=20"
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
datadir = /var/lib/mysql/data
innodb = FORCE
innodb_use_native_aio = 0
innodb_file_format = Barracuda
innodb_file_per_table = 1
innodb_large_prefix = ON
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
skip-character-set-client-handshake = 1
key-buffer-size = 32M
max-heap-table-size = 48M
max-allowed-packet = 16M
tmp-table-size = 48M
query-cache-type = 0
query-cache-size = 0
max-connections = 2000
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 4096
table-open-cache = 8192
innodb-log-files-in-group = 2
innodb-log-file-size = 2G
innodb-flush-log-at-trx-commit = 2
innodb-file-per-table = 1
innodb-buffer-pool-size = 216G
innodb-buffer-pool-instances = 8
innodb-io-capacity = 6000
innodb-read-io-threads = 15
innodb-write-io-threads = 15
innodb_file_format = Barracuda
innodb_doublewrite = 1
innodb_adaptive_hash_index = False
pxc_strict_mode = PERMISSIVE
transaction_isolation = READ-COMMITTED
innodb-thread-concurrency = 32
innodb_flush_log_at_timeout = 2
innodb-thread-concurrency = 0
log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 0
slow-query-log = 0
slow-query-log-file = /var/log/mysql/mysql-slow.log
long-query-time = 2
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_node_address = 172.22.xy.10
wsrep_cluster_name = prod
wsrep_node_name = sql.ppsb.learn.illinois.edu
wsrep_cluster_address = gcomm://172.22.xy.10,172.22.xx.10,172.22.xz.10
wsrep_provider_options = "gcache.size=4G;"
wsrep_retry_autocommit = 0
wsrep_sst_method = xtrabackup-v2
wsrep_slave_threads = 4
wsrep_sst_auth = sstuser:xxxxxxx
innodb-autoinc-lock-mode = 2
log-bin = mysql-binlog
binlog_format = ROW
expire-logs-days = 2
sync-binlog = 1
innodb-flush-neighbors = 0
innodb_checksum_algorithm = crc32
innodb_log_group_home_dir = /var/lib/mysql/innodb-logs
server-id = 7
innodb_numa_interleave = 0
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4

Initially we didn’t have a double buffer (running on zfs), pxc_strict_mode was set to default, and innodb_numa_interleave was set to 1.

Our servers are PowerEdge R630’s with 256GB of RAM and an SSD for /var/lib/mysql, with /var/lib/mysql/data having a zfs recordsize of 16k and /var/lib/mysq/innodb-logs having a recordsize of 128k. Each server has 2 Intel Xeon E5-2630 v4 processors.

krunalbauskar · December 4, 2017, 2:12am

I presume you are hitting the same issue (unfortunately it was due to missing mutex-unlock). There is another issue that is related to innodb-thread-concurrency but I already see you set it to 0 after setting it to 32.
innodb-thread-concurrency = 32 innodb_flush_log_at_timeout = 2 innodb-thread-concurrency = 0

szuta · December 7, 2017, 12:16pm

Oops! My bad. You’re right that my current configuration specifies both in that order. It sounds like in order to prevent the bug from occurring in 5.7.16 I should have it set to 0?

dmitry.pronyaev · December 17, 2017, 2:02am

Hello! We have the same issue on Debian 9.2 with percona-xtradb-cluster 5.7.19.22-3 compilled with the fix of [url][PXC-847] Cluster completely locks up with "too many connections" log message - Percona JIRA . It work fine about a week and we had “Too many connections” ones again (without any high load, at night). How can we solve this problem?

Topic		Replies	Views
Cluster hangs, too many connections. Processes in 'wsrep in pre-commit stage' state. Percona XtraDB Cluster 5.x	9	10283	December 2, 2015
Deadlocks hang MySQL Cluster Node Percona XtraDB Cluster 5.x	1	770	October 14, 2018
Percona XtraDB Cluster stucks when softfailed node returns to cluster Percona XtraDB Cluster 8.x mysql , percona , closed-no-reply	0	759	August 22, 2022
Too many connections brings down cluster Percona XtraDB Cluster 5.x	34	11176	September 3, 2024
Recurring locks and slowdowns on a Percona XtraDB Cluster Percona XtraDB Cluster 5.x	0	701	September 6, 2013

One node gets all queries stuck and collapse due to too many connections

Related topics