Cluster hangs, too many connections. Processes in 'wsrep in pre-commit stage' state.

Hi,

I am running 5.6.15-56-log Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061 on Fedora 20. I have three nodes in the cluster happily doing their thing for the most part, yet when we start experiencing high traffic the cluster will start locking up.

The connection limit will be reached quickly with most processes in the list showing ‘wsrep in pre-commit stage’. The queries are all INSERT and UPDATE on the same table (which has a primary key).

The logs don’t show anything of interest other than ‘[Warning] Too many connections’.

I have set gcs.fc_limit to 1000, which has helped reduce the number of times the cluster locks up; however, I cannot eliminate the problem completely.
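For anyone hitting the same thing, my understanding is that gcs.fc_limit is changed through wsrep_provider_options, roughly like this (treat it as a sketch and verify against the Galera documentation):

SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=1000'; -- raise the flow-control limit on this node
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'; -- confirm the new limit appears in the option string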

Other threads have suggested checking:

SHOW STATUS LIKE 'Threads%';

SELECT substring_index(host, ':', 1) AS host_name, state, count(*) FROM information_schema.processlist GROUP BY state, host_name;

Unfortunately I haven't been able to execute them while the problem is occurring yet.
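If it locks up again, I'm also planning to grab these wsrep counters at the same time (standard Galera status variables, as far as I understand them; corrections welcome):

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- what each node thinks it is doing (Synced, Donor, etc.)
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue'; -- writesets waiting to be applied on this node
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'; -- fraction of time replication was paused by flow control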

Before the cluster we had a Master-Slave setup in place rather than Master-Master. If this problem cannot be addressed, is there an easy way to revert to a Master-Slave setup?

Thanks.

Hi sjregan, did you find any resolution to this? I am getting the same error on my production XtraDB cluster when we try to alter a table with 500K rows.
Here are the installed package versions:
||/ Name Version Description
+++-=============================================-===========================================-======================================================================================================
un percona-server-client-5.1 (no description available)
un percona-server-client-5.5 (no description available)
un percona-server-common-5.1 (no description available)
un percona-server-common-5.5 (no description available)
un percona-server-server-5.1 (no description available)
un percona-server-server-5.5 (no description available)
ii percona-toolkit 2.2.7 Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.8-733-1.precise Open source backup tool for InnoDB and XtraDB
un percona-xtradb-client-5.0 (no description available)
ii percona-xtradb-cluster-client-5.5 5.5.34-25.9-607.precise Percona Server database client binaries
ii percona-xtradb-cluster-common-5.5 5.5.34-25.9-607.precise Percona Server database common files (e.g. /etc/mysql/my.cnf)
un percona-xtradb-cluster-galera (no description available)
ii percona-xtradb-cluster-galera-2.x 163.precise Galera components of Percona XtraDB Cluster
un percona-xtradb-cluster-galera-25 (no description available)
ii percona-xtradb-cluster-server-5.5 5.5.34-25.9-607.precise Percona Server database server binaries
un percona-xtradb-server-5.0 (no description available)

I am running on a VM with 4 GB of RAM.

Increasing gcs.fc_limit is the correct workaround, but setting it to 1000 seems to be too much; its default is 16. You should also check disk I/O latency and review hardware settings which might need to be tuned for better performance.

Hi jrivera, thanks for the response, but these are machines in the cloud, so I am not sure which hardware settings I can change, or how. I am attaching the plots from our Nagios graphs that show CPU usage, disk I/O, and memory consumption respectively. Do you see anything standing out?

[attachment: photoid=35064]

Well, the photo doesn't seem to upload at the right size, and I'm not sure how else to send you the image. All the stats seem quite low: disk I/O averages about 150, CPU idle is quite high as well, memory used for active data is about 67%, and total used is about 80%, so I am not sure what could be contributing to this slowness in the cluster.

The article here [url]http://www.percona.com/blog/2013/05/02/galera-flow-control-in-percona-xtradb-cluster-for-mysql/[/url] mentions three parameters: gcs.fc_limit, gcs.fc_master_slave, and gcs.fc_factor. Do I need to adjust all of them? Is it safe to let the cluster lag, and will that eliminate the “wsrep in pre-commit stage” messages?
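Just to check my own understanding, I assume they would be applied roughly like this, either at runtime as below or in the wsrep_provider_options line in my.cnf (the numbers are placeholders from my reading of that article, not recommendations):

SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=500; gcs.fc_factor=0.8'; -- placeholder values: allow a longer replication queue and more lag before pausing writers
-- gcs.fc_master_slave=YES is, as far as I can tell, meant to go in the wsrep_provider_options string at startup rather than be flipped at runtime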

Was this problem solved? What’s the current status?

No, it wasn't solved, but we have just upgraded the DB cluster to 5.5.41. I will test tomorrow whether that helped.

Can't wait to hear back from you, cheers man!

I am seeing the exact same issue as the OP: cluster replication becomes paused, usually with 4 or 5 MySQL processes in the “wsrep in pre-commit stage” state. While stuck like this, all writes to the cluster are blocked and connections build up until the limit is reached.

OS is CentOS 7.1 with current updates.

I currently have the following Percona RPMs installed: [INDENT]Percona-XtraDB-Cluster-client-56-5.6.26-25.12.1.el7.x86_64
percona-xtrabackup-2.3.2-1.el7.x86_64
Percona-XtraDB-Cluster-garbd-3-3.12.2-1.rhel7.x86_64
Percona-XtraDB-Cluster-full-56-5.6.26-25.12.1.el7.x86_64
percona-toolkit-2.2.11-1.noarch
Percona-XtraDB-Cluster-shared-56-5.6.26-25.12.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.12.2-1.rhel7.x86_64
Percona-XtraDB-Cluster-galera-3-debuginfo-3.12.2-1.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.26-25.12.1.el7.x86_64
Percona-XtraDB-Cluster-56-debuginfo-5.6.26-25.12.1.el7.x86_64
Percona-XtraDB-Cluster-test-56-5.6.26-25.12.1.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.26-25.12.1.el7.x86_64[/INDENT] I had zero problems with the setup until yesterday, when this started occurring out of the blue with no known changes to the setup. I generally use the cluster in a way that one particular node always receives the traffic from clients, and the 2 other nodes are backups. I have noticed that while the primary is hung with the “wsrep in pre-commit stage” processes, one of the other nodes will have one of its CPUs pinned at 100%. This makes sense: that node is too busy to keep up with replication, so it signals flow control and pauses the flow. What I can't figure out is what exactly this node is doing that has the CPU pinned, as nothing talks directly to the backup nodes. They should be doing nothing but applying what they're getting via replication.
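For what it's worth, these are the flow-control counters I've started checking on each node to try to see which one is asking the cluster to pause (standard wsrep status variables, assuming I have the names right):

SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent'; -- flow-control pause requests this node has sent to the cluster
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_recv'; -- pause requests this node has received from others
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg'; -- average depth of the apply queue; grows on a node that can't keep up
SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance'; -- rough measure of how much parallel apply is possible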

We were going to move the MySQL VMs to some new hosts with SSD disks, so this issue accelerated that plan, but that did not help this problem. Reducing the cluster to 2 nodes helped a lot; the block is still occurring but limited to around a max of 16 seconds at a time, and usually less than that. Here's some vmstat output from the backup server during a locked-up period, at 1 second intervals:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 184620 140224 13906416 0 0 0 3284 5098 2277 0 1 98 1 0
0 0 0 184636 140224 13906432 0 0 0 2044 4687 1562 0 1 98 1 0
0 0 0 184620 140224 13906468 0 0 0 1824 4683 1408 0 0 99 1 0
0 0 0 182480 140224 13908572 0 0 0 2768 5058 2213 0 1 98 1 0
0 0 0 182040 140224 13908632 0 0 16 925 4571 1156 1 0 99 0 0
1 1 0 181984 140224 13908644 0 0 0 2932 5015 2106 2 1 97 1 0
1 0 0 181984 140224 13908716 0 0 16 2172 4815 1592 1 0 99 1 0
1 0 0 200520 140224 13887852 0 0 0 1596 5606 2485 18 1 80 1 0
1 0 0 198596 140224 13887824 0 0 0 540 6242 2635 25 1 74 0 0
1 0 0 198596 140224 13887888 0 0 0 1024 4488 756 25 0 75 0 0
1 0 0 198596 140224 13887984 0 0 0 504 4186 291 25 0 75 0 0
1 0 0 198348 140224 13888148 0 0 0 4956 6483 4044 23 1 74 1 0
1 0 0 198224 140224 13888212 0 0 0 672 4214 340 25 0 75 0 0
2 0 0 198224 140224 13888692 0 0 416 1008 4278 476 25 0 75 0 0
1 0 0 196052 140224 13888788 0 0 0 696 4226 351 25 0 75 0 0
1 0 0 196052 140224 13888820 0 0 0 468 4190 303 25 0 75 0 0
1 0 0 196052 140224 13888904 0 0 16 676 5026 2155 26 1 72 0 0
1 0 0 196052 140224 13888936 0 0 0 536 4198 352 25 0 75 0 0
1 0 0 196052 140224 13888968 0 0 0 472 4196 316 25 0 75 0 0
1 1 0 195432 140224 13889096 0 0 16 24276 17165 20542 15 4 73 7 0
1 1 0 195308 140224 13889604 0 0 80 26192 17103 23298 3 4 84 9 0
0 0 0 193200 140224 13891760 0 0 0 19232 13126 17738 2 3 89 6 0
0 0 0 190780 140224 13892060 0 0 16 18900 11679 16746 2 3 89 6 0
0 0 0 190656 140224 13892232 0 0 0 19852 10761 14649 3 3 89 5 0
0 1 0 190408 140224 13892492 0 0 32 18364 11452 16219 2 3 90 6 0
1 1 0 190160 140224 13892692 0 0 0 17344 11526 15923 2 3 90 5 0
0 0 0 189896 140224 13892872 0 0 0 17187 10913 15151 1 3 91 5 0
0 1 0 189772 140224 13893064 0 0 0 8056 7656 7606 1 1 95 2 0
0 0 0 189648 140224 13893144 0 0 0 9432 9077 9061 1 2 94 3 0
0 0 0 188780 140224 13893240 0 0 0 9840 8704 9374 1 2 94 3 0
0 0 0 188772 140224 13893356 0 0 0 10252 8135 9078 1 2 94 3 0
0 0 0 186912 140224 13895516 0 0 16 9584 7929 8653 1 2 94 3 0
0 0 0 186912 140224 13895568 0 0 0 2408 4949 1992 0 0 99 1 0
0 0 0 186912 140224 13895580 0 0 0 928 4414 775 0 0 99 0 0
0 0 0 186912 140224 13895588 0 0 0 1020 4530 1063 0 0 99 0 0

The server has 4 cores; you can see about a quarter of the way down where CPU usage jumps to ~25% (1 pinned core). During that time the cluster is essentially locked for writes.

I have tried tweaking the gcs.fc_limit value to something much higher than 16 to no avail, although I have been adjusting it dynamically.
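In case I'm doing it wrong, the way I've been applying and verifying the change on the fly is along these lines (the value shown is just one of the ones I tried):

SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=256'; -- one of the values I experimented with
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'; -- check that the new limit shows up in the option string
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'; -- watch whether the pause fraction drops afterwards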

Any suggestions would be much appreciated, thanks

Jon