1 master read/write node, with the others standing by as backups behind HAProxy
The issue happens occasionally at random (sometimes two days in a row, sometimes after a few weeks).
The active node just stops processing queries and the process list keeps growing. There's absolutely nothing in the error log (we have warnings enabled as well). By nothing I mean
that even the usual warnings ("update was ineffective" and the like) are absent for that period.
After that we issue a node restart and it starts up normally. I have zero log traces to analyze (at least in the places I know about, such as the mysqld error log, syslog, etc.).
It's not a memory/CPU issue; I'm monitoring the server around the clock and there's no anomaly in the performance graphs.
What's even worse, the clustercheck script sees the node as fully synced and operational. I can connect to the instance and issue a query; it gets added to the process list but waits indefinitely, as do all the other queries. So we have backup nodes ready to kick in, but HAProxy never detects the outage.
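For what it's worth, a check that would actually catch this probably has to perform a real write and let the calling script enforce a timeout, something along these lines (the monitor.heartbeat table is just a hypothetical example, we don't have one):

-- hypothetical heartbeat table; a real write like this would hang on a stalled node,
-- so the external check script has to enforce its own timeout and mark the node down
CREATE TABLE IF NOT EXISTS monitor.heartbeat (id INT PRIMARY KEY, ts TIMESTAMP) ENGINE=InnoDB;
REPLACE INTO monitor.heartbeat (id, ts) VALUES (1, NOW());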
The my.cnf buffer and memory parameters are the same as we had on the standalone Percona Server that never hung, and this server even has an extra 8 GB of RAM over the standalone instance (40 GB total).
I have upped gcache.size from 128M to 4G. I have also raised wsrep_slave_threads from 1 to 32.
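In my.cnf that roughly looks like this (other options omitted):

[mysqld]
# bigger write-set cache, so a briefly disconnected node can rejoin via IST instead of a full SST
wsrep_provider_options = "gcache.size=4G"
# apply replicated write-sets with more parallel applier threads
wsrep_slave_threads    = 32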
What worries me is that wsrep_cert_deps_distance never goes above 1 (which, if I understand it correctly, means there is little opportunity for parallel apply anyway).
I noticed (during yesterday's hangup) that the oldest query among those piling up was a REPLACE INTO on a MyISAM table (we have two MyISAM tables altogether).
The replication queue was empty (0) and cluster replication was not stalled.
Could this be the issue? Most of what I've read says that MyISAM replication is unreliable (in terms of consistency), but nothing about it hanging a node.
I could switch MyISAM replication off in my.cnf, but I'm unwilling to do so until I'm certain that this is what's causing the lockups.
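If we do end up disabling it, I assume it's just this on every node:

[mysqld]
# stop replicating DML on MyISAM tables through Galera
wsrep_replicate_myisam = 0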
Can you paste a sample SHOW PROCESSLIST from a moment when a node is hung? What are the queries waiting for? Also, I wonder if the same issue happens after converting those tables to InnoDB.
I will post the full processlist as soon as it happens again. We have upgraded the nodes to 5.33, no hangups yet.
I don't know what the queries are waiting for; they seem to be in various states of execution. There are also a number of processes in the 'wsrep in pre-commit stage' state.
You can execute SHOW STATUS, SHOW VARIABLES and the like, but no actual DB query ever gets executed.
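Since those still work, next time it hangs I can also pull a breakdown of thread states straight from information_schema, something like:

SELECT STATE, COUNT(*) AS threads
FROM information_schema.PROCESSLIST
GROUP BY STATE
ORDER BY threads DESC;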
As for the InnoDB conversion, we have to keep one table on MyISAM because of the full-text search. We could turn MyISAM replication off on the cluster, but I want to make sure it is the root cause of the hangups first.
We had two hangs yesterday. I haven't been able to get a processlist since I wasn't at a PC (I did the restart via a webmin script from my smartphone), but I did find this on every node except the master node (the others are backup/standby nodes):
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
(the same message repeats over and over)
The master node has absolutely nothing relevant logged until the restart. The entries above appeared during both hangs. What do these messages indicate? xx.xx.xx.xx is the IP of the master node.
You asked me for the full processlist when the cluster hangs... can I PM it to you? I don't want to have to obfuscate the queries, and I'd rather not see our DB queries show up in a Google search.
We get attacks now and then.
Sorry for long delay, I forgot about this thread.
This message:
WSREP: (xxx, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid xxx is blacklisted, skipping
is usually seen on nodes trying to connect to the cluster when no working node is in the Primary state. So it looks like the master node went into a non-Primary state after this hangup.
To allow them to re-join, you need to force the primary state on the surviving node like this:
SET GLOBAL wsrep_provider_options="pc.bootstrap=1";
and then starting mysql on the other nodes should succeed.
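Once they are back, you can verify that every node again sees a Primary component with something like:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';  -- should report 'Primary'
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';    -- should match the expected node count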
At the time of the hangup, the primary node (read/write; the others are only for backup) becomes unresponsive, and I see the "blacklisted" messages on the other nodes, while the primary node has nothing in its error log and never recovers until it is restarted.
When I restart the primary node only, the rest of the cluster quickly syncs with the primary.
This is the wsrep status dump from the time of the hangup (note that you can connect to the primary node and run status/variables queries, but no query against actual tables succeeds; they just wait indefinitely until the number of connections is saturated):