We have a following setup:
- 4 Percona cluster nodes running 5.5.31-188.8.131.528. build x64
- 1 master r/w node with others standing as backup over HAProxy
The issue happens occasionally at random (sometimes two days in a row, sometimes after a few weeks).
The active node just stops processing queries and process list grows. There’s absolutely nothing in the error log (we have warnings enabled as well). By nothing I mean
there’s no usual warnings in log in that period as well (as in usual I mean “update was ineffective” sort of thing).
After that we issue a node restart and then it starts up normally. I have zero log trace that I can analyze (at least those I know about, such as mysqld error log, syslog etc.)
It’s not an memory/CPU issue, I’m monitoring server 0/24 for performance and there’s no anomaly in graphs.
What’s even worse, clustercheck script sees the node as fully synced and operational. I can connect to the instance, issue a query and it gets added to process list, but waits indefinitely as do other queries issued. So we have backup nodes ready to kick in but HAProxy never detects the outage.
The my.cnf buffer and memory parameters are the same as we had with standalone Percona Server that never hung and this server has even extra 8G of RAM over the standalone instance (40GB total).
Any ideas where to start the troubleshooting?