I have 5 dedicated servers (identical machines: 32 cores, 96GB of RAM, SSD drives in RAID and gigabit ethernet link) configured with Percona XtraDB Cluster.
There’s a recurring problem that severely slows down the cluster, usually for about 30 to 60 seconds, though sometimes it stays stuck for 5-10 minutes.
The system is used for a busy network of websites and I use mysql-proxy on each webserver to load balance the traffic to the database.
The issue does not occur when only one node is enabled. With every node I add, the problem gets worse (the queries stay slowed or locked up for longer), until it becomes unbearable with 4 nodes active; at that point the cluster can no longer recover on its own.
Here are the detailed symptoms:
- Every 5 to 15 minutes all of the write queries (INSERTs/UPDATEs) become stuck in the queue of every node. Some of the queries are dispatched after 45-50 seconds, while others are completely stalled.
- Most of the time, after 30 to 60 seconds the cluster is somehow able to catch up and it quickly dispatches the queries in a matter of 1-2 seconds.
- Sometimes the cluster cannot clear these stuck queries on its own, and I have to manually disable the busiest websites to lower the load. After about 30 seconds with next to no load, the cluster is again able to dispatch all the queries.
- The error logs are usually clean, with no error messages before or after the slowdown occurs. Rarely I get something like this (maybe 1 time out of 10):
130906 9:53:27 [Note] WSREP: (3f3abd42-15bc-11e3-b38b-2e049b972e3b, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://IPOFONEOFTHENODES
130906 9:53:27 [Note] WSREP: (3f3abd42-15bc-11e3-b38b-2e049b972e3b, 'tcp://0.0.0.0:4567') turning message relay requesting off
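In case it helps anyone line these messages up with the slowdowns, here is a minimal sketch of how the relay on/off notes could be pulled out of the error log and paired up. The function names (`parse_relay_events`, `relay_windows`) are hypothetical helpers of mine, and the timestamp layout is assumed from the two sample lines above, so treat it as a starting point rather than a finished tool.

```python
import re
from datetime import datetime

# Matches the two relay messages quoted above. The date/time layout
# (YYMMDD H:MM:SS) is an assumption based on those sample lines.
RELAY_RE = re.compile(
    r"^(\d{6})\s+(\d{1,2}:\d{2}:\d{2}) \[Note\] WSREP: .*"
    r"turning message relay requesting (on|off)"
)


def parse_relay_events(lines):
    """Return (timestamp, state) tuples for every relay on/off message."""
    events = []
    for line in lines:
        match = RELAY_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{match.group(1)} {match.group(2)}", "%y%m%d %H:%M:%S"
            )
            events.append((stamp, match.group(3)))
    return events


def relay_windows(events):
    """Pair each 'on' with the next 'off'; return (start, seconds) tuples."""
    windows = []
    started = None
    for stamp, state in events:
        if state == "on":
            started = stamp
        elif started is not None:
            windows.append((started, (stamp - started).total_seconds()))
            started = None
    return windows
```

Feeding it the lines of the error log (for example from `open(path)`) gives the start time and length of each relay window, which can then be compared with the times the write queries got stuck.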
I usually have a wsrep_cert_deps_distance of about 400 under normal load. As soon as the slowdown begins, wsrep_cert_deps_distance slowly climbs into the 2,000-3,000 range (when it hits the 3,000 mark I have to manually disable the application, or the cluster cannot recover by itself).
Monitoring with mytop and atop, I see no high load on the server or in the mysql process. The CPU usage is always reasonably low (about 25% of the maximum) both during normal operation and during the slowdowns. I/O usage is fine, plenty of RAM free, vmcom under the limit.
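To make those numbers concrete, here is a rough sketch of how sampled values of wsrep_cert_deps_distance could be classified. The three thresholds come straight from the figures above; the function names and the idea of sampling the value as (seconds, value) pairs are my own assumptions for illustration.

```python
# Thresholds taken from the figures in this post; they are specific to my
# workload and would need adjusting elsewhere.
NORMAL_DISTANCE = 400     # typical value under normal load
WARN_DISTANCE = 2000      # lower end of the range seen during slowdowns
CRITICAL_DISTANCE = 3000  # point where I have to disable the application


def classify_deps_distance(value):
    """Label a single wsrep_cert_deps_distance reading."""
    if value >= CRITICAL_DISTANCE:
        return "critical"
    if value >= WARN_DISTANCE:
        return "warning"
    return "normal"


def first_crossing(samples, threshold):
    """Return the time of the first (seconds, value) sample at or above
    the threshold, or None if the threshold is never reached."""
    for seconds, value in samples:
        if value >= threshold:
            return seconds
    return None
```

The readings themselves would come from something like `SHOW GLOBAL STATUS LIKE 'wsrep_cert_deps_distance'` run at a fixed interval on each node.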
I use myq_status to monitor the cluster on every node in realtime and this is what happens:
- The wsrep_flow_control_paused variable is always at 0.0 even when the slowdowns occur.
- No wsrep_local_bf_aborts or wsrep_local_cert_failures occur.
- On every node the outbound replication is usually 0 and increases up to 200-300 when the slowdown occurs.
- The inbound replication is always 0 on every node (rarely 1, but it happens even under normal load). This puzzles me as apparently there’s no slow node in the cluster.
- About 10-15 seconds after the slowdown begins, the ops and bytes sent and received drop to 0 on every node. They stay at 0 for one or two seconds, then a burst of operations and bytes follows the next second, coupled with a high number of “oooe” (out-of-order execution) operations.
This repeats every few seconds until the server goes back to normal.
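The stall pattern described above (ops and bytes at 0 while the outbound queue is non-empty) can be sketched as a check on two consecutive status snapshots. I am assuming here that the relevant counters are wsrep_replicated, wsrep_received, wsrep_replicated_bytes, wsrep_received_bytes and wsrep_local_send_queue, and that each snapshot is a dict built from the rows of `SHOW GLOBAL STATUS LIKE 'wsrep_%'`. The helper names `rates` and `is_stalled` are hypothetical.

```python
# Cumulative counters assumed to sit behind the "ops" and "bytes" columns.
COUNTERS = (
    "wsrep_replicated",
    "wsrep_received",
    "wsrep_replicated_bytes",
    "wsrep_received_bytes",
)


def rates(prev, curr, interval_s=1.0):
    """Per-second rates between two status snapshots.

    Each snapshot maps a status variable name to its value; values may be
    strings, as returned by the MySQL client, so they are cast to float.
    """
    return {
        name: (float(curr[name]) - float(prev[name])) / interval_s
        for name in COUNTERS
    }


def is_stalled(prev, curr, interval_s=1.0):
    """True when nothing was sent or received during the interval even
    though writesets are still waiting in the local send queue."""
    idle = all(rate == 0 for rate in rates(prev, curr, interval_s).values())
    queued = float(curr.get("wsrep_local_send_queue", 0)) > 0
    return idle and queued
```

Run once per second on every node, a check like this would timestamp each one-to-two-second gap so the gaps can be compared across nodes.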
Here are the details of the tests I performed to try and troubleshoot the issue (without any luck…):
- I checked the network first: the servers are in the same rack with a dedicated gigabit network and everything seems to be working fine, with no packet loss or other apparent network issues.
- I checked the bandwidth usage: every node uses an average of 30 to 100 Mbps (megabits per second) of bandwidth. I check in realtime with “iftop”, and while the problem is occurring the bandwidth usage is usually lower than average (15 to 30 Mbps). While syncing a node, bandwidth goes up to 800-900 Mbps (as it should), so I don’t think the network is saturated.
- I tried every combination of nodes to make sure no single node was affecting everything else: the problem is always present no matter which node I disable or use. The problem is always related to the number of nodes active at the same time.
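For reference, the bandwidth figures above work out as follows against the gigabit link. This is only the arithmetic, with the link speed assumed to be a nominal 1000 Mbps; `utilization` is a hypothetical helper name.

```python
LINK_MBPS = 1000  # nominal capacity of the gigabit link (assumed)

# Observed ranges from the iftop measurements in this post, in Mbps.
OBSERVED_MBPS = {
    "normal load": (30, 100),
    "during slowdown": (15, 30),
    "node sync": (800, 900),
}


def utilization(mbps, link_mbps=LINK_MBPS):
    """Fraction of the link used at the given throughput."""
    return mbps / link_mbps


def utilization_ranges(observed=OBSERVED_MBPS, link_mbps=LINK_MBPS):
    """Map each scenario to its (low, high) link utilization fractions."""
    return {
        label: (utilization(low, link_mbps), utilization(high, link_mbps))
        for label, (low, high) in observed.items()
    }
```

So normal traffic uses roughly 3-10% of the link and the slowdowns even less, while a node sync reaches 80-90%, which is why I don’t believe raw bandwidth is the bottleneck.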
Has anybody ever encountered a similar issue?
Thanks in advance!