Urgent Problems with Our Percona XtraDB Cluster

Hi,

In the last week we have had four outages on our production Percona XtraDB Cluster. The cluster is a five-node configuration. All of a sudden, writes stop working and connections spike on one node.

This causes a cluster-wide outage. We have taken the faulty node out of the cluster and stopped its networking to help troubleshoot. We can't log in to MySQL on that node, which is similar to what we experienced while it was still in the cluster. The cluster is now working, but we need to troubleshoot this node.

Could someone please help us troubleshoot? We had the issue at 9:53pm; the fix was to stop networking on the faulty node and take it out of the cluster.

Thanks

Hello abarot, I asked the technical team here to take a look at your graph. In the first instance, the enormous spike in connections can be a symptom of a DoS attack.

If that's not the case, they feel it's likely that much more investigation would be needed than we could hope to offer here on the open source forum.

You are welcome to drop me a line if you could use some further assistance: lorraine.pocklington@percona.com

Thanks Lorraine, I have sent you an email. If you could please help us, that would be greatly appreciated.

Attached are graphs of our two previous outages, on Thursday the 2nd of August at 8am and Friday the 3rd of August at 4pm.

The DB load appears to drop and then the connections shoot up. The spike in connections seems to be a result of the DB not responding?

Hi,

There are at least three things you should collect from all nodes when such a problem starts.
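As an illustration only - the specific commands below are an assumption on my part, not necessarily the exact items meant here - diagnostics commonly gathered from a stalled node include the full processlist, the InnoDB engine status, and the global status counters:

# Example collection commands (placeholder credentials); run on every node
# as soon as the problem starts, ideally a few times in a row.
mysql -uroot -p -e "SHOW FULL PROCESSLIST\G"     > processlist_$(hostname)_$(date +%s).txt
mysql -uroot -p -e "SHOW ENGINE INNODB STATUS\G" > innodb_status_$(hostname)_$(date +%s).txt
mysql -uroot -p -e "SHOW GLOBAL STATUS"          > global_status_$(hostname)_$(date +%s).txt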

Thanks Przemek,

I have purposely set the maximum connections to 25k; we got nowhere near that and still ended up locked out. Is there any other parameter we can change, maybe extra_max_connections?

+---------------------------------------+-------+
| Variable_name                         | Value |
+---------------------------------------+-------+
| extra_max_connections                 | 1     |
| max_connect_errors                    | 1000  |
| max_connections                       | 25000 |
| performance_schema_max_cond_classes   | 80    |
| performance_schema_max_cond_instances | -1    |
+---------------------------------------+-------+
5 rows in set (0.00 sec)

On top of this, are there any proactive measures we can take in order to prevent this, or to get better logging? Nothing shows up in the error log when we are experiencing issues.

We had the issue again today. We already had a connection open on the database and we received the error attached.

Can someone please look and help us?

(attachment: photoid=52070)

The problem seems to be related to some locking, and increasing the max connections to 25k may only make the issue worse. The most important thing is to identify what is really happening. From the graphs you attached, it is clear that the monitoring agent keeps working during the incident, as there are no gaps in the graph. So, if you configure extra_port and at least a few extra_max_connections, you should be able to connect to that additional TCP port with your mysql client and check the details I mentioned earlier.
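For reference, a minimal my.cnf sketch for this (the port number 33306 and the connection count are just example values; extra_port only takes effect after a node restart):

[mysqld]
# Reserved TCP port that still accepts connections when the normal
# port is saturated or max_connections is exhausted
extra_port            = 33306
# How many connections are allowed on that extra port
extra_max_connections = 10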
You may also use pt-stalk tool to gather these and many more details. An example command using an example extra port would be:

pt-stalk --no-stalk --host=127.0.0.1 --port=2333 --user=root --password=*** --dest=/tmp/

This should be gathered from all the nodes.

There may also be crucial information in the error logs - can you attach them?

Btw, do you monitor PXC-related things, like Flow Control, wsrep_local_recv_queue, etc.? You may check how we organized the PXC-related templates in the PMM demo:
https://pmmdemo.percona.com/graph/d/s_k9wGNiz/pxc-galera-cluster-overview
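If PMM is not in place yet, the same counters can be checked by hand. As a sketch (placeholder credentials), a sustained non-zero wsrep_flow_control_paused or a growing wsrep_local_recv_queue on one node would point at the node that is throttling the cluster:

# Run on each node during or shortly after an incident.
mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused', 'wsrep_flow_control_sent',
   'wsrep_local_recv_queue', 'wsrep_local_send_queue')"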

Thanks Przemek,

We have rolled back the Percona release version from 5.7.22 to 5.7.21, as the issues first started after we patched the database. The current working theory is that this patch may have introduced a bug or a conflict with an existing setting.

We can look at enabling the extra port and installing the Percona tools in order to get the necessary information if the issue occurs again. Just two questions from me:

First, why would the extra port help us in this case? We seemed to get locked out of the database before we hit maximum connections, so what resource could be locked? As shown in the screenshot above, even when we have a reserved connection to the DB, we get an error that MySQL has gone away.

Secondly, the issue happens on one node only, so why does the entire cluster freeze? Shouldn't the cluster ignore the node which is having problems at the time?

MySQL has only one reserved connection, which can be used by any user with the SUPER privilege. But if you cannot log in even before hitting max connections, it is strange that monitoring keeps working at the same time. The graphs you shared don't show which nodes they were taken from, though.

The cluster will expel a node which is down or has network issues, but not one which is blocked internally while still responding to cluster communication. And a blocked node which cannot apply writes will eventually trigger Flow Control - a cluster-wide write pause.
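A rough way to tell the two situations apart is to compare the cluster's membership view with the node's own readiness - a blocked-but-still-member node typically keeps showing Primary/Synced while refusing new work. A sketch, with placeholder credentials:

# Run on each node during an incident.
# wsrep_cluster_status      : Primary means the node is still part of the quorum
# wsrep_cluster_size        : how many nodes this node currently sees
# wsrep_local_state_comment : Synced / Donor / Desynced etc.
# wsrep_ready               : ON means the node is accepting queries
mysql -uroot -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_status', 'wsrep_cluster_size',
   'wsrep_local_state_comment', 'wsrep_ready')"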

Again, this is all guesswork - I am sorry, but I don't have enough data to draw any conclusions. If the other nodes were working, the details I mentioned, at least from them, would already shed some light on this. Also, error logs are really important for the investigation; without them I can only guess.

Thank you

So all we can do is investigate why the DB gets blocked internally whilst the problem is occurring.

I will set up the extra port and install the Percona Toolkit in order to try and capture more information.
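A rough outline of those two steps, assuming a yum-based system with the Percona repository already configured (the package name percona-toolkit is standard; the port number is only an example and must match my.cnf):

# Install Percona Toolkit (provides pt-stalk, pt-summary, etc.)
sudo yum install -y percona-toolkit

# After adding extra_port/extra_max_connections to my.cnf and restarting
# the node, verify the reserved port answers even under load
mysql --host=127.0.0.1 --port=33306 -uroot -p -e "SELECT 1"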