Database not responding after adding a server to the cluster

We run a production MySQL database on a three-node cluster. All nodes are multi-master and run Percona XtraDB Cluster 8.0.25-15.1. Later, I placed two more servers at a remote location and added them to the cluster (the new servers run version 8.0.29…). This gave me a cluster with 5 nodes in total. After about half an hour, the database stopped responding and the monitoring agents reported a "lock wait timeout exceeded" error.

The wsrep config defined on the nodes is as follows:

node1 : wsrep_cluster_address=gcomm://
node2 : wsrep_cluster_address=gcomm://node1
node3 : wsrep_cluster_address=gcomm://node1,node2
node4 : wsrep_cluster_address=gcomm://node1,node2,node3
node5 : wsrep_cluster_address=gcomm://node1,node2,node3,node4

Is it because I didn’t set the wsrep parameters correctly? Or was the interruption caused by a parameter I forgot or overlooked?

I would appreciate your help.
Best regards.

@bthnklc,
You should not have installed 8.0.29 on the new nodes. All nodes must run the same version. Additionally, wsrep_cluster_address should be the same on all nodes, including node1. When you bootstrap the cluster, use systemctl start mysql@bootstrap instead of an empty gcomm://

I used systemctl start mysql@bootstrap when starting the MySQL cluster, and I have no problem at that stage. "Additionally, wsrep_cluster_address should be the same on all nodes, including node1." I don't quite understand what to do about this part. Could you explain?

Hi bthnklc,

You should use the same “wsrep_cluster_address” value (i.e. wsrep_cluster_address=gcomm://node1,node2,node3,node4,node5) on all nodes, not an incomplete list on each.
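
One way to confirm that the setting is consistent is to run the same query on each node and compare the output. This is only a sketch: the host names node1 through node5 are the ones used in this thread, and the value in the comment assumes the full five-node list.

```sql
-- Run on every node; each one should report the identical list,
-- e.g. gcomm://node1,node2,node3,node4,node5 (host names from this thread).
SHOW GLOBAL VARIABLES LIKE 'wsrep_cluster_address';

-- Number of nodes currently in the cluster, as seen by this node.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
```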

Can you explain further what happened?
From what I understand: your 5-node cluster ran fine for some time. Then you added 2 more nodes; they successfully joined the cluster, but 1 hour later some DB became unresponsive. If it was something different, please clarify.

Did you see any errors in the log?
Did you see flow control?
Did you run any DDL?
Was just 1 node (or a few nodes) unresponsive, or was every node unresponsive?
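
For the flow control question, the wsrep status counters can be checked on each node. A minimal sketch; which values count as a problem depends on the workload, so treat the comments as rough guidance rather than thresholds.

```sql
-- Flow control counters for this node (paused fraction, messages sent/received).
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';

-- Write-sets waiting to be applied on this node; a queue that keeps
-- growing suggests this node is the one slowing the cluster down.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
```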

Regards

Hi @CTutte ,

We have a database system that runs 24/7. Our prod system runs on 3 nodes. The wsrep configuration is as follows.

node1 : wsrep_cluster_address=gcomm://
node2 : wsrep_cluster_address=gcomm://node1
node3 : wsrep_cluster_address=gcomm://node1,node2

I added 2 more nodes to our running prod system. Their wsrep configuration is as follows.

node4 : wsrep_cluster_address=gcomm://node1,node2,node3
node5 : wsrep_cluster_address=gcomm://node1,node2,node3,node4

I ran the bootstrap and the nodes joined the cluster.

After about an hour, the database became unresponsive. Nothing appeared in the database error logs, and no node was able to respond.

The error reaching the application and the monitoring tools that connect to the database was ‘mysql lock wait timeout exceeded’.

Sorry for getting back to you late.

Hi again, any update please? @CTutte @matthewb

Hi again,

Set aside the way you configure the nodes with “wsrep_cluster_address”. This is NOT the problem you are having; it is just a complicated way of joining the nodes.

As for the issue you are having, with so little information it is not possible to tell you accurately what happened.
We can guide you in troubleshooting it from your side, because many things could have happened. Some questions you should review:

When you say that the database “became unable to respond”, do you mean that:

  • it has stalls?
  • nodes leaving the cluster?
  • errors on the error log?
  • high latency?
  • performance issues?

Does the above happen constantly, frequently, or rarely? Randomly, at peak load, or at specific times?

Does disconnecting the last 2 nodes fix the problem?

Do you see flow control?
Do all the nodes have the same hardware and configuration?
Are all the nodes connected with low latency?
Have the application's configuration, behavior, or queries changed lately?

All of the above (and more) can have an impact on cluster performance, but it is still not clear what type of issue you are getting or how often.
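
Several of these questions can be answered from a MySQL session on each node. The queries below are a sketch, not a complete diagnostic procedure; they assume MySQL 8.0 with the sys schema installed, and the LIMIT of 10 is an arbitrary choice.

```sql
-- Cluster membership and node state as seen from this node.
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_size', 'wsrep_cluster_status',
   'wsrep_local_state_comment', 'wsrep_ready');

-- Oldest open InnoDB transactions; a long-running one can hold locks
-- that lead to "lock wait timeout exceeded" in other sessions.
SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 10;

-- Which sessions are blocked and which sessions block them (sys schema view).
SELECT waiting_pid, waiting_query, blocking_pid, blocking_query
FROM sys.innodb_lock_waits;
```

Running these while the problem is occurring is what matters; after a restart the transaction and lock information is gone.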

Regards

@bthnklc,

Please upgrade all nodes so they are the same version.
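
A quick way to compare versions is to run the following on each node and check that the output matches. This is a minimal sketch using standard server variables and the wsrep provider status.

```sql
-- Server version and build comment; should match on every node.
SELECT @@version, @@version_comment;

-- Galera/wsrep provider version; should also match across nodes.
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
```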

Thank you for your answer, @CTutte. I may have given incomplete information because this is the first time I have encountered such a situation and I could not find any data pointing to the cause. I’m sorry for that.

When I say that the database “became unable to respond”, I mean that:

The servers were up and the database was up, but after connecting to MySQL through any node, I could not get a response to any query.
No new connections from the application were accepted, and existing connections became unresponsive.

After shutting down the nodes that I had added later, which run the newer version, the same problem continued. I had to restart the existing 3 nodes, and that solved the problem. I am back to the original 3 nodes and am not adding new nodes at the moment.

I performed the upgrade on the 3 existing nodes. If no other specific information comes in, I will try again this way. Thank you, @matthewb.

I also suggest you install PMM (Percona Monitoring and Management) to monitor server resources and MySQL behavior.

If some limit is reached (e.g. max memory, max connections, flow control), you will be able to review the data after the issue and correctly identify what happened at the time.
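
Without a monitoring tool, the connection limit can at least be checked by hand on a running node. A small sketch using standard MySQL variables and status counters:

```sql
-- Configured connection limit versus the peak reached since startup.
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';

-- Connections open right now.
SHOW GLOBAL STATUS LIKE 'Threads_connected';
```

These counters reset when the server restarts, so they cannot show what happened before a restart; a tool that keeps history avoids that gap.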

I will repeat the work; I hope it does not cause interruptions again.
Thank you very much! @CTutte and @matthewb
