We run a production MySQL database in a three-node structure. All nodes run Percona XtraDB Cluster 8.0.25-15.1 in a multi-master setup. Later, I placed two more servers at a remote location and added them to the cluster (the new server version is 8.0.29…). This gave me a structure of 5 nodes in total. After about half an hour, the DB became unable to respond, and a "Lock wait timeout exceeded" error was seen in the monitoring agents.
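As an aside for anyone hitting the same symptom: while such a stall is in progress, the blocked and blocking sessions can usually be identified with the sys schema that ships with MySQL 8.0, for example:

```sql
-- Show which session is waiting and which one holds the lock
-- (standard sys schema view in MySQL 8.0 / PXC 8.0):
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       wait_age
FROM sys.innodb_lock_waits;
```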
The wsrep config defined on the nodes is as follows:
@bthnklc,
You should not have installed 8.0.29 on the new nodes. All nodes must run the same version. Additionally, wsrep_cluster_address should be the same on all nodes, including node1. When you bootstrap the cluster, use systemctl start mysql@bootstrap instead of an empty gcomm:// address.
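For reference, the usual start sequence on PXC 8.0 looks like this (a sketch, not your exact environment):

```bash
# On the first node only, to bootstrap a new cluster:
systemctl start mysql@bootstrap

# On every other node, to join the running cluster:
systemctl start mysql
```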
I used systemctl start mysql@bootstrap when starting the cluster, and I had no problem at that stage. As for "Additionally, wsrep_cluster_address should be the same on all nodes, including node1": I don't quite understand what to do here. Could you explain?
You should use the same "wsrep_cluster_address" value (i.e. wsrep_cluster_address=gcomm://node1,node2,node3,node4,node5) on all nodes, not an incomplete list on each.
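A sketch of the relevant my.cnf section, with placeholder hostnames and IPs (only wsrep_node_name and wsrep_node_address differ per node):

```ini
[mysqld]
# Identical full member list on every node:
wsrep_cluster_address = gcomm://node1,node2,node3,node4,node5
wsrep_cluster_name    = my_pxc_cluster   # placeholder name

# The only per-node values (example for node1):
wsrep_node_name    = node1
wsrep_node_address = 10.0.0.1            # placeholder IP
```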
Can you explain further what happened to you?
From what I understand: your 3-node cluster ran fine for some time. Then you added 2 more nodes; they successfully joined the cluster, but about half an hour later the DB became unresponsive. If anything is different, please clarify.
Did you see any errors in the log?
Did you see flow control? (a quick check is shown right after these questions)
Did you run any DDL?
Was just one node (or a few nodes) unresponsive, or were all nodes unresponsive?
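A quick way to check for flow control, assuming you can still connect to a node:

```sql
-- Fraction of time replication was paused since the counters were reset,
-- plus how many pause events this node sent and received:
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
```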
Set aside the way you configure the nodes with "wsrep_cluster_address"; that is NOT the problem you are having, it is just a complicated way of joining the nodes.
As for the issue you are having: with this lack of information it is not possible to tell you accurately what happened.
We can guide you to troubleshoot it from your side, because many things can happen. Some questions you should review:
When you say that the database "became unable to respond", do you mean that:
it stalls?
nodes leaving the cluster?
errors in the error log?
high latency?
performance issues?
Does the above happen constantly, frequently, or rarely? Randomly? At peak load or at specific times?
Does disconnecting the last 2 nodes fix the problem?
Do you see flow control?
Do all the nodes have the same hardware and configuration?
Are all the nodes connected with low latency?
Have the application's behavior, configuration, or queries changed lately?
All of the above (and more) can have an impact on cluster performance, but it is still not clear what type of issue you are getting and how frequently.
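For example, a minimal health snapshot to run on each node while the issue is happening (all standard Galera status variables):

```sql
-- Node and cluster state: expect 'Synced' and 'Primary' on a healthy node
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

-- Replication queues: a persistently large recv queue triggers flow control
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue_avg';
```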
Thank you for your answer, @CTutte. I may have given incomplete information because I encountered a situation like this for the first time and could not find any data about the cause. I'm sorry for that.
When I say that the database "became unable to respond", I mean that:
The servers were up and the database was running, but after connecting to MySQL through any node, none of my queries got a response.
The application could not open new connections, and existing connections became unresponsive.
After shutting down the nodes that I added later (the ones on the newer version), the same problem continued. I had to restart the existing 3 nodes, and the problem was solved. I am back to the original 3 nodes and am not adding new nodes at the moment.
If some limit was reached, i.e. max memory, max connections, flow control, etc., you will be able to check that information after the issue to correctly identify what happened at the time.
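For instance, connection-limit pressure can be checked after the fact like this (a sketch; memory limits depend on your setup and monitoring):

```sql
-- Configured ceiling vs. observed peak and current usage:
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
```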