Database cluster hang problem

Hello,

We have a 5-node Percona XtraDB Cluster in which all nodes run as master-master. We are using MySQL version 8.0.29. Three nodes are located in dc1 and two in dc2. I added a slave DB node to one of the dc2 nodes, replicating with a 3-hour delay. About half an hour later, all the master servers went down.
Note: the slave DB server I added is installed as Percona Server. I have attached the slave DB's log and the log of the master server to which the slave DB is connected. I would appreciate your knowledge and experience on whether this structure is workable and on why the cluster shut down.
dc2_error_20_158_slave.log (7.1 KB)
error_19_156_other.log (16.0 KB)
dc1_error_19_156_main.log (370.5 KB)
dc2_error_20_157.log (41.8 KB)

Hi @bthnklc

I think there are several things in your configuration that can lead to problems:

  • A 5-node PXC cluster split across two DCs (if communication between them breaks for whatever reason, dc2 will go down, because its two nodes cannot form a majority)
  • PXC with multiple writing nodes and foreign keys.
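
To illustrate the first point: Galera keeps a partition running only if it holds more than half of the cluster's total weight. Below is a minimal sketch of that arithmetic, assuming every node has the default weight of 1. The function name has_quorum is made up for illustration and is not part of PXC.

```shell
#!/bin/sh
# Illustrative only: has_quorum is a hypothetical helper, not a PXC tool.
# Assumes every node carries the default weight of 1.
# usage: has_quorum <nodes_in_partition> <total_nodes>
has_quorum() {
    # A partition stays Primary only with strictly more than half the nodes.
    if [ $(( $1 * 2 )) -gt "$2" ]; then
        echo yes
    else
        echo no
    fi
}

echo "dc1 (3 of 5): $(has_quorum 3 5)"
echo "dc2 (2 of 5): $(has_quorum 2 5)"
```

With a 3/2 split, only the dc1 side can keep quorum if the link between the sites drops; the two dc2 nodes would stop accepting queries until they rejoin.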

It looks like something happened to your cluster, and then there was a cascade of events that brought the whole cluster down.

I recommend that you try to simplify the architecture to find the root cause of the problem:

  • Do not write into multiple nodes.
  • Remove triggers (or use them only for integrity validation, not to propagate changes).

Pep

There is no problem with the connection between DC1 and DC2; that has never happened. We use a 10-gigabit connection. In the existing 5-node structure, the applications are pointed at only one server through DNS. I have a question here: if a write is performed on one of the other nodes in the structure I described, does the cluster still work?

2023-02-20T08:00:29.003926Z 0 [Note] [MY-000000] [Galera] (92d437c2-a685, 'ssl://0.0.0.0:4567') connection to peer f31f99f1-92e7 with addr ssl://172.19.0.157:4567 timed out, no messages seen in PT3S, 

Looks like there were some timeouts. This is why I was talking about possible network issues.
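
PT3S in that message is an ISO 8601 duration, meaning 3 seconds without any message from the peer. To see how often this happened and for which peers, something like the following could be run against each error log. It is only a sketch: count_peer_timeouts is a made-up helper name, and it assumes the log lines have the same shape as the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count Galera peer timeouts per peer address.
# Assumes log lines shaped like:
#   ... connection to peer <id> with addr <addr> timed out, no messages seen in PT3S
# usage: count_peer_timeouts <error-log-file>
count_peer_timeouts() {
    # Keep only the peer address from matching lines, then tally per address.
    sed -n 's/.* with addr \([^ ]*\) timed out, no messages seen.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}

# Example:
#   count_peer_timeouts dc1_error_19_156_main.log
```

Each output line is a count followed by the peer address, most frequent first. Comparing the timestamps of those lines across nodes should show whether the timeouts started before or after the first node left the cluster.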

The cluster should work if any node gets writes, but this can cause concurrency issues.
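
One way to see whether concurrent writes are already conflicting is to look at two Galera status counters on each node: wsrep_local_cert_failures (local transactions that failed certification) and wsrep_local_bf_aborts (local transactions aborted by replicated ones). A rough sketch follows; conflict_counters is a made-up helper name, and the example command assumes the mysql client can log in on the node.

```shell
#!/bin/sh
# Hypothetical helper: pick the two conflict counters out of
# tab-separated "name<TAB>value" status lines read from stdin.
conflict_counters() {
    awk -F'\t' '
        $1 == "wsrep_local_cert_failures" || $1 == "wsrep_local_bf_aborts" {
            print $1 "=" $2
        }'
}

# Example (assumes the mysql client can connect on this node):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local%'" | conflict_counters
```

If these counters keep growing while more than one node receives writes, the application is hitting write conflicts, which it typically sees as deadlock errors at commit time.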

Thank you for the information. Regardless of this issue, I have one more question. In a 3-node cluster, I bootstrap from one server and then start the other nodes in order. When restarts are required, I can restart the other nodes one by one with systemctl restart, but the node I bootstrapped does not restart after it is shut down. I would like your help on how to proceed here.

Do you mean the database does not restart when you restart the node? Or that issuing systemctl restart mysql does nothing?

In the second case, try the following:

systemctl stop mysql@bootstrap # only if the other nodes are running!
systemctl start mysql

Then check if systemctl restart mysql works.