My Percona XtraDB Cluster suddenly died: how do I fix it?

Hi,
My PXC is down, and this is the error message I got from the error log; please see the attachment.

Any idea why this happened and how to fix it?


PXC errorr message.docx (19.2 KB)

2020-08-26T21:10:55.088728+08:00 0 [Note] [MY-000000] [Galera] (995a4f35, 'tcp://0.0.0.0:4567') connection to peer 4855f0e4 with addr tcp://<IP address>:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2020-08-26T21:10:55.089033+08:00 0 [Note] [MY-000000] [Galera] (995a4f35, 'tcp://0.0.0.0:4567') connection to peer f2266321 with addr tcp://<IP address>:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
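For anyone triaging a similar log, here is a minimal sketch of how these timeout notes could be pulled out of an error log and counted per peer. The regular expression is written against the two lines quoted above only, so treat it as an assumption about the format rather than a complete parser; the function name `parse_peer_timeouts` is made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the Galera "timed out" note in the shape quoted in this thread.
# Assumption: other Galera versions may word this message differently.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) .*\[Galera\] \((?P<local>[0-9a-f]+), '[^']*'\) "
    r"connection to peer (?P<peer>[0-9a-f]+) with addr (?P<addr>.+?) timed out, "
    r"no messages seen in PT(?P<secs>\d+(?:\.\d+)?)S"
)

def parse_peer_timeouts(lines):
    """Return one dict per peer-timeout note found in the given log lines."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            events.append({
                "time": datetime.fromisoformat(m["ts"]),
                "local": m["local"],          # short ID of the node writing the log
                "peer": m["peer"],            # short ID of the peer that went silent
                "addr": m["addr"],
                "silent_for_s": float(m["secs"]),  # PT3S is an ISO 8601 duration: 3 seconds
            })
    return events

# The two lines from the post above (IP addresses were redacted by the poster).
sample = [
    "2020-08-26T21:10:55.088728+08:00 0 [Note] [MY-000000] [Galera] (995a4f35, 'tcp://0.0.0.0:4567') connection to peer 4855f0e4 with addr tcp://<IP address>:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)",
    "2020-08-26T21:10:55.089033+08:00 0 [Note] [MY-000000] [Galera] (995a4f35, 'tcp://0.0.0.0:4567') connection to peer f2266321 with addr tcp://<IP address>:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)",
]

events = parse_peer_timeouts(sample)
per_peer = Counter(e["peer"] for e in events)
```

If one node's log shows every peer timing out within the same second, as it does here, that suggests the problem is on this node or its link, not on any single peer.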

Your nodes lost their connections to each other, which points to a network outage. If all nodes are offline, you need to stop them all and then re-bootstrap the whole cluster.

But I pinged each node from the others, and they are all pingable!
I read that line too, but that does not seem to be the case. From your statement, are you telling me that a standalone boot with systemctl start mysql will be OK?

And the error message says the nodes communicate with each other over port 4567? MySQL uses 3306, right?
And only this cluster communicates over port 4567? My other clusters do not have this kind of problem.

If you have other PXC clusters on this same network and none of those clusters are having issues, then I’d say there is an issue with the nodes of this cluster causing network timeout issues. Look at your metrics/monitoring for CPU saturation, disk IO saturation, etc. There could be something else causing the node to be unable to process network packets and thus miss heartbeats and be ejected from the cluster.

“be unable to process network packets and thus miss heartbeats and be ejected from the cluster.”

So you are sure that it MUST BE a network problem? And why port 4567? I never use it for MySQL.

No, I am not sure it is directly network related. The logs say “connection timed out” which means network issues. However, many other things can cause “network issues.”

4444 is used by PXC/Galera for SST/IST. 4567 is used by PXC/Galera internal node-node communication. 3306 is used by MySQL.

“4444 is used by PXC/Galera for SST/IST. 4567 is used by PXC/Galera internal node-node communication. 3306 is used by MySQL.”

Good! Can I just run telnet <nodes IP> 4444 and telnet <nodes IP> 4567 to verify it? I suspect a firewall is blocking them.

Probably, yes, you can telnet to 4567 to see if you get a response from another node. 4444 only responds while an SST/IST is in progress.
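If telnet is not installed on the nodes, the same check can be sketched in a few lines of Python. This is only a plain TCP connect test, not anything Galera-specific; the helper name `port_is_open` and the addresses in the usage part are placeholders for illustration.

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    # Hypothetical node addresses (documentation range); replace with your own.
    nodes = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
    # Ports as described in this thread: 3306 for MySQL clients,
    # 4567 for node-to-node communication.
    for host in nodes:
        for port in (3306, 4567):
            state = "open" if port_is_open(host, port) else "closed or filtered"
            print(f"{host}:{port} {state}")
```

Run it from each node against the other nodes. Port 4444 is left out on purpose because, as noted above, nothing listens there unless a transfer is in progress, so a failed connect to it tells you nothing about the firewall.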

" SST/IST is in progress"

Is that replication?

No. You need to go learn PXC Basics 101 if you don’t know what IST/SST are. Fundamental to PXC/Galera operations.

Yeah! I may have forgotten it. It is for the start of replication and for streaming replication, right?

At the moment I want to troubleshoot the cluster first. Sorry.

No. IST/SST have nothing to do with replication. SST is for when new nodes join. IST is for when nodes leave and then come back.

" SST is for when new nodes join. IST is for when nodes leave and then come back."
Good, and thanks. I will have a look later.

Hi,
It was probably some network problem. Funnily enough, today the cluster came back up without any problem, and without my restarting or bootstrapping it again!
Amazing …
One question: if this happens again, and the links between the nodes break because of some network problem but recover later, will the cluster also re-form automatically?

Also, from what I found, if a situation like this happens again, do I really need to bootstrap again even if the cluster recovers by itself?

If all nodes are down, you must always bootstrap the first node.
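When every node is down, the usual question is which node to bootstrap first. Galera records its last known state in a `grastate.dat` file in the data directory, and a rough sketch of comparing those files across nodes could look like the following. The helper names, node names, and sample values are invented for illustration. The rule used here (prefer a node marked `safe_to_bootstrap: 1`, otherwise the highest `seqno`) is a simplification, so check the PXC documentation for your version before acting on it.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the "# GALERA saved state" header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state

def pick_bootstrap_node(states):
    """Given {node_name: parsed_grastate}, suggest a node to bootstrap, or None."""
    for name, st in states.items():
        if st.get("safe_to_bootstrap") == "1":
            return name
    # seqno -1 means the node did not shut down cleanly; its real position
    # has to be recovered first, so it is not compared here.
    known = {}
    for name, st in states.items():
        seqno = int(st.get("seqno", "-1"))
        if seqno >= 0:
            known[name] = seqno
    if not known:
        return None
    return max(known, key=known.get)

# Made-up file contents for three hypothetical nodes.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000001
seqno:   {seqno}
safe_to_bootstrap: {safe}
"""
states = {
    "node1": parse_grastate(SAMPLE.format(seqno=1500, safe=0)),
    "node2": parse_grastate(SAMPLE.format(seqno=1520, safe=0)),
    "node3": parse_grastate(SAMPLE.format(seqno=-1, safe=0)),
}
candidate = pick_bootstrap_node(states)
```

On systemd-based PXC installs the chosen node is typically started with the `mysql@bootstrap.service` unit, and the remaining nodes with a normal `systemctl start mysql`; confirm the unit name for your version.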