
My Percona XtraDB Cluster suddenly died: how do I fix it?

DBA100 (Patron)
Hi, my PXC is down, and this is the error message I got from the error log. Please see attached.

Any idea why this happened, and how to fix it?


Answers

  • matthewb (Percona Staff)
    edited August 26
    2020-08-26T21:10:55.088728+08:00 0 [Note] [MY-000000] [Galera] (995a4f35, 'tcp://0.0.0.0:4567') connection to peer 4855f0e4 with addr tcp://<IP address>:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)

    Your nodes lost their connections to each other, which indicates a network outage. If all nodes are offline, you need to stop them all and then re-bootstrap the whole cluster.
  • DBA100 (Patron)
    edited August 27
    But the nodes can ping each other!
    I read that line too, but that doesn't seem to be the case. From your statement, it sounds like you are telling me that a standalone start with systemctl start mysql will be OK?

    And the error message says the nodes communicate with each other over port 4567? MySQL uses 3306, right?
    And only this cluster communicates over port 4567? My other clusters don't have this kind of problem.
  • matthewb (Percona Staff)
    If you have other PXC clusters on this same network and none of those clusters are having issues, then I'd say there is an issue with the nodes of this cluster causing network timeout issues. Look at your metrics/monitoring for CPU saturation, disk IO saturation, etc. There could be something else causing the node to be unable to process network packets and thus miss heartbeats and be ejected from the cluster.
  • DBA100 (Patron)
    "be unable to process network packets and thus miss heartbeats and be ejected from the cluster."

    So you are sure that it MUST BE a network problem? And why port 4567? I never use it for MySQL.
  • matthewb (Percona Staff)
    No, I am not sure it is directly network related. The logs say "connection timed out" which means network issues. However, many other things can cause "network issues."

    4444 is used by PXC/Galera for SST/IST. 4567 is used by PXC/Galera internal node-node communication. 3306 is used by MySQL.
  • DBA100 (Patron)
    "4444 is used by PXC/Galera for SST/IST. 4567 is used by PXC/Galera internal node-node communication. 3306 is used by MySQL."

    Good! Can I just run telnet <node IP> 4444 and telnet <node IP> 4567 to verify it? I am thinking a firewall is blocking it.

  • matthewb (Percona Staff)
    Probably, yes, you can telnet to 4567 to see if you get a response from another node. 4444 only responds while an SST/IST is in progress.
  • DBA100 (Patron)
    "SST/IST is in progress"

    Is that replication?
  • matthewb (Percona Staff)
    No. You need to go learn PXC Basics 101 if you don't know what IST/SST are. They are fundamental to PXC/Galera operations.
  • DBA100 (Patron)
    edited August 27
    Yeah! I might have forgotten it. It is for the start of replication and the stream replication, right?

    At this moment I want to troubleshoot the cluster first. Sorry.
  • matthewb (Percona Staff)
    No. IST/SST have nothing to do with replication. SST is for when new nodes join. IST is for when nodes leave and then come back.
  • DBA100 (Patron)
    "SST is for when new nodes join. IST is for when nodes leave and then come back."
    Good, and thanks. I will have a look later.

  • DBA100 (Patron)
    edited August 27
    Hi,
    It was probably some network problem. Funnily enough, today the cluster is up again without any problem, and without my restarting or bootstrapping it!
    Amazing.
    One question: if this happens again, and the links between the nodes break because of some network problem but recover later, will the cluster also re-form automatically?

    And if a situation like this happens again, do I really need to bootstrap again even if it recovers by itself automatically?
  • matthewb (Percona Staff)
    If all nodes are down, you must always bootstrap the first node.
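
The telnet check discussed in the thread can also be scripted so that every node and port is tested in one go. The following is only a minimal Python sketch, not an official Percona tool. The function names `port_is_open` and `check_node` are made up for this illustration. Port 4568 is not mentioned in the thread; it is included on the assumption that the cluster uses Galera's default IST port, while 4444 is the default SST port. A successful connect only shows that something is listening and that no firewall drops the packets; it does not prove the node is healthy.

```python
import socket

# Ports named in the thread, plus 4568 (assumed: Galera's default IST port).
PXC_PORTS = {
    3306: "MySQL client connections",
    4444: "SST (listens only while a transfer is running)",
    4567: "Galera node-to-node communication",
    4568: "IST (listens only while a transfer is running)",
}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_node(host, ports=PXC_PORTS, timeout=3.0):
    """Return a dict mapping each port to True (open) or False for one node."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

To use it, call `check_node("<node IP>")` from each node against every other node. If 4567 shows as closed while the remote mysqld is running, a firewall or a saturated host is a likely suspect. Closed 4444 and 4568 ports are normal when no state transfer is in progress.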

MySQL, InnoDB, MariaDB and MongoDB are trademarks of their respective owners.
Copyright ©2005 - 2020 Percona LLC. All rights reserved.