Our production database runs Percona MySQL 8.0.29 as a 5-node multi-master (Galera) cluster: 3 nodes are in dc1 and 2 nodes are in dc2. I bootstrapped the cluster from one of the dc1 nodes, and it ran without interruption for a long time. Recently, all of the dc1 nodes went down. When I checked the grastate.dat files, the seqno on the dc2 nodes was ahead of the dc1 nodes.
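To make the 3 + 2 split concrete, here is a minimal sketch of the weighted-majority (quorum) rule Galera applies when nodes disappear. It assumes every node keeps the default pc.weight of 1 and that the missing nodes dropped out unexpectedly (crash or network cut) rather than leaving gracefully; the node names are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical node names; every node uses Galera's default pc.weight of 1.
WEIGHTS: Dict[str, int] = {
    "dc1-a": 1, "dc1-b": 1, "dc1-c": 1,  # 3 nodes in dc1
    "dc2-a": 1, "dc2-b": 1,              # 2 nodes in dc2
}


def has_quorum(surviving: Iterable[str], weights: Dict[str, int]) -> bool:
    """Simplified Galera rule: a partition stays Primary only if it holds a
    strict majority of the previous Primary component's total weight.
    Assumes the missing nodes failed or were cut off (not a graceful leave)."""
    total = sum(weights.values())
    alive = sum(weights[node] for node in surviving)
    return 2 * alive > total


if __name__ == "__main__":
    print(has_quorum(["dc1-a", "dc1-b", "dc1-c"], WEIGHTS))  # True  (3 of 5)
    print(has_quorum(["dc2-a", "dc2-b"], WEIGHTS))           # False (2 of 5)
```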
In the logs, the servers that shut down show the following errors:
“2023-01-18T21:05:58.869367+03:00 0 [Note] [MY-000000] [Galera] (69066974-8695, ‘ssl://0.0.0.0:4567’) connection to peer f77bafb6-824e with addr ssl://172.19.0.158:4567 timed out, no messages seen in PT3S, socket stats: rtt: 6030 rttvar: 10758 rto: 208000 lost: 0 last_data_recv: 3080 cwnd: 8 last_queued_since: 259064144 last_delivered_quetes_send_since: 30 : 0 segment: 0 messages: 0 segment: 1 messages: 0 (gmcast.peer_timeout)”
“2023-01-19T19:06:04.783005+03:00 0 [Warning] [MY-000000] [Galera] Member 4.0 (prd-mysql8-02) requested state transfer from ‘any’, but it is impossible to select State Transfer donor: Resource temporarily unavailable”
“2023-01-19T19:08:15.969783+03:00 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 → 0
2023-01-19T19:08:15.969804+03:00 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
2023-01-19T19:08:15.978155+03:00 0 [Note] [MY-000000] [Galera] Deferred close timer started for socket with remote endpoint: ssl://172.20.0.156:4567
2023-01-19T19:08:15.980612+03:00 0 [Note] [MY-000000] [Galera] gcomm: closed
2023-01-19T19:08:15.980645+03:00 0 [Note] [MY-000000] [Galera] /usr/sbin/mysqld: Terminated.
2023-01-19T19:08:15.980654+03:00 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation”
First, could you comment on what causes these errors?
When I tried to start the dc1 nodes, they did not come up, even though they were joining the cluster.
The related error log is attached:
Error.log (23.2 KB)
Since I could not bring them back up, I shut down all nodes, accepting the risk of data loss. I bootstrapped from the first server in dc1 and brought up all the dc1 nodes. Then, on the two dc2 servers, I deleted the data directory and started the databases again.
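For reference, the usual rule of thumb is to bootstrap from the node with the highest seqno in grastate.dat. The sketch below compares the files under that assumption; the node names and seqno values are made up, and it deliberately refuses to choose when a node reports seqno -1, because that node's position has to be recovered first (for example with mysqld --wsrep-recover).

```python
from typing import Dict, Optional


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file, skipping comments."""
    state: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def pick_bootstrap_node(files: Dict[str, str]) -> Optional[str]:
    """Return the node whose grastate.dat has the highest seqno.

    Returns None if any node reports seqno -1 (position unknown, e.g. after a
    crash); recover that position first before comparing.
    """
    seqnos: Dict[str, int] = {}
    uuids = set()
    for node, text in files.items():
        state = parse_grastate(text)
        seqno = int(state.get("seqno", "-1"))
        if seqno < 0:
            return None
        seqnos[node] = seqno
        uuids.add(state.get("uuid"))
    if len(uuids) > 1:
        # Seqnos are only comparable within the same cluster state UUID.
        raise ValueError(f"nodes report different cluster UUIDs: {uuids}")
    return max(seqnos, key=lambda n: seqnos[n]) if seqnos else None


# Hypothetical node names and made-up seqno values, for illustration only.
SAMPLE = {
    "dc1-a": "# GALERA saved state\nversion: 2.1\nuuid: 0000-aaaa\n"
             "seqno: 1500\nsafe_to_bootstrap: 0\n",
    "dc2-a": "# GALERA saved state\nversion: 2.1\nuuid: 0000-aaaa\n"
             "seqno: 1720\nsafe_to_bootstrap: 0\n",
}

if __name__ == "__main__":
    print(pick_bootstrap_node(SAMPLE))  # dc2-a
```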
I would appreciate help understanding the cause of this problem and how to prevent it from happening again.
By the way, I monitor the databases with PMM, which I installed following your earlier advice. However, I couldn't find where to view the data from the time of the incident.