Problem of inconsistent shutdowns and databases not starting on DC1

We run a 5-node Percona MySQL cluster, version 8.0.29, in master-master mode as our production database environment. Three nodes are located in DC1 and two nodes in DC2. I bootstrapped the cluster from one of the nodes in DC1 and it ran uninterrupted for a long time. Recently, all of the nodes in DC1 went down. When I looked at the grastate.dat files, the seqno of the nodes in DC2 was ahead.
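For context, this is roughly how I compared the positions of the nodes (paths are the defaults in my environment; the values shown are placeholders, not from the incident):

# On a stopped node: the position recorded at shutdown
cat /var/lib/mysql/grastate.dat
#   uuid:              <cluster UUID>
#   seqno:             <last applied write-set, -1 if the shutdown was not clean>
#   safe_to_bootstrap: 0 or 1

# On a running node: the last committed write-set
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';"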

When I examined the logs, I saw that the servers that shut down had logged the following errors:

“2023-01-18T21:05:58.869367+03:00 0 [Note] [MY-000000] [Galera] (69066974-8695, ‘ssl://0.0.0.0:4567’) connection to peer f77bafb6-824e with addr ssl://172.19.0.158:4567 timed out, no messages seen in PT3S, socket stats: rtt: 6030 rttvar: 10758 rto: 208000 lost: 0 last_data_recv: 3080 cwnd: 8 last_queued_since: 259064144 last_delivered_quetes_send_since: 30 : 0 segment: 0 messages: 0 segment: 1 messages: 0 (gmcast.peer_timeout)”

“2023-01-19T19:06:04.783005+03:00 0 [Warning] [MY-000000] [Galera] Member 4.0 (prd-mysql8-02) requested state transfer from ‘any’, but it is impossible to select State Transfer donor: Resource temporarily unavailable”

“2023-01-19T19:08:15.969783+03:00 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 → 0
2023-01-19T19:08:15.969804+03:00 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2023-01-19T19:08:15.978155+03:00 0 [Note] [MY-000000] [Galera] Deferred close timer started for socket with remote endpoint: ssl://172.20.0.156:4567
2023-01-19T19:08:15.980612+03:00 0 [Note] [MY-000000] [Galera] gcomm: closed
2023-01-19T19:08:15.980645+03:00 0 [Note] [MY-000000] [Galera] /usr/sbin/mysqld: Terminated.
2023-01-19T19:08:15.980654+03:00 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation”

First of all, I would appreciate your comments on the likely cause of these errors.

When I tried to start the databases on DC1, they would not come up, even though the nodes appeared to join the cluster.
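For reference, these are the checks I normally run to confirm whether a node is actually part of the primary component and ready to serve (a sketch; the variable names are standard Galera status variables, and the expected values shown are illustrative, not taken from the incident):

mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_status','wsrep_local_state_comment','wsrep_ready','wsrep_cluster_size');"
# Expected on a healthy node: Primary / Synced / ON / 5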

The related error log is attached:
Error.log (23.2 KB)

Since I could not bring them back up, I shut down all the nodes, accepting the risk of data loss. I bootstrapped from the first server on DC1 and brought all the databases on the DC1 side up. Then I went to DC2, deleted the data directory on the two servers there, and started the databases so they would rejoin with a full SST.
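In case it is useful, this is roughly the sequence I followed (from memory; the paths and service names are the PXC defaults on my servers and may differ elsewhere):

# 1. On the DC1 node chosen as the starting point: mark it safe to bootstrap
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
systemctl start mysql@bootstrap.service

# 2. On the remaining DC1 nodes: start normally so they join via IST/SST
systemctl start mysql

# 3. On the two DC2 nodes: wipe the data directory and start, forcing a full SST
systemctl stop mysql
rm -rf /var/lib/mysql/*
systemctl start mysql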

I would appreciate your help in understanding the cause of this problem and in preventing it from happening again.

By the way, I monitor my databases with PMM, which I installed following the advice you gave earlier. However, I could not find where to see the data from the time of the incident.


Hi @bthnklc

Thank you for being part of the Percona Community.

From the errors you are reporting, there seems to be a network issue. For example, the first error tells us that one node saw no messages from a peer for 3 seconds and timed out the connection (gmcast.peer_timeout). The following errors are related to the inability to find other nodes (the empty view and the failure to perform an SST).

What is a bit uncommon is that the two-node data center (DC2) survived being a minority. This situation makes me think it was not a communication problem between both DCs and that the issue appeared at different times. Do you know if any network maintenance operation took place when the nodes in DC1 were expelled from the cluster?

If my theory proves correct and it was a communication issue, then it should not happen again unless the communication issue repeats.
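If it does turn out to be intermittent latency between the data centers, one thing you could consider is relaxing the Galera failure-detection timeouts and assigning each data center its own segment so that inter-DC replication traffic is reduced. A rough my.cnf sketch, with illustrative values only (the defaults are evs.suspect_timeout=PT5S, evs.inactive_timeout=PT15S, gmcast.peer_timeout=PT3S):

[mysqld]
# Illustrative values: tolerate longer silences before declaring a peer dead.
# gmcast.segment should differ per data center (e.g. 0 on DC1 nodes, 1 on DC2 nodes).
wsrep_provider_options="gmcast.segment=0;gmcast.peer_timeout=PT6S;evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S"

Whatever values you choose, keep evs.inactive_timeout well above evs.suspect_timeout so a slow peer is first suspected rather than immediately excluded.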

PMM has a dashboard for PXC that should give you information about when the incident happened.

Thank you!

Pep


Hi, @Pep_Pla

Sorry for the late reply, I’ve had some health problems for the last 3 days :frowning: As far as I know and have been able to check, no network maintenance was done on the systems. I will review the network-side logs from the time of the interruption and get back to you.

Thanks.


Hi @bthnklc

I hope you are feeling better now.

Pep
