Database outage problem

Hi,

There is an important matter I would like to consult you about. First, let me describe our current database setup. We run a Percona XtraDB Cluster (8.0.29) with 3 nodes in DC1, where the application servers are also located. DC2 hosts an equivalent set of application servers.

Here is what I want to do and what I have done so far. I installed 2 database nodes in DC2 and joined them to the cluster, so I ended up with a 5-node cluster. My goal is to be able to run the application servers from DC2 and keep the service available, since a copy of the data would also be in DC2 in case something happens to DC1.

However, for the last 2 days this setup has unfortunately been causing outages. In the 5-node configuration, the database shuts down on a single node and I cannot find the reason. When I bring that node back up, I see that it does not participate in replication: it reports wsrep_cluster_status = non-Primary and wsrep_ready = OFF. When I shut down the 2 nodes I installed in DC2, the problem goes away; a similar situation occurs 3-4 hours after I turn the DC2 nodes back on. We had an outage last night, and I am sharing the error.log from it. I would appreciate your suggestions and guidance.
error.log (135.6 KB)
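
For reference, the values mentioned above are standard Galera status variables; a minimal health check on the node that dropped out looks like this (nothing here is specific to my setup):

```sql
-- Standard Galera/wsrep status checks on the affected node
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expected: Primary (I see non-Primary)
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- expected: ON (I see OFF)
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expected: 5 with both DC2 nodes joined
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- e.g. Synced / Donor/Desynced / Joining
```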

> Operation CREATE USER failed for 'Reportuser'@'172.19.0.158'

There are inconsistencies between DC1 and DC2. I suggest you erase the DC2 nodes and re-create them: just stop mysql, erase the $datadir, and let node 1 in DC2 perform a full SST from DC1.
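
A minimal sketch of that procedure, assuming the systemd service is named mysql and the datadir is the default /var/lib/mysql (adjust both to your environment), run on one DC2 node at a time:

```bash
# Stop the node that will be rebuilt
sudo systemctl stop mysql

# Wiping the datadir removes grastate.dat, so the node cannot do an
# incremental (IST) transfer and must rejoin with a full SST
sudo rm -rf /var/lib/mysql/*

# On startup the node contacts the members in wsrep_cluster_address
# and pulls a complete copy of the dataset from a DC1 donor via SST
sudo systemctl start mysql
```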


First, I should point out that the user creation error in the log is a warning caused by my own faulty operations after I shut down the DC2 nodes following the problem; it is not related to WSREP. For the main problem, you suggested that the DC2 nodes should be reinstalled. Did I understand that correctly? I am asking to confirm. Also, how did you come to the conclusion that the SST was not complete?


I read your logs. The user creation error caused both of your DC2 nodes to fail. It is not a warning, it is an error. It caused a transaction conflict and a subsequent quorum vote which evicted the two DC2 nodes (error.log#L487). In my opinion, this is an unstable cluster, and you should erase both DC2 nodes and let them redo the SST process to get a consistent dataset.

What is the lag time between DC1 and DC2? Your cluster will only be as fast as the slowest link between DC1 and DC2.

Also, look at the gmcast.segment parameter to optimize DC<->DC network traffic.


Thank you very much for your reply and review, @matthewb. I don't understand what you mean by lag time: do you mean network speed, or something else, between DC1 and DC2? I'm asking so I can give you the right answer. Also, how is the gmcast.segment parameter value calculated?


https://www.google.com/search?q=network+lag

> Also, how is the gmcast.segment parameter value calculated?

It is not calculated; it is simply a label for the network segment (data center) a node sits in. Set all 3 nodes in DC1 to use segment 0, and set the 2 nodes in DC2 to use segment 1.
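
In case it helps, a minimal sketch of how this is typically set in my.cnf via wsrep_provider_options (note that this variable is a single semicolon-separated string, so keep any provider options you have already configured on the same line):

```ini
# my.cnf on the 3 DC1 nodes
[mysqld]
wsrep_provider_options="gmcast.segment=0"

# my.cnf on the 2 DC2 nodes
[mysqld]
wsrep_provider_options="gmcast.segment=1"
```

The change takes effect after restarting each node; you can confirm it afterwards with SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';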


Noted. I will make the edits on this subject and will look into it with our network admin.

Thank you for your feedback.
This information is very valuable to me.
