Database outage problem

Hi,

There is an important matter I would like to consult you about. First, let me describe our current database setup. We run a Percona XtraDB Cluster (8.0.29) with 3 nodes in DC1, where the application servers are also located. DC2 hosts an equivalent set of application servers.

Here is what I want to do and what I have done so far. I installed 2 database nodes in DC2 and joined them to the cluster, so I ended up with a 5-node cluster. My goal is to be able to run the application servers from DC2 and keep the service available, since a copy of the data would also be in DC2 in case something happens to DC1.

However, for the last 2 days this setup has unfortunately been causing outages. In the 5-node configuration, the database shuts down on a single node and I cannot find the reason. When I bring that node back up, I see that it does not participate in replication: it reports wsrep_cluster_status = non-Primary and wsrep_ready = OFF. When I shut down the 2 nodes I installed in DC2, the problem goes away; a similar situation occurs 3-4 hours after I turn the DC2 nodes back on. We had an outage last night, and I am sharing the error.log from it. I would appreciate your suggestions and guidance.
error.log (135.6 KB)
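
For reference, the values mentioned above are standard Galera status variables; a minimal health check on the node that dropped out looks like this (nothing here is specific to my setup):

```sql
-- Standard Galera/wsrep status checks on the affected node
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expected: Primary (I see non-Primary)
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- expected: ON (I see OFF)
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- expected: 5 with both DC2 nodes joined
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- e.g. Synced / Donor/Desynced / Joining
```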

> Operation CREATE USER failed for 'Reportuser'@'172.19.0.158'

There are inconsistencies between DC1 and DC2. I suggest you erase the DC2 nodes and re-create them: just stop mysql, erase the $datadir, and let node 1 in DC2 perform a full SST from DC1.
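
A minimal sketch of that procedure, assuming the systemd service is named mysql and the datadir is the default /var/lib/mysql (adjust both to your environment), run on one DC2 node at a time:

```bash
# Stop the node that will be rebuilt
sudo systemctl stop mysql

# Wiping the datadir removes grastate.dat, so the node cannot do an
# incremental (IST) transfer and must rejoin with a full SST
sudo rm -rf /var/lib/mysql/*

# On startup the node contacts the members in wsrep_cluster_address
# and pulls a complete copy of the dataset from a DC1 donor via SST
sudo systemctl start mysql
```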


First, I should point out that the user creation error in the log is a warning caused by my own faulty operations after I shut down the DC2 nodes following the problem; it is not related to WSREP. For the main problem, you suggested that the DC2 nodes should be reinstalled. Did I understand that correctly? I am asking to confirm. Also, how did you come to the conclusion that the SST was not complete?


I read your logs. The user creation error caused both of your DC2 nodes to fail. It is not a warning, it is an error. It caused a transaction conflict and a subsequent quorum vote which evicted the two DC2 nodes (error.log#L487). In my opinion, this is an unstable cluster, and you should erase both DC2 nodes and let them redo the SST process to get a consistent dataset.

What is the lag time between DC1 and DC2? Your cluster will only be as fast as the slowest link between DC1 and DC2.

Also, look at the gmcast.segment parameter to optimize DC<->DC network traffic.


Thank you very much for your reply and review, @matthewb. I don't understand what you mean by lag time: do you mean network speed, or something else, between DC1 and DC2? I'm asking so I can give you the right answer. Also, how is the gmcast.segment parameter value calculated?


https://www.google.com/search?q=network+lag

> Also, how is the gmcast.segment parameter value calculated?

It is not calculated; it is simply a label for the network segment (data center) a node sits in. Set all 3 nodes in DC1 to use segment 0, and set the 2 nodes in DC2 to use segment 1.
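
In case it helps, a minimal sketch of how this is typically set in my.cnf via wsrep_provider_options (note that this variable is a single semicolon-separated string, so keep any provider options you have already configured on the same line):

```ini
# my.cnf on the 3 DC1 nodes
[mysqld]
wsrep_provider_options="gmcast.segment=0"

# my.cnf on the 2 DC2 nodes
[mysqld]
wsrep_provider_options="gmcast.segment=1"
```

The change takes effect after restarting each node; you can confirm it afterwards with SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';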


Noted. I will make the edits on this subject and will look into it with our network admin.

Thank you for your feedback.
This information is very valuable to me.
