Random apparent reset: help deciphering

Hi,

I’m new to using clusters, but I’ve been having some success. However, every few days the whole cluster seems to reset/restart and I don’t know why. Scripts connected to any of the MySQL DBs die as a result.

I’ve attached the log from one of the servers in the cluster for one of these particular events.

In the log I changed some IP addresses for some basic obfuscation:
X.X.XXX.XXX and Y.Y.YY.YY are real DB servers with identical setups. ZZ.ZZ.ZZ.ZZ is garbd.

The attached files are logs from each server.

Can someone help me decipher these logs and work out why this is happening?

Thanks!

mysql_logs.zip (5.84 KB)

Hi,

Do you have a direct network connection between each node? It looks like when X or Y lost the connection to Z, neither could see any node but itself (cluster size 1). Z can act as a relay between X and Y, but that should only be a temporary workaround, not the normal situation.
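
If it helps, here is a minimal sketch of how to see what each node currently thinks of the cluster, assuming a standard Galera-based setup (Percona XtraDB Cluster / MariaDB Galera); these are the stock wsrep status variables, so run them on X and Y (the garbd host has no MySQL to query):

    -- How many nodes this node currently sees in its component
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
    -- 'Primary' vs. 'non-Primary' component membership
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
    -- Local node state (Synced, Donor/Desynced, Joining, ...)
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
    -- Which peers this node is actually connected to
    SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses';

Comparing these on X and Y right after an incident should show whether each node really ended up alone.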

Hi,

Thanks for your reply. Yes, X & Y are in the same data centre. Z is located offsite, on AWS.

Do you think they lost the connection to each other? Shouldn’t the local MySQL server stay up in that case? It seems to go down.

This is weird. If X and Y are in the same datacenter, losing the connection to Z alone would not cause problems, since quorum would still be preserved (more than 50% of the nodes in the cluster operational): with three nodes, X and Y together are still 2 out of 3. But in this case each node could not connect to any other node, which is why the MySQL interface had to be suspended: there was no quorum.
Apart from network issues, there could also be serious server overload, such as heavy swapping, which can result in missing or delayed communication between the nodes.
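
To narrow down which of the two it was, you could check after the next incident whether the nodes dropped to a non-Primary component, and what group-communication timeouts are in effect. A rough sketch, assuming default Galera provider options:

    -- On X and Y right after the next event:
    -- 'non-Primary' here means the node lost quorum and stopped serving queries
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
    -- The evs.* timeouts (e.g. evs.suspect_timeout, evs.inactive_timeout) decide
    -- how quickly a silent node, for example one stuck in heavy swapping, is evicted
    SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';

If the nodes were merely overloaded rather than cut off, you would typically also see signs of it in the OS (swap usage, load spikes) at the timestamps in your attached logs.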