Our production cluster had a problem with a node that suffered high packet loss. In the end, all nodes were in the ‘Initialized’ state and were therefore no longer usable.
It was easy to reproduce the problem on my test cluster. I had 4 properly connected nodes and added packet loss (30%, 50%, 65%, 80%, then 30% again, 2 minutes each) to one node. With 80% loss there was no real problem, since the node was considered unreachable and the three remaining nodes worked properly. But with 50% or 65% loss the cluster tried over and over again to connect to the node with the high packet loss. During this time the remaining nodes stayed SYNCED, but write access was blocked quite often for a few seconds.
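For anyone who wants to repeat the test, here is a minimal sketch of the loss schedule as a generated shell script. It assumes the loss is injected with Linux `tc` and the `netem` qdisc, which is only one way to do it. The interface name `eth0` and the helper name `loss_script` are hypothetical.

```python
# Loss percentages from the test described above, each held for two minutes.
LOSS_STEPS = [30, 50, 65, 80, 30]
STEP_SECONDS = 120


def loss_script(dev="eth0", steps=LOSS_STEPS, step_seconds=STEP_SECONDS):
    """Return shell lines that apply each loss step with tc/netem.

    Nothing is executed here; the lines are only built so they can be
    reviewed before being run as root on the node under test.
    Assumption: `dev` is the interface carrying the Galera traffic.
    """
    lines = []
    for i, loss in enumerate(steps):
        # The first step creates the netem qdisc, later steps modify it.
        action = "add" if i == 0 else "change"
        lines.append(f"tc qdisc {action} dev {dev} root netem loss {loss}%")
        lines.append(f"sleep {step_seconds}")
    # Remove the qdisc again so the node can rejoin without loss.
    lines.append(f"tc qdisc del dev {dev} root netem")
    return lines
```

Printing `"\n".join(loss_script())` gives a script that can be checked by eye first. Nothing touches the network until that script is run on the node.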
The strange thing is that the remaining nodes eventually lose the SYNCED status. I’ve tried this several times. It always ends with wsrep_local_state_comment = Initialized and varying wsrep_incoming_addresses values (for example: the node IP only, the node IP plus undefined for the other nodes, or completely empty), probably depending on how long the cluster has been in the Initialized state.
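The symptoms above could be caught by a small check like the following sketch. It assumes the rows come from `SHOW GLOBAL STATUS LIKE 'wsrep_%'` as (name, value) pairs; fetching them from the server is left out. The function names are hypothetical, and the exact spelling of the "undefined" entries is an assumption based on what I saw.

```python
def parse_status(rows):
    """Turn (Variable_name, Value) rows into a dict."""
    return {name: value for name, value in rows}


def node_problems(status):
    """Return a list of human-readable problems for one node's wsrep status."""
    problems = []

    # A healthy node reports "Synced"; compare case-insensitively.
    state = status.get("wsrep_local_state_comment", "")
    if state.lower() != "synced":
        problems.append(f"state is {state or 'unknown'}, expected Synced")

    # The address list is comma-separated and may be empty.
    raw = status.get("wsrep_incoming_addresses", "")
    addresses = [a.strip() for a in raw.split(",") if a.strip()]
    if not addresses:
        problems.append("wsrep_incoming_addresses is empty")

    # Assumption: unknown peers show up as entries containing "undefined".
    unknown = [a for a in addresses if "undefined" in a.lower()]
    if unknown:
        problems.append(f"{len(unknown)} peer address(es) undefined")

    return problems
```

Running this against every node on a short interval during the packet loss test would show when each node leaves the Synced state.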
My Galera settings are:
base_host = 10.112.11.10; base_port = 4567; cert.log_conflicts = no; evs.causal_keepalive_period = PT3S; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT10S; evs.inactive_timeout = PT1M; evs.info_log_mask = 0; evs.install_timeout = PT1M; evs.join_retrans_period = PT5S; evs.keepalive_period = PT3S; evs.max_install_timeouts = 1; evs.send_window = 512; evs.stats_report_period = PT1M; evs.suspect_timeout = PT30S; evs.use_aggregate = true; evs.user_send_window = 512; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 2048M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = ssl://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT10S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.112.11.10; pc.checksum = true; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.version = 0; pc.weight = 1; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3; socket.ssl = YES; socket.ssl_ca = /etc/percona/ssl/server.crt; socket.ssl_cert = /etc/percona/ssl/server.crt; socket.ssl_cipher = AES128-SHA; socket.ssl_compression = YES; socket.ssl_key = /etc/percona/ssl/server.key
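Since the option string above is hard to read, here is a rough sketch that splits it into key/value pairs and converts the `PT…` durations to seconds, so the timeouts can be compared side by side. The function names are hypothetical, and the duration parser only covers the simple hour/minute/second forms that appear in my settings.

```python
import re

# Matches simple ISO 8601 time durations such as PT3S, PT1M or PT1M30S.
_DURATION = re.compile(
    r"^PT(?:(\d+(?:\.\d+)?)H)?(?:(\d+(?:\.\d+)?)M)?(?:(\d+(?:\.\d+)?)S)?$"
)


def parse_provider_options(text):
    """Split a 'key = value; key = value' string into a dict of strings."""
    options = {}
    for item in text.split(";"):
        if "=" not in item:
            continue
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


def duration_seconds(value):
    """Return the duration in seconds, or None if value is not a PT duration."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def timeouts(options):
    """Keep only the options whose value is a duration, converted to seconds."""
    result = {}
    for key, value in options.items():
        secs = duration_seconds(value)
        if secs is not None:
            result[key] = secs
    return result
```

Applied to the settings above, this makes it easy to see, for example, that evs.suspect_timeout is 30 seconds while evs.keepalive_period is 3 seconds.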
I increased some timeouts because of the high-latency WAN connections. I assume that with lower timeouts, lower packet loss would cause similar problems. So the question is: why does the cluster go into the Initialized state in the first place? I assume that lowering the reconnect retries/timeouts would improve things, but would not completely avoid the problem.
The server log files:
https://fex.zu-con.org/fop/a1eNbAIW/issue-db2.err.txt (node with packet loss; it also reconnects at the end, after the loss is disabled)