Cluster behavior with high packet loss

zucon · October 9, 2013, 2:50am

Hi,

Our production cluster had a problem with a node that suffered high packet loss. In the end, all nodes were in the ‘Initialized’ state and therefore no longer usable.

It was easy to reproduce the problem on my test cluster. I had 4 properly connected nodes. I added packet loss to one node (30%, 50%, 65%, 80%, then 30% again, 2 minutes each). With 80% loss there was no real problem, since the node was considered unreachable and the three remaining nodes worked properly. But with 50% or 65% loss the cluster tried over and over again to connect to the node with the high packet loss. During this time the remaining nodes stayed SYNCED, but write access was often blocked for a few seconds.
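For anyone who wants to repeat the test, the loss schedule above can be scripted. This is only a minimal sketch: the post does not say which tool was used, so Linux `tc` with the `netem` qdisc and the interface name `eth0` are assumptions, and the function only builds the commands without running them.

```python
# Sketch: build tc/netem commands for the loss schedule described above.
# Assumptions (not from the original post): Linux `tc` with the netem qdisc,
# and a hypothetical interface name "eth0".

LOSS_SCHEDULE = [30, 50, 65, 80, 30]  # percent loss, in the order tested
STEP_SECONDS = 120                    # 2 minutes per step


def netem_commands(iface="eth0", schedule=LOSS_SCHEDULE):
    """Return (command, hold_seconds) pairs; nothing is executed here."""
    steps = []
    for i, loss in enumerate(schedule):
        # The first step adds the qdisc, later steps change it in place.
        verb = "add" if i == 0 else "change"
        cmd = ["tc", "qdisc", verb, "dev", iface, "root",
               "netem", "loss", f"{loss}%"]
        steps.append((cmd, STEP_SECONDS))
    # The final step removes the qdisc to restore normal networking.
    steps.append((["tc", "qdisc", "del", "dev", iface, "root"], 0))
    return steps


if __name__ == "__main__":
    for cmd, hold in netem_commands():
        print(" ".join(cmd), f"# hold {hold}s")
```

Each command would have to be run as root on the affected node (for example via `subprocess.run`), waiting the hold time between steps.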

The strange thing is that eventually the remaining nodes lose the SYNCED status. I’ve tried this several times. It always ends with wsrep_local_state_comment = Initialized and varying wsrep_incoming_addresses values (for example: only the node’s own IP, the node’s own IP plus ‘undefined’ for the other nodes, or completely empty), probably depending on how long the cluster has been in the Initialized state.
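The symptoms above can be checked mechanically from the output of `SHOW STATUS LIKE 'wsrep_%'`. The following is a rough sketch, not a complete health check: the status variable names are standard Galera ones, but the usability rule and the function names are simplifying assumptions, and the status values are expected as a plain dict of strings.

```python
# Sketch: interpret wsrep status values such as those described above.
# The usability rule is a simplification chosen for illustration.

def node_is_usable(status):
    """True if the node looks synced, in the primary component, and ready."""
    state = status.get("wsrep_local_state_comment", "").lower()
    cluster = status.get("wsrep_cluster_status", "").lower()
    ready = status.get("wsrep_ready", "").upper()
    return state == "synced" and cluster == "primary" and ready == "ON"


def defined_peers(incoming_addresses):
    """Split wsrep_incoming_addresses, dropping empty and 'undefined' entries."""
    parts = [p.strip() for p in incoming_addresses.split(",")]
    return [p for p in parts if p and p.lower() != "undefined"]


if __name__ == "__main__":
    # Hypothetical values resembling the failure state from the post.
    broken = {
        "wsrep_local_state_comment": "Initialized",
        "wsrep_cluster_status": "non-Primary",
        "wsrep_ready": "OFF",
        "wsrep_incoming_addresses": "10.112.11.10:3306,undefined,undefined",
    }
    print(node_is_usable(broken))
    print(defined_peers(broken["wsrep_incoming_addresses"]))
```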

My Galera settings are:

base_host = 10.112.11.10; base_port = 4567; cert.log_conflicts = no; evs.causal_keepalive_period = PT3S; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT10S; evs.inactive_timeout = PT1M; evs.info_log_mask = 0; evs.install_timeout = PT1M; evs.join_retrans_period = PT5S; evs.keepalive_period = PT3S; evs.max_install_timeouts = 1; evs.send_window = 512; evs.stats_report_period = PT1M; evs.suspect_timeout = PT30S; evs.use_aggregate = true; evs.user_send_window = 512; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 2048M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = ssl://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT10S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 10.112.11.10; pc.checksum = true; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.version = 0; pc.weight = 1; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3; socket.ssl = YES; socket.ssl_ca = /etc/percona/ssl/server.crt; socket.ssl_cert = /etc/percona/ssl/server.crt; socket.ssl_cipher = AES128-SHA; socket.ssl_compression = YES; socket.ssl_key = /etc/percona/ssl/server.key
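Because the options string is hard to read in this form, here is a small sketch that splits it into key/value pairs and converts the `PT…` durations to seconds, which makes the timeouts easier to compare. The function names are made up for illustration, and the duration parser only handles the hour/minute/second forms that appear in this kind of output.

```python
import re

# Sketch: parse a wsrep_provider_options string like the one above.

def parse_provider_options(text):
    """Split 'key = value; key = value' into a dict of stripped strings."""
    options = {}
    for item in text.split(";"):
        if "=" not in item:
            continue  # skip empty or malformed fragments
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options


_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def duration_seconds(value):
    """Convert durations such as 'PT30S' or 'PT1M' to seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


if __name__ == "__main__":
    # A few entries copied from the settings above.
    sample = ("evs.keepalive_period = PT3S; evs.suspect_timeout = PT30S; "
              "evs.inactive_timeout = PT1M; gmcast.mcast_addr = ; "
              "gmcast.peer_timeout = PT10S")
    opts = parse_provider_options(sample)
    for key, value in opts.items():
        if value.startswith("PT"):
            print(key, duration_seconds(value))
```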

I increased some timeouts because of the high-latency WAN connections. I assume that with lower timeouts, lower packet loss would cause similar problems. So the question is: why does the cluster enter the Initialized state in the first place? I assume that lowering the reconnect retries/timeouts would improve things, but wouldn’t completely avoid it.

The server log files:
https://fex.zu-con.org/fop/p8xyA34y/issue-db1.err.txt
https://fex.zu-con.org/fop/a1eNbAIW/issue-db2.err.txt (node with packet loss; it also reconnects at the end, after the loss is disabled)
https://fex.zu-con.org/fop/2QKKqco1/issue-db3.err.txt
https://fex.zu-con.org/fop/6NCs0J09/issue-db4.err.txt

Regards,
Matthias
