Joining cluster fails because of SST timeout

Dree · July 31, 2017, 12:50am

I’m running into the same problem as this topic:
https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/44077-xtradb-cluster-keeps-failing-to-join-cluster

SST is stopped after 9000 seconds, so new nodes can no longer join my cluster: the dataset is now too big to transfer in that time.
Is there a fix for this yet?

bdelmedico · August 2, 2017, 12:27pm

Man, change the systemd timeout

Dree · August 3, 2017, 12:30am

It’s not the systemd timeout, it’s a timeout in SST.

bdelmedico · August 3, 2017, 7:32am

Haha, I’ll tell you a little about my replication problems, hope it helps.

I had a lot of timeout problems:

1 - systemd timeout on mysql start
2 - With Rene’s help I discovered that the next bottleneck was my firewall: when its CPU hit 100% it caused timeouts and dropped all the connections.
3 - I closed a VPC with AWS
4 - timeout settings in my.cnf (wsrep_provider_options = " gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=512; evs.inactive_timeout = PT90S; evs.suspect_timeout = PT30S; evs.install_timeout = PT60S; evs.keepalive_period = PT6S; evs.max_install_timeouts = 8 ")
5 - memory configuration problems in the joiner server’s my.cnf
6 - to run without crashes I upgraded the internet link from 10Mb to 50Mb.
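
For anyone comparing these values against their own, here is a minimal Python sketch that splits the option string from point 4 into key/value pairs and prints the timeouts in seconds. The helper names (`parse_provider_options`, `duration_seconds`) are made up for this illustration and are not part of Galera or Percona tooling, and the duration parser only handles the seconds-only `PT…S` form used here.

```python
import re

# The option string from point 4, copied verbatim.
WSREP_PROVIDER_OPTIONS = (
    " gcs.max_packet_size=1048576; evs.send_window=512; "
    "evs.user_send_window=512; evs.inactive_timeout = PT90S; "
    "evs.suspect_timeout = PT30S; evs.install_timeout = PT60S; "
    "evs.keepalive_period = PT6S; evs.max_install_timeouts = 8 "
)

def parse_provider_options(raw):
    """Split a 'key=value; key=value' string into a dict of stripped strings."""
    options = {}
    for item in raw.split(";"):
        if not item.strip():
            continue  # skip empty fragments such as a trailing ';'
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options

def duration_seconds(value):
    """Convert a seconds-only ISO 8601 duration such as 'PT90S' to a float."""
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return float(match.group(1))

options = parse_provider_options(WSREP_PROVIDER_OPTIONS)
for name in ("evs.keepalive_period", "evs.suspect_timeout",
             "evs.install_timeout", "evs.inactive_timeout"):
    print(f"{name}: {duration_seconds(options[name]):.0f} s")
```

This only reads the string; it does not check whether the values suit a given cluster.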

I think that was everything, haha, but it solved my problems. Today my 80G database takes 240 minutes to replicate everything, without generating a single warning line in the logs.
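
As a rough sanity check on those numbers, the sketch below estimates the ideal transfer time for the dataset over both link speeds and compares it with the 9000-second figure from the first post. It assumes "80G" means 80 GiB and "Mb" means megabits per second, and it ignores compression, protocol overhead and disk speed, so treat the output as a lower bound only.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to push size_bytes over a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_second

DATASET_BYTES = 80 * 1024**3   # assumption: "80G" means 80 GiB
SST_LIMIT_SECONDS = 9000       # figure reported in the first post

for mbit in (10, 50):
    seconds = transfer_seconds(DATASET_BYTES, mbit * 1_000_000)
    verdict = "within" if seconds <= SST_LIMIT_SECONDS else "over"
    print(f"{mbit} Mbit/s: {seconds / 60:.0f} min ({verdict} the 9000 s figure)")
```

Under these assumptions the 50 Mbit/s case comes out at roughly 229 minutes, which is close to the 240 minutes reported.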

Besides that, I tuned the operating system:
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535

fs.file-max=200000
kernel.sem=250 32000 100 1024
kernel.shmmax=4294967295

net.ipv4.tcp_retries2 = 2

#net.ipv4.tcp_syn_retries = 0
net.ipv4.tcp_synack_retries = 0

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 2

vm.swappiness = 0
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
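
To see how a server differs from the list above without changing anything, a small read-only Python sketch like this one can compare the wanted values with what the kernel currently reports under `/proc/sys`. It is Linux-only, the function names are made up for this illustration, and `SETTINGS` holds just a few of the lines above as a sample.

```python
from pathlib import Path

# A few of the settings listed above; paste the rest in the same format.
SETTINGS = """
net.core.somaxconn = 1024
net.ipv4.tcp_rmem = 4096 12582912 16777216
fs.file-max=200000
#net.ipv4.tcp_syn_retries = 0
vm.swappiness = 0
"""

def parse_sysctl(text):
    """Return {key: [value tokens]} from sysctl.conf-style lines."""
    wanted = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        key, _, value = line.partition("=")
        wanted[key.strip()] = value.split()
    return wanted

def proc_path(key):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path("/proc/sys") / key.replace(".", "/")

for key, expected in parse_sysctl(SETTINGS).items():
    try:
        current = proc_path(key).read_text().split()
    except OSError:
        print(f"{key}: not readable here")
        continue
    if current == expected:
        print(f"{key}: ok")
    else:
        print(f"{key}: differs (current: {' '.join(current)})")
```

Values are compared as whitespace-separated tokens because the kernel separates multi-value settings such as `tcp_rmem` with tabs.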

Dree · August 3, 2017, 8:20am

I just switched to MariaDB Galera Cluster, which doesn’t seem to have this timeout. It’s working fine on that.

Thanks for your very detailed answer though!
