Joining cluster fails because of SST timeout

I’m running into the same problem as this topic:
[url]https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/44077-xtradb-cluster-keeps-failing-to-join-cluster[/url]

SST is stopped after 9000 seconds, so nodes can no longer join my cluster: the data set is now too big for the transfer to complete in time.
Is there a fix for this yet?

Man, change the systemd timeout

It’s not the systemd timeout, it’s a timeout in SST.

Haha, let me tell you a bit about my replication problems. Hope it helps.

I had a lot of timeout problems:

1 - systemd timeout on mysql start
2 - With Rene's help I discovered that the next bottleneck was my firewall: when its processor hit 100% it started generating timeouts and knocked down all the connections.
3 - I closed a VPC with AWS
4 - timeout settings within my.cnf (wsrep_provider_options = " gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=512; evs.inactive_timeout = PT90S; evs.suspect_timeout = PT30S; evs.install_timeout = PT60S; evs.keepalive_period = PT6S; evs.max_install_timeouts = 8 ")
5 - memory configuration problems in the joiner server's my.cnf
6 - to run without crashes I upgraded the internet link from 10Mb to 50Mb.
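
If it helps to read the timeouts in item 4 at a glance, here is a rough Python sketch that splits the option string and converts the PT90S-style values (ISO 8601 durations) to seconds. The helper names are made up for illustration, and the parser only covers the simple hours/minutes/seconds forms, so treat it as a reading aid rather than how Galera itself parses the string.

```python
import re

# Option string copied from item 4 of the list above.
PROVIDER_OPTIONS = (
    " gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=512; "
    "evs.inactive_timeout = PT90S; evs.suspect_timeout = PT30S; "
    "evs.install_timeout = PT60S; evs.keepalive_period = PT6S; "
    "evs.max_install_timeouts = 8 "
)

def parse_provider_options(text):
    """Split 'key = value; key = value' pairs into a dict of stripped strings."""
    options = {}
    for pair in text.split(";"):
        if not pair.strip():
            continue  # skip empty fragments such as a trailing ';'
        key, _, value = pair.partition("=")
        options[key.strip()] = value.strip()
    return options

def duration_seconds(value):
    """Convert a simple ISO 8601 duration (PT90S, PT1M30S, ...) to seconds.

    Hypothetical helper: handles hours, minutes and seconds only and
    returns None for anything that is not such a duration.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", value)
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)

options = parse_provider_options(PROVIDER_OPTIONS)
timeouts = {}
for key, value in options.items():
    seconds = duration_seconds(value)
    if seconds is not None:
        timeouts[key] = seconds

# Print the timeouts from shortest to longest.
for key, seconds in sorted(timeouts.items(), key=lambda item: item[1]):
    print(f"{key}: {seconds:g} s")
```

Sorted that way, the keepalive period (6 s) sits well below the suspect timeout (30 s), which in turn sits below the inactive timeout (90 s).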

I think that was everything, haha, but it solved my problems. Today my 80G database takes 240 minutes to replicate in full, and it doesn't generate a single warning line in the logs.
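
For a rough sense of what those numbers mean, here is a small back-of-the-envelope calculation in Python. It assumes "80G" means 80 decimal gigabytes, that the link speed is in megabits per second, and that the transfer runs at a steady average rate with no compression. None of that is stated above, so the results are only approximate.

```python
def average_rate_mbit(size_gb, minutes):
    """Average rate in Mbit/s for size_gb (decimal GB) moved in the given minutes."""
    bits = size_gb * 1e9 * 8
    return bits / (minutes * 60) / 1e6

def max_size_gb(rate_mbit, seconds):
    """Largest data set (decimal GB) that fits through rate_mbit in the given seconds."""
    return rate_mbit * 1e6 / 8 * seconds / 1e9

# 80 GB in 240 minutes, as reported above.
rate = average_rate_mbit(80, 240)
print(f"80 GB in 240 min is about {rate:.1f} Mbit/s")

# How much would get through at that rate within the 9000 s from the first post?
print(f"at that rate, 9000 s moves about {max_size_gb(rate, 9000):.0f} GB")
```

Under those assumptions the average works out to roughly 44 Mbit/s, which is close to what a 50Mb link can carry, and about 50 GB would get through in the 9000 seconds mentioned in the first post.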

Besides that, I tuned the operating system:
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 5000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_max_syn_backlog = 8096
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 65535

fs.file-max=200000
kernel.sem=250 32000 100 1024
kernel.shmmax=4294967295

net.ipv4.tcp_retries2 = 2

#net.ipv4.tcp_syn_retries = 0
net.ipv4.tcp_synack_retries = 0

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 2

vm.swappiness = 0
vm.dirty_ratio = 80
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 12000
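
To check whether settings like these are actually active on a box, something along these lines can compare a sysctl-style list against the running kernel values under /proc/sys. This is only a sketch: the function names are made up, only three of the keys above are included as a sample, and the dot-to-slash mapping assumes plain keys with no dots inside interface names.

```python
from pathlib import Path

# A few of the settings from the list above, in sysctl.conf syntax.
WANTED = """
net.core.somaxconn = 1024
net.ipv4.tcp_rmem = 4096 12582912 16777216
vm.swappiness = 0
"""

def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blank lines and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        # Collapse runs of whitespace so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings

def proc_path(key, root="/proc/sys"):
    """Map a sysctl key such as net.core.somaxconn to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")

def current_value(key, root="/proc/sys"):
    """Return the running value as a normalized string, or None if unreadable."""
    try:
        return " ".join(proc_path(key, root).read_text().split())
    except OSError:
        return None

for key, wanted in parse_sysctl(WANTED).items():
    actual = current_value(key)
    status = "ok" if actual == wanted else "differs"
    print(f"{key}: wanted {wanted!r}, running {actual!r} ({status})")
```

On a machine without /proc/sys, every key is simply reported as differing, with a running value of None.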

I just switched to MariaDB Galera Cluster, which doesn’t seem to have this timeout. It’s working fine on that.

Thanks for your very detailed answer though!