connection to peer ... with addr tcp://...:4567 timed out, no messages seen in PT3S

Hi. I have an Ansible playbook that takes 3 servers running Ubuntu and turns them into a 3-node Galera cluster. I’m adapting it to work in my virtual environment, and I cannot for the life of me get the cluster to work there.

I’ve tried Percona versions 5.6 and 5.7, and both result in seemingly the same problem.

Bootstrapping the first node (using /etc/init.d/mysql bootstrap-pxc) works fine. When I then start the second node, it ends up timing out:

[....] Starting mysql (via systemctl): mysql.serviceJob for mysql.service failed because a timeout was exceeded. See "systemctl status mysql.service" and "journalctl -xe" for details.

The log that stands out is:

2017-07-20T20:04:51.735257Z 0 [Note] WSREP: (aeb66534, 'tcp://0.0.0.0:4567') connection to peer aeb66534 with addr tcp://172.20.1.10:4567 timed out, no messages seen in PT3S

In this case the IP 172.20.1.10 is actually the second node itself, and the peer ID in the message (aeb66534) is the same as the node’s own ID. So it looks like it’s trying to connect to itself and then failing???

What’s worse is that, even though the service failed to start, I’m left with a whole bunch of running mysql processes:

root 18011 0.0 0.0 4512 1836 ? S 20:04 0:00 /bin/sh /usr/bin/mysqld_safe
mysql 18432 0.1 8.8 801844 181392 ? Sl 20:04 0:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/galera3/libgalera_smm.so --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --wsrep_start_position=2a208cfd-6cac-11e7-a457-ef721d184774:0
mysql 18441 0.0 0.0 4512 800 ? S 20:04 0:00 sh -c wsrep_sst_rsync --role 'joiner' --address '172.20.1.10' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '18432' '' 
mysql 18442 0.2 0.1 19912 3612 ? S 20:04 0:00 /bin/bash -ue /usr/bin/wsrep_sst_rsync --role joiner --address 172.20.1.10 --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --defaults-group-suffix --parent 18432
mysql 18526 0.0 0.0 12776 980 ? S 20:04 0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql 18549 0.0 0.1 26516 2896 ? S 20:04 0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql 18550 0.0 0.1 26516 2896 ? S 20:04 0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql 18551 0.0 0.0 26516 348 ? S 20:04 0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql 18552 0.0 0.0 26516 348 ? S 20:04 0:00 rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
mysql 20018 0.0 0.0 6016 684 ? S 20:10 0:00 sleep 1

I’ve checked connectivity between the nodes… it’s good.
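
For anyone repeating this kind of check, here is a minimal Python sketch of a TCP reachability test. It assumes the default PXC ports (3306 for MySQL, 4444 for SST, 4567 for group communication, 4568 for IST); the helper names port_open and check_peer are made up for illustration.

```python
import socket

# Default PXC/Galera ports (assumption: none were changed in my.cnf).
GALERA_PORTS = (3306, 4444, 4567, 4568)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_peer(host, ports=GALERA_PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port) for port in ports}


# Example (run from node1 against node2):
#   for port, ok in check_peer("172.20.1.10").items():
#       print(port, "open" if ok else "closed/unreachable")
```

Keep in mind that a check like this only exchanges small packets, so it says nothing about whether large frames make it across the network.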

I don’t know what else to check. I tried bumping the connection timeout from 3 seconds to 30 seconds and that didn’t change anything… so it’s just plain broken.

Is there anything I can do to increase the verbosity of the debug messages?

Thank you in advance.

Log of node2:
http://paste.ubuntu.com/25135077/

Log of node1 (the bootstrapped node):
http://paste.ubuntu.com/25135129/
Around this timeframe: 2017-07-20 20:47

User jwh on the #percona IRC channel helped me find my problem. The interfaces I was using for Galera had an MTU of 9000. Even though this MTU is correct (and the overlay network is configured correctly), for some reason Percona gcomm messages seem to get stuck. I’m not sure whether this is a bug or works as designed.
For my use case I switched to an MTU of 1500 and everything works great.
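
If you want to check whether full-size frames actually pass between the nodes, a rough Python sketch is below. It reads the interface MTU from sysfs on Linux and builds a ping command with the don't-fragment flag (-M do, from iputils ping) and the largest ICMP payload that fits in one IPv4 packet (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header). The interface name eth1 and the helper names are only placeholders, not something from my setup.

```python
from pathlib import Path

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Read the configured MTU of a Linux network interface from sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def df_ping_command(host, mtu, count=3):
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_icmp_payload(mtu)), host]


# Example (hypothetical interface name):
#   mtu = read_mtu("eth1")
#   print(" ".join(df_ping_command("172.20.1.10", mtu)))
```

If that ping fails at the interface MTU but succeeds with a smaller payload, something on the path is dropping the larger frames.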