Latency limit for WAN 'Cluster'

Hi,

I have a test setup with three VMs (galera1, galera2 and galera3) and the following network delays between them:

[galera1] – <700ms delay> – [galera3]
[galera2] – <1ms delay> – [galera3]

The relevant Galera settings are:

wsrep_cluster_address=gcomm://192.168.122.11,192.168.122.12
wsrep_provider=/usr/lib/libgalera_smm.so
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
innodb_flush_log_at_trx_commit=0

wsrep_sst_method=rsync_wan
wsrep_sst_receive_address=192.168.122.10:4000
wsrep_node_incoming_address=192.168.122.10

While the cluster is running I have no problems despite the delay, apart from the reduced speed. When I shut down galera2 and galera1 gracefully, I can still use galera3, and the cluster status is Primary. But when I start galera1 again, I get the following error:

WSREP: Assign initial position for certification: -1, protocol version: -1
130122 16:43:04 [Note] WSREP: wsrep_sst_grab()
130122 16:43:04 [Note] WSREP: Start replication
130122 16:43:04 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
130122 16:43:04 [Note] WSREP: protonet asio version 0
130122 16:43:04 [Note] WSREP: backend: asio
130122 16:43:04 [Note] WSREP: GMCast version 0
130122 16:43:04 [Note] WSREP: (68d38005-64aa-11e2-0800-5b88e4584e10, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:4567
130122 16:43:04 [Note] WSREP: (68d38005-64aa-11e2-0800-5b88e4584e10, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 1
130122 16:43:04 [Note] WSREP: EVS version 0
130122 16:43:04 [Note] WSREP: PC version 0
130122 16:43:04 [Note] WSREP: gcomm: connecting to group ‘my_wsrep_cluster’, peer ‘192.168.122.11:,192.168.122.12:’
130122 16:43:05 [Note] WSREP: (68d38005-64aa-11e2-0800-5b88e4584e10, ‘tcp://0.0.0.0:4567’) cleaning up duplicate 0x20a0760 after established 0x208f980
130122 16:43:36 [Note] WSREP: view((empty))
130122 16:43:36 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():157
130122 16:43:36 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
130122 16:43:36 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel ‘my_wsrep_cluster’ at ‘gcomm://192.168.122.11,192.168.122.12’: -110 (Connection timed out)
130122 16:43:36 [ERROR] WSREP: gcs connect failed: Connection timed out
130122 16:43:36 [ERROR] WSREP: wsrep::connect() failed: 6
130122 16:43:36 [ERROR] Aborting

130122 16:43:36 [Note] WSREP: Service disconnected.
130122 16:43:37 [Note] WSREP: Some threads may fail to exit.
130122 16:43:37 [Note] /usr/sbin/mysqld: Shutdown complete

It seems that the node hit a PT30S timeout. I tried raising all 30-second timeouts and applied the WAN settings from the Galera wiki, but the problem persists. There is also traffic on port 4567 right up until the node reports the timeout.
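
To back up the PT30S guess, here is a rough Python sketch that measures the gap between the "connecting to group" line and the first ERROR line in the log above and compares it with an ISO 8601 duration. The helper names (parse_duration, log_time) are made up for this illustration and are not part of Galera; the parser only handles the simple PT..H..M..S forms such as PT30S, PT0.3S or PT1M.

```python
import re
from datetime import datetime, timedelta


def parse_duration(text):
    """Parse a simple ISO 8601 duration such as PT30S, PT0.3S or PT1M."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", text)
    if m is None or not any(m.groups()):
        raise ValueError("unsupported duration: %r" % text)
    hours, minutes, seconds = m.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )


def log_time(line):
    """Read the leading 'YYMMDD HH:MM:SS' timestamp of a mysqld log line."""
    return datetime.strptime(line[:15], "%y%m%d %H:%M:%S")


# Two lines taken (shortened) from the log in this post.
connect_line = "130122 16:43:04 [Note] WSREP: gcomm: connecting to group"
error_line = "130122 16:43:36 [ERROR] WSREP: failed to open gcomm backend connection"

elapsed = log_time(error_line) - log_time(connect_line)
timeout = parse_duration("PT30S")

print(elapsed.total_seconds())  # 32.0
print(elapsed >= timeout)       # True
```

The 32 seconds between the two lines is consistent with a 30-second limit plus some startup overhead, but it does not say which setting is responsible.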

Matthias

Guess I found it:

is set to PT0.3S, but the Galera wiki claims that the default value is PT1S. In any case, setting it to a higher value makes cluster joins more reliable.

Matthias