Garbd cannot join cluster?

Hey everyone,

I am pretty new to database administration and have started playing with MariaDB + Galera.
First I manually built a standard 3-node cluster, which worked great; replication was fine.

Then I tried to create a 2 node + garbd environment like this:

  • Galera cluster: 2 MariaDB 10.5 nodes + garbd 4.7. All nodes are on CentOS 7.9 (latest), SELinux/firewall off.

  • node 1: 10.10.38.13

  • node 2: 10.10.38.14

  • garb : 10.10.38.11

Those are all deployed via ClusterControl (SSL is enabled etc.).

Nodes are synced but garb is reporting:

  • connection time out and cannot join the cluster

  • no nodes coming from prim view, prim not possible.

My knowledge of MySQL clusters is basic and this is a test/learning environment, so I may be missing something.
I tried pulling the SSL certificates/keys/CAs and launching garbd manually, but the result was the same.

In the node logs (/var/log/mysql/mysqd.log) there is no sign of connection attempts.

On the witness:

/etc/garb.conf

address = gcomm://10.10.38.13:4567,10.10.38.14:4567
group = TEST-CL
options = gmcast.listen_addr=tcp://0.0.0.0:4567;socket.ssl_cert=/etc/mysql/certs/galera_rep.crt;socket.ssl_key=/etc/mysql/certs/galera_rep.key;socket.ssl_cipher=AES128-SHA
log = /var/log/garbd.log

Here is garbd log:

2021-04-09 05:33:35.013 INFO: Read config:
daemon: 1
name: garb
address: gcomm://10.10.38.13:4567,10.10.38.14:4567
group: TEST-CL
sst: trivial
donor:
options: gmcast.listen_addr=tcp://0.0.0.0:4567;socket.ssl_cert=/etc/mysql/certs/galera_rep.crt;socket.ssl_key=/etc/mysql/certs/galera_rep.key;socket.ssl_cipher=AES128-SHA; gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
cfg: /etc/garbd.cnf
log: /var/log/garbd.log

2021-04-09 05:33:35.017 INFO: protonet asio version 0
2021-04-09 05:33:35.017 INFO: Using CRC-32C for message checksums.
2021-04-09 05:33:35.017 INFO: backend: asio
2021-04-09 05:33:35.018 INFO: gcomm thread scheduling priority set to other:0
2021-04-09 05:33:35.018 WARN: access file(./gvwstate.dat) failed(No such file or directory)
2021-04-09 05:33:35.018 INFO: restore pc from disk failed
2021-04-09 05:33:35.018 INFO: GMCast version 0
2021-04-09 05:33:35.018 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:4567
2021-04-09 05:33:35.018 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 1
2021-04-09 05:33:35.018 INFO: EVS version 1
2021-04-09 05:33:35.018 INFO: gcomm: connecting to group ‘TEST-CL’, peer ‘10.10.38.13:4567,10.10.38.14:4567’
2021-04-09 05:33:38.020 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 618 rttvar: 309 rto: 201000 lost: 0 last_data_recv: 49249008 cwnd: 10 last_queued_since: 3000295346 last_delivered_since: 3000295346 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:38.020 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out, no messages seen in PT3S, socket stats: rtt: 414 rttvar: 207 rto: 200000 lost: 0 last_data_recv: 49249009 cwnd: 10 last_queued_since: 3000732789 last_delivered_since: 3000732789 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:38.020 INFO: EVS version upgrade 0 → 1
2021-04-09 05:33:38.021 INFO: PC protocol upgrade 0 → 1
2021-04-09 05:33:38.021 WARN: no nodes coming from prim view, prim not possible
2021-04-09 05:33:38.021 INFO: view(view_id(NON_PRIM,5d4c57ba-ba6b,1) memb {
5d4c57ba-ba6b,0
} joined {
} left {
} partitioned {
})
2021-04-09 05:33:38.521 WARN: last inactive check more than PT1.5S ago (PT3.50243S), skipping check
2021-04-09 05:33:42.021 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 654 rttvar: 327 rto: 200000 lost: 0 last_data_recv: 49253009 cwnd: 10 last_queued_since: 2999647151 last_delivered_since: 2999647151 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:45.022 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out, no messages seen in PT3S, socket stats: rtt: 569 rttvar: 284 rto: 201000 lost: 0 last_data_recv: 49256010 cwnd: 10 last_queued_since: 2999626360 last_delivered_since: 2999626360 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:48.027 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 567 rttvar: 283 rto: 200000 lost: 0 last_data_recv: 49259016 cwnd: 10 last_queued_since: 3004261735 last_delivered_since: 3004261735 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:51.526 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out, no messages seen in PT3S, socket stats: rtt: 428 rttvar: 214 rto: 200000 lost: 0 last_data_recv: 49262515 cwnd: 10 last_queued_since: 3497745923 last_delivered_since: 3497745923 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:54.527 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 607 rttvar: 303 rto: 200000 lost: 0 last_data_recv: 49265516 cwnd: 10 last_queued_since: 2999352449 last_delivered_since: 2999352449 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:33:57.528 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out, no messages seen in PT3S, socket stats: rtt: 422 rttvar: 211 rto: 200000 lost: 0 last_data_recv: 49268516 cwnd: 10 last_queued_since: 2999620548 last_delivered_since: 2999620548 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:34:00.529 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 301 rttvar: 150 rto: 200000 lost: 0 last_data_recv: 49271518 cwnd: 10 last_queued_since: 3000628968 last_delivered_since: 3000628968 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:34:03.530 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out, no messages seen in PT3S, socket stats: rtt: 356 rttvar: 178 rto: 200000 lost: 0 last_data_recv: 49274519 cwnd: 10 last_queued_since: 2999730135 last_delivered_since: 2999730135 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:34:06.531 INFO: (5d4c57ba-ba6b, ‘tcp://0.0.0.0:4567’) connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out, no messages seen in PT3S, socket stats: rtt: 502 rttvar: 251 rto: 200000 lost: 0 last_data_recv: 49277520 cwnd: 10 last_queued_since: 2999591262 last_delivered_since: 2999591262 send_queue_length: 0 send_queue_bytes: 0
2021-04-09 05:34:08.037 INFO: PC protocol downgrade 1 → 0
2021-04-09 05:34:08.037 INFO: view((empty))
2021-04-09 05:34:08.038 ERROR: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
at /home/buildbot/buildbot/build/gcomm/src/pc.cpp:connect():160
2021-04-09 05:34:08.038 ERROR: /home/buildbot/buildbot/build/gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2021-04-09 05:34:08.038 ERROR: /home/buildbot/buildbot/build/gcs/src/gcs.cpp:gcs_open():1632: Failed to open channel ‘TEST-CL’ at ‘gcomm://10.10.38.13:4567,10.10.38.14:4567’: -110 (Connection timed out)
2021-04-09 05:34:08.038 INFO: Shifting CLOSED → DESTROYED (TO: 0)
2021-04-09 05:34:08.039 FATAL: Exception in creating receive loop: Failed to open connection to group: 110 (Connection timed out)
at /home/buildbot/buildbot/build/garb/garb_gcs.cpp:Gcs():35

I also tried a manual garbd deployment with no SSL encryption for replication on the databases (so no need to define garbd certificates/keys), but the result was still the same.

I tried doing this on MariaDB 10.4 as well, because I thought the version might be the issue, but unfortunately the result was the same.

Any advice is appreciated.

Are both nodes (.13 and .14) already online and bootstrapped? The garbd messages clearly state that connections to .13 and .14 are timing out. So either A) firewall is not really off, or B) mysql/galera isn’t running on .13 and .14
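
A quick way to test option A from the garbd host is a plain TCP connect to port 4567 on each node. The sketch below is only illustrative: the helper name `can_connect` and the 3-second timeout are assumptions, not something from the original setup.

```python
import socket


def can_connect(host, port=4567, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("10.10.38.13")` and `can_connect("10.10.38.14")` on the garbd machine would show whether the port is reachable at all. Note that a successful connect only proves the port is open; it says nothing about whether the SSL settings on both sides match.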

The firewall is completely off.
Galera port 4567 is in the listening state on both .13 and .14.

Also, when I added a third node on .15, it joined without any issue.
So for now I have decided to go with a standard 3-node cluster, but I am still interested in why this behaviour is happening.

If you have SSL enabled on your cluster, then your SSL config for garbd is not correct. The log shows timeouts on “tcp://” and not “ssl://”, which is what you would see if SSL were properly configured. Here is how we use garbd in our official PXC training:

garbd -a gcomm://mysql1,mysql3 -g mycluster \
  --option="socket.ssl_key=/etc/ssl/mysql/server-key.pem; \
  socket.ssl_cert=/etc/ssl/mysql/server-cert.pem; \
  socket.ssl_ca=/etc/ssl/mysql/ca.pem; \
  socket.ssl_cipher=AES128-SHA256"
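
To compare an existing options string against the example above, a small parser can help. This is a rough sketch: the function names are made up for illustration, and treating the three keys from the example (key, cert, CA) as the expected set is an assumption, since whether `socket.ssl_ca` is strictly required may depend on the setup.

```python
# Keys taken from the garbd example in this reply; treating all three
# as expected is an assumption for illustration.
EXPECTED_SSL_KEYS = ("socket.ssl_key", "socket.ssl_cert", "socket.ssl_ca")


def parse_garbd_options(options):
    """Split a 'key=value; key=value' garbd options string into a dict."""
    parsed = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so values like tcp://... stay intact.
        key, _, value = item.partition("=")
        parsed[key.strip()] = value.strip()
    return parsed


def missing_ssl_keys(options):
    """Return the expected SSL keys that are absent or empty."""
    parsed = parse_garbd_options(options)
    return [k for k in EXPECTED_SSL_KEYS if not parsed.get(k)]


# The options line from the /etc/garb.conf posted at the top of the thread.
original = (
    "gmcast.listen_addr=tcp://0.0.0.0:4567;"
    "socket.ssl_cert=/etc/mysql/certs/galera_rep.crt;"
    "socket.ssl_key=/etc/mysql/certs/galera_rep.key;"
    "socket.ssl_cipher=AES128-SHA"
)
print(missing_ssl_keys(original))  # ['socket.ssl_ca']
```

For the posted config this flags `socket.ssl_ca` as the one key present in the example but absent from the original options line.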

Thank you Matthewb!

I will test it next week but I am pretty sure what you say is the issue with my config.
Cheers!

Hi,

I had the very same problem: Ubuntu 20.04 with MariaDB 10.5 on 2 nodes and garbd on a third.

The problem was in fact the garb binary from MariaDB repository: it seems to have SSL deactivated (even if the binary is linked to OpenSSL). All socket.ssl parameters are ignored.

I just had to remove the MariaDB repository and install the arbitrator from Ubuntu, and it worked.

Failing one (from MariaDB): 26.4.8, working one (from Ubuntu): 26.4.3-4.
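
One way to spot this in a garbd log, going by the earlier remark in this thread that peer addresses should show “ssl://” rather than “tcp://” when SSL is active, is to count the address schemes in the peer lines. The sketch below is an illustration only: `peer_schemes` is a made-up helper, and the sample lines are abbreviated from the log posted above.

```python
import re
from collections import Counter

# Matches the scheme in log fragments such as "with addr tcp://10.10.38.13:4567".
ADDR_RE = re.compile(r"with addr (\w+)://")


def peer_schemes(log_text):
    """Count the address schemes garbd used when contacting peers."""
    return Counter(ADDR_RE.findall(log_text))


# Abbreviated lines from the garbd log earlier in this thread.
sample = (
    "connection to peer 00000000-0000 with addr tcp://10.10.38.13:4567 timed out\n"
    "connection to peer 00000000-0000 with addr tcp://10.10.38.14:4567 timed out\n"
)
print(peer_schemes(sample))  # Counter({'tcp': 2})
```

If only `tcp` shows up even though socket.ssl options are set, that would be consistent with the options being ignored by the binary.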

Regards
