Garbd cannot join Percona 8 cluster with default SSL settings

Hi all,
I have a two-node Percona 8 cluster up and running with default SSL settings:

# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"
+----------------------------+--------------------------------------+
| Variable_name              | Value                                |
+----------------------------+--------------------------------------+
| wsrep_cluster_weight       | 2                                    |
| wsrep_cluster_capabilities |                                      |
| wsrep_cluster_conf_id      | 3                                    |
| wsrep_cluster_size         | 2                                    |
| wsrep_cluster_state_uuid   | b88ce12a-e660-11ec-ab48-124f00d72b1c |
| wsrep_cluster_status       | Primary                              |
+----------------------------+--------------------------------------+

But when I try to add an arbitrator to the cluster, I get a timeout error.
Command to start garbd:

garbd --group=uat_ossvpnsuite --address="gcomm://10.1.54.64:4567,10.1.54.65:4567" --option="socket.ssl_key=/var/lib/galera/server-key.pem;socket.ssl_cert=/var/lib/galera/server-cert.pem;socket.ssl_ca=/var/lib/galera/ca.pem;socket.ssl_cipher=AES128-SHA256"

Output:

2022-06-10 10:17:44.504  INFO: CRC-32C: using 64-bit x86 acceleration.                                                                                                                                              
2022-06-10 10:17:44.504  INFO: Read config:
        daemon:      0
        name:        garb
        address:     gcomm://10.1.54.64:4567,10.1.54.65:4567
        group:       uat_ossvpnsuite
        sst:         trivial
        donor:
        options:     socket.ssl_key=/var/lib/galera/server-key.pem;socket.ssl_cert=/var/lib/galera/server-cert.pem;socket.ssl_ca=/var/lib/galera/ca.pem;socket.ssl_cipher=AES128-SHA256; gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
        cfg:
        log:
        recv_script:

2022-06-10 10:17:44.505  WARN: Option 'gcs.fc_master_slave' is deprecated and will be removed in the future versions, please use 'gcs.fc_single_primary' instead.                                                  
2022-06-10 10:17:44.507  INFO: protonet asio version 0
2022-06-10 10:17:44.507  INFO: Using CRC-32C for message checksums.
2022-06-10 10:17:44.507  INFO: backend: asio
2022-06-10 10:17:44.507  INFO: gcomm thread scheduling priority set to other:0
2022-06-10 10:17:44.507  WARN: Fail to access the file (./gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown                
2022-06-10 10:17:44.508  INFO: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown                                                         
2022-06-10 10:17:44.508  INFO: GMCast version 0
2022-06-10 10:17:44.508  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2022-06-10 10:17:44.508  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2022-06-10 10:17:44.509  INFO: EVS version 1
2022-06-10 10:17:44.509  INFO: gcomm: connecting to group 'uat_ossvpnsuite', peer '10.1.54.64:4567,10.1.54.65:4567'                                                                                                
2022-06-10 10:17:47.511  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 186 rttvar: 93 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000607106 last_delivered_since: 3000607106 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                           
2022-06-10 10:17:47.511  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.65:4567 timed out, no messages seen in PT3S, socket stats: rtt: 334 rttvar: 167 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 3000793103 last_delivered_since: 3000793103 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:17:47.511  INFO: announce period timed out (pc.announce_timeout)
2022-06-10 10:17:47.512  INFO: EVS version upgrade 0 -> 1
2022-06-10 10:17:47.512  INFO: PC protocol upgrade 0 -> 1
2022-06-10 10:17:47.512  WARN: no nodes coming from prim view, prim not possible
2022-06-10 10:17:47.512  INFO: Current view of cluster as seen by this node
view (view_id(NON_PRIM,cdf95202-abcb,1)
memb {
        cdf95202-abcb,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2022-06-10 10:17:48.012  WARN: last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT3.50317S), skipping check                                                                                 
2022-06-10 10:17:52.012  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 271 rttvar: 135 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999936276 last_delivered_since: 2999936276 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:17:55.512  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.65:4567 timed out, no messages seen in PT3S, socket stats: rtt: 342 rttvar: 171 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499608484 last_delivered_since: 3499608484 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:17:58.513  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 180 rttvar: 90 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999752298 last_delivered_since: 2999752298 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                           
2022-06-10 10:18:02.014  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.65:4567 timed out, no messages seen in PT3S, socket stats: rtt: 318 rttvar: 159 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499663839 last_delivered_since: 3499663839 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:05.014  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 227 rttvar: 113 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999996758 last_delivered_since: 2999996758 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:08.015  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.65:4567 timed out, no messages seen in PT3S, socket stats: rtt: 360 rttvar: 180 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999639557 last_delivered_since: 2999639557 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:11.016  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 202 rttvar: 101 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999724405 last_delivered_since: 2999724405 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:14.516  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.65:4567 timed out, no messages seen in PT3S, socket stats: rtt: 360 rttvar: 180 rto: 200000 lost: 0 last_data_recv: 3500 cwnd: 10 last_queued_since: 3499270093 last_delivered_since: 3499270093 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:17.517  INFO: (cdf95202-abcb, 'tcp://0.0.0.0:4567') connection to peer 00000000-0000 with addr tcp://10.1.54.64:4567 timed out, no messages seen in PT3S, socket stats: rtt: 204 rttvar: 102 rto: 200000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 2999658372 last_delivered_since: 2999658372 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)                                          
2022-06-10 10:18:17.526  INFO: PC protocol downgrade 1 -> 0
2022-06-10 10:18:17.527  INFO: Current view of cluster as seen by this node
view ((empty))
2022-06-10 10:18:17.528 ERROR: failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)                                                       
         at gcomm/src/pc.cpp:connect():161
2022-06-10 10:18:17.528 ERROR: gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)                                                                            
2022-06-10 10:18:18.529  INFO: gcomm: terminating thread
2022-06-10 10:18:18.529  INFO: gcomm: joining thread
2022-06-10 10:18:18.529 ERROR: gcs/src/gcs.cpp:gcs_open():1758: Failed to open channel 'uat_ossvpnsuite' at 'gcomm://10.1.54.64:4567,10.1.54.65:4567': -110 (Connection timed out)                                 
2022-06-10 10:18:18.529  INFO: Shifting CLOSED -> DESTROYED (TO: 0)
2022-06-10 10:18:18.529 FATAL: Garbd exiting with error: Failed to open connection to group: 110 (Connection timed out)                                                                                            
         at garb/garb_gcs.cpp:Gcs():35

The certificates were copied from node01, and they are identical on all nodes.

Percona MySQL config:

[mysqld]
transaction-isolation = READ-COMMITTED
max_connect_errors = 9999999
wait_timeout = 9999999
bind-address  = 0.0.0.0
datadir       = /var/lib/mysql/
user          = mysql
pid-file      = /var/run/mysqld/mysqld.pid
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.size=512M"
wsrep_cluster_address=gcomm://10.1.54.64,10.1.54.65
binlog_format=ROW
wsrep_node_address=10.1.54.65
wsrep_sst_method=xtrabackup-v2
wsrep_cluster_name=uat_ossvpnsuite
wsrep_node_name=nluu-ossvpnsuitedb02
pxc_strict_mode=ENFORCING
log_timestamps=SYSTEM

Do you have any ideas how to fix the problem?

I don’t see any SSL issues in your error logs. I see typical “connection timed out” messages, which tell me there are basic port/IP/networking issues between your garbd server and the other nodes. While on the garbd node, can you telnet to any other node over port 4567?
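
If telnet is not installed on the garbd host, the same check can be sketched in shell. This is a minimal sketch, assuming bash and the coreutils `timeout` command are available; `check_port` is a hypothetical helper name, not part of any Galera tooling. Keep in mind that a successful TCP connect only shows the port is open; it says nothing about whether both sides agree on SSL.

```shell
# check_port HOST PORT: try to open a TCP connection, giving up after 3 seconds.
# Hypothetical helper; relies on bash's /dev/tcp pseudo-device.
check_port() {
    if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 NOT reachable"
        return 1
    fi
}
```

Running `check_port 10.1.54.64 4567` and `check_port 10.1.54.65 4567` on the garbd host would cover both nodes from the original command.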

Hi,
I am Alex's colleague, and there is no network/communication issue between the garbd node and the db nodes.
The arbitrator works only if pxc_encrypt_cluster_traffic = OFF (in the Galera conf file).
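
Since the arbitrator only joins with pxc_encrypt_cluster_traffic = OFF, it may help to compare the SSL-related provider options the db nodes are actually running with against what garbd is given. Below is a minimal sketch, assuming the `mysql` client can log in on a db node as in the original post; `extract_ssl_opts` is a hypothetical helper name.

```shell
# extract_ssl_opts: read a "key = value; key = value; ..." provider options
# string on stdin and print only the socket.ssl* entries, one per line.
extract_ssl_opts() {
    tr ';' '\n' | sed 's/^ *//' | grep '^socket\.ssl'
}

# Example (run on a db node):
#   mysql -N -e "SELECT @@wsrep_provider_options" | extract_ssl_opts
```

The printed key, cert, CA and cipher values can then be checked against the ones passed to garbd via --option.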

You’re missing socket.ssl=YES; in your garbd options string. Without this, garbd doesn’t attempt to connect using SSL.
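
Applied to the command from the original post, that suggestion would look roughly like this. It is a sketch only: the paths, group name and addresses are copied from the original command, `GARB_OPTS` is just an illustrative variable name, and the thread does not confirm whether this change alone fixes the join.

```shell
# Same options as the original command, with socket.ssl=YES prepended.
GARB_OPTS="socket.ssl=YES;socket.ssl_key=/var/lib/galera/server-key.pem;socket.ssl_cert=/var/lib/galera/server-cert.pem;socket.ssl_ca=/var/lib/galera/ca.pem;socket.ssl_cipher=AES128-SHA256"

# Dry run: print the command for review.
# Remove the leading "echo" to actually start garbd.
echo garbd --group=uat_ossvpnsuite \
    --address="gcomm://10.1.54.64:4567,10.1.54.65:4567" \
    --option="$GARB_OPTS"
```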

Yep, I have already added it to the configuration file on the db nodes and the arbitrator.
