The node does not join back into the cluster

Pavlo_Tkachenko · October 27, 2023, 9:16am

I have a cluster of 3 nodes, I started receiving error messages WSREP has not yet prepared node for application use.

After analyzing the problem, I noticed that yesterday server3 disconnected and works separately, I restarted mysql on server3, but server3 does not connect, instead I get an error

Job for mysql.service failed because the control process exited with error code.
See "systemctl status mysql.service" and "journalctl -xeu mysql.service" for details.

Logs:

2023-10-27T09:11:50.698873Z 0 [Warning] [MY-011068] [Server] The syntax 'log_slow_slave_statements' is deprecated and will be removed in a future release. Please use log_slow_replica_statements instead.2023-10-27T09:11:50.698890Z 0 [Warning] [MY-011068] [Server] The syntax 'wsrep_slave_threads' is deprecated and will be removed in a future release. Please use wsrep_applier_threads instead.
2023-10-27T09:11:50.699557Z 0 [Warning] [MY-010097] [Server] Insecure configuration for --secure-log-path: Current value does not restrict location of generated files. Consider setting it to a valid, non-empty path.
2023-10-27T09:11:50.699771Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.33-25.1) starting as process 2039913
2023-10-27T09:11:50.706421Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2023-10-27T09:11:50.706434Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2023-10-27T09:11:50.706439Z 0 [Note] [MY-000000] [WSREP] New joining cluster node configured to use specified SSL artifacts
2023-10-27T09:11:50.706455Z 0 [Note] [MY-000000] [Galera] Loading provider /usr/lib/galera4/libgalera_smm.so initial position: 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18:16050461
2023-10-27T09:11:50.706468Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library '/usr/lib/galera4/libgalera_smm.so'
2023-10-27T09:11:50.706763Z 0 [Note] [MY-000000] [Galera] wsrep_load(): Galera 4.15(6983fb2) by Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>) loaded successfully.
2023-10-27T09:11:50.706774Z 0 [Note] [MY-000000] [Galera] CRC-32C: using 64-bit x86 acceleration.
2023-10-27T09:11:50.706876Z 0 [Warning] [MY-000000] [Galera] SSL compression is not effective. The option socket.ssl_compression is deprecated and will be removed in future releases.
2023-10-27T09:11:50.706880Z 0 [Warning] [MY-000000] [Galera] Parameter 'socket.ssl_compression' is deprecated and will be removed in future versions
2023-10-27T09:11:50.707926Z 0 [Note] [MY-000000] [Galera] Found saved state: 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18:16050461, safe_to_bootstrap: 0
2023-10-27T09:11:50.707961Z 0 [Note] [MY-000000] [Galera] GCache DEBUG: opened preamble:
Version: 2
UUID: 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18
Seqno: 15857919 - 16050461
Offset: 127383736
Synced: 1
EncVersion: 1
Encrypted: 0
MasterKeyConst UUID: 4e02820b-56ee-11ee-b93e-e627301e8e8e
MasterKey UUID: 00000000-0000-0000-0000-000000000000
MasterKey ID: 0
2023-10-27T09:11:50.707965Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: version: 2, UUID: 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18, offset: 127383736
2023-10-27T09:11:50.708005Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2023-10-27T09:11:50.735591Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-10-27T09:11:50.736317Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: found gapless sequence 15857919-16050461
2023-10-27T09:11:50.736331Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer unused buffers scan...  0.0% (        0/134215904 bytes) complete.
2023-10-27T09:11:50.739229Z 0 [Note] [MY-000000] [Galera] GCache::RingBuffer unused buffers scan...100.0% (134215904/134215904 bytes) complete.
2023-10-27T09:11:50.739233Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: found 0/192543 locked buffers
2023-10-27T09:11:50.739236Z 0 [Note] [MY-000000] [Galera] Recovering GCache ring buffer: free space: 1824/134217728
2023-10-27T09:11:50.745133Z 0 [Note] [MY-000000] [Galera] Passing config to GCS: allocator.disk_pages_encryption = no; allocator.encryption_cache_page_size = 32K; allocator.encryption_cache_size = 16777216; base_dir = /var/lib/mysql/; base_host = server3; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.encryption = no; gcache.encryption_cache_page_size = 32K; gcache.encryption_cache_size = 16777216; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_auto_evict_threshold = 0.75; gcs.fc_auto_evict_window = 0; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.wait_restored_prim_timeout = PT0S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 10; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto; socket.ssl = YES; socket.ssl_ca = ca.pem; socket.ssl_cert = server-cert.pem; socket.ssl_cipher = ; socket.ssl_compression = YES; socket.ssl_key = server-key.pem; socket.ssl_reload = 1; 
2023-10-27T09:11:50.751150Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2023-10-27T09:11:50.751200Z 0 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18:16050461, protocol version: -1
2023-10-27T09:11:50.751238Z 0 [Note] [MY-000000] [WSREP] Starting replication
2023-10-27T09:11:50.751243Z 0 [Note] [MY-000000] [Galera] Connecting with bootstrap option: 0
2023-10-27T09:11:50.751249Z 0 [Note] [MY-000000] [Galera] Setting GCS initial position to 087fd47b-56ee-11ee-b1f6-0e0dfb3ecf18:16050461
2023-10-27T09:11:50.751278Z 0 [Note] [MY-000000] [Galera] protonet asio version 0
2023-10-27T09:11:50.752428Z 0 [Note] [MY-000000] [Galera] Using CRC-32C for message checksums.
2023-10-27T09:11:50.752437Z 0 [Note] [MY-000000] [Galera] backend: asio
2023-10-27T09:11:50.752470Z 0 [Note] [MY-000000] [Galera] gcomm thread scheduling priority set to other:0 
2023-10-27T09:11:50.752508Z 0 [Note] [MY-000000] [Galera] Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2023-10-27T09:11:50.752511Z 0 [Note] [MY-000000] [Galera] Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2023-10-27T09:11:50.752563Z 0 [Note] [MY-000000] [Galera] GMCast version 0
2023-10-27T09:11:50.752617Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2023-10-27T09:11:50.752620Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2023-10-27T09:11:50.752720Z 0 [Note] [MY-000000] [Galera] EVS version 1
2023-10-27T09:11:50.752742Z 0 [Note] [MY-000000] [Galera] gcomm: connecting to group 'pxc-cluster', peer 'server1:,server2:,server3:'
2023-10-27T09:11:50.755112Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address ssl://server3:4567
2023-10-27T09:11:53.753024Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server2:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2488949800 cwnd: 1 last_queued_since: 2489249931426227 last_delivered_since: 2489249931426227 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:11:53.753072Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2488949800 cwnd: 1 last_queued_since: 2489249931486279 last_delivered_since: 2489249931486279 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:11:53.753091Z 0 [Note] [MY-000000] [Galera] announce period timed out (pc.announce_timeout)
2023-10-27T09:11:53.753119Z 0 [Note] [MY-000000] [Galera] EVS version upgrade 0 -> 1
2023-10-27T09:11:53.753126Z 0 [Note] [MY-000000] [Galera] PC protocol upgrade 0 -> 1
2023-10-27T09:11:53.753139Z 0 [Warning] [MY-000000] [Galera] no nodes coming from prim view, prim not possible
2023-10-27T09:11:53.753150Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,dd14e23e-9dee,1)
memb {
        dd14e23e-9dee,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2023-10-27T09:11:54.253193Z 0 [Warning] [MY-000000] [Galera] last inactive check more than PT1.5S (3*evs.inactive_check_period) ago (PT3.50047S), skipping check
2023-10-27T09:11:58.253216Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server2:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488954300 cwnd: 1 last_queued_since: 2489254431608847 last_delivered_since: 2489254431608847 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:01.753319Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488957800 cwnd: 1 last_queued_since: 2489257931716138 last_delivered_since: 2489257931716138 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:05.253430Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server2:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488961300 cwnd: 1 last_queued_since: 2489261431836525 last_delivered_since: 2489261431836525 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:08.253556Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 2000000 lost: 1 last_data_recv: 2488964300 cwnd: 1 last_queued_since: 2489264431950578 last_delivered_since: 2489264431950578 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:11.753677Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server2:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488967804 cwnd: 1 last_queued_since: 2489267932071021 last_delivered_since: 2489267932071021 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:15.253771Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488971304 cwnd: 1 last_queued_since: 2489271432178402 last_delivered_since: 2489271432178402 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:18.253891Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server2:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488974304 cwnd: 1 last_queued_since: 2489274432295063 last_delivered_since: 2489274432295063 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:21.754030Z 0 [Note] [MY-000000] [Galera] (dd14e23e-9dee, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://server1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 0 rttvar: 250000 rto: 4000000 lost: 1 last_data_recv: 2488977804 cwnd: 1 last_queued_since: 2489277932421631 last_delivered_since: 2489277932421631 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2023-10-27T09:12:23.756130Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2023-10-27T09:12:23.756148Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2023-10-27T09:12:23.756199Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)
         at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.33-25/percona-xtradb-cluster-galera/gcomm/src/pc.cpp:connect():176
2023-10-27T09:12:23.756208Z 0 [ERROR] [MY-000000] [Galera] /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.33-25/percona-xtradb-cluster-galera/gcs/src/gcs_core.cpp:gcs_core_open():219: Failed to open backend connection: -110 (Connection timed out)
2023-10-27T09:12:24.756298Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
2023-10-27T09:12:24.756315Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread
2023-10-27T09:12:24.756412Z 0 [ERROR] [MY-000000] [Galera] /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.33-25/percona-xtradb-cluster-galera/gcs/src/gcs.cpp:gcs_open():1876: Failed to open channel 'pxc-cluster' at 'gcomm://server1,server2,server3': -110 (Connection timed out)
2023-10-27T09:12:24.756420Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out
2023-10-27T09:12:24.756426Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://server1,server2,server3) failed to establish connection with cluster (reason: 7)
2023-10-27T09:12:24.756432Z 0 [ERROR] [MY-010119] [Server] Aborting
2023-10-27T09:12:24.756604Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.33-25.1)  Percona XtraDB Cluster (GPL), Release rel25, Revision 0c56202, WSREP version 26.1.4.3.
2023-10-27T09:12:24.757019Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.
2023-10-27T09:12:24.757122Z 0 [Note] [MY-000000] [Galera] dtor state: CLOSED
2023-10-27T09:12:24.757132Z 0 [Note] [MY-000000] [Galera] MemPool(TrxHandleSlave): hit ratio: 0, misses: 0, in use: 0, in pool: 0
2023-10-27T09:12:24.758384Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0
2023-10-27T09:12:24.759521Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0
2023-10-27T09:12:24.760696Z 0 [Note] [MY-000000] [Galera] apply mon: entered 0
2023-10-27T09:12:24.760702Z 0 [Note] [MY-000000] [Galera] cert index usage at exit 0
2023-10-27T09:12:24.760706Z 0 [Note] [MY-000000] [Galera] cert trx map usage at exit 0
2023-10-27T09:12:24.760710Z 0 [Note] [MY-000000] [Galera] deps set usage at exit 0
2023-10-27T09:12:24.760715Z 0 [Note] [MY-000000] [Galera] avg deps dist 0
2023-10-27T09:12:24.760720Z 0 [Note] [MY-000000] [Galera] avg cert interval 0
2023-10-27T09:12:24.760724Z 0 [Note] [MY-000000] [Galera] cert index size 0
2023-10-27T09:12:24.760756Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2023-10-27T09:12:24.760789Z 0 [Note] [MY-000000] [Galera] wsdb trx map usage 0 conn query map usage 0
2023-10-27T09:12:24.760795Z 0 [Note] [MY-000000] [Galera] MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0
2023-10-27T09:12:24.760857Z 0 [Note] [MY-000000] [Galera] Shifting CLOSED -> DESTROYED (TO: 0)
2023-10-27T09:12:24.766762Z 0 [Note] [MY-000000] [Galera] Flushing memory map to disk...

I know it can help to stop all nodes and reinitialize again, but it is preferable not to stop the database.

Please help me understand how to attach the node back without reloading cluster
You already answered me before, but a month later we ran into it again

The server responds to ping, but not to telnet

Also, how can I analyze the problem in detail, why server3 disconnected from the cluster?

matthewb · October 27, 2023, 10:03pm

addr ssl://server2:4567 timed out, no messages seen in PT3S

If server2 is online, and telnet server2 4567 is not working/not responding, then you have some sort of firewall/network issue. This is probably the reason why server3 disconnected; it was unable to maintain network connectivity with the other members.

Pavlo_Tkachenko · February 12, 2024, 3:41pm

Hi! it happened again.1, 3 nodes work, the second cannot connect with such a log.
telnet is responding normally. I need to get the second node back without downtime

0-0000 with addr ssl://node1:4567 timed out, no messages seen in PT3S, socket stats: rtt: 214 rttvar: 82 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 1023327731776 last_delivered_since: 1023327731776 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
2024-02-12T15:11:41.430940Z 0 [Note] [MY-000000] [Galera] (4fc9b25e-8017, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://node3:4567 timed out, no messages seen in PT3S, socket stats: rtt: 118 rttvar: 45 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 1026327999362 last_delivered_since: 1026327999362 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)

Work was resumed by re-initializing the cluster. Could you please suggest how this can be prevented? We have no problems with the network, the ports were available, but without reinitialization, the second node did not want to connect

matthewb · February 12, 2024, 7:16pm

In the little logs provided, I see this repeated 2 times. This points to network issues.

Again, network issues. No heartbeats sent/received between nodes within 3 seconds creates outage situation.

Use PMM to monitor all 3 nodes’ traffic/tcp information.

Topic		Replies	Views
Node 2 and 3 not joining cluster Percona XtraDB Cluster 8.x	5	1073	March 30, 2022
After systemctl restart mysql, the node does not connect to the cluster Percona XtraDB Cluster 8.x	4	1386	September 29, 2023
Percona cluster failed to connect backend connection Percona XtraDB Cluster 5.x	1	5399	March 30, 2017
First node fails to restart after bootstrap Percona XtraDB Cluster 5.x	13	20489	April 5, 2017
All nodes in the cluster becomes inaccessible Percona XtraDB Cluster 5.x	9	5475	July 31, 2014

The node does not join back into the cluster

Related topics