Hello, Percona Community,
I’m reaching out for assistance with a perplexing SST issue in our Percona XtraDB Cluster (PXC) setup. Despite several diagnostic steps and configuration changes, SST fails consistently, regardless of which node acts as donor and regardless of whether the joiner sits across a VPN or in the same subnet. Below is a detailed overview of our architecture, the steps taken so far, and where we stand.
Cluster Architecture Summary:
- Cluster Size: 4 nodes (after adding a new node for diagnostics)
- PXC Version: 8.x
- Deployment Environment: Hybrid, spanning two OpenStack clouds, with an interconnecting VPN tunnel.
- Node Details:
- Node1 and Node2: Located in Cloud A, functioning correctly and fully synced.
- Node3: Initially the problematic node in Cloud B, failing to rejoin after a restart.
- Node4: Added within Cloud A to test if the SST issue was VPN-related.
Issue Description:
Node3’s failure to rejoin the cluster led to an SST attempt, which failed repeatedly. To rule out VPN misconfiguration, Node4 was introduced in the same cloud and subnet as Node1 and Node2. Surprisingly, Node4 hit exactly the same SST failure as Node3, indicating the problem lies beyond network or VPN configuration.
Steps Taken:
- Network Checks: Confirmed solid connectivity across all nodes, including over the VPN (see the sketch after this list).
- SST Configuration: Initially attempted to specify a preferred donor without success. Subsequent tests did not rely on forced donor selection, allowing the cluster to choose between Node1 and Node2 as donors.
- Troubleshooting Actions: Removed sst_in_progress markers, reviewed and adjusted permissions, and closely inspected the MySQL logs on both donors and joiners.
- Diagnostic Node Addition: Deployed Node4 within the same subnet as the synced nodes to rule out VPN-related issues, only to encounter identical SST failures.
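For reference, the connectivity checks looked roughly like this (a minimal sketch; the IPs are our node addresses, and 4567/4568/4444 are the standard Galera/IST/SST ports that also appear in the logs):

#!/bin/bash
# Check the standard Galera ports on each node from the current host.
# 4567 = group communication, 4568 = IST, 4444 = SST.
# Note: 4568/4444 only listen while an IST/SST is actually in progress,
# so "connection refused" there is expected on an idle node.
for host in 172.30.0.39 172.30.0.202 172.30.0.183; do
  for port in 4567 4568 4444; do
    if nc -z -w 3 "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
done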
Note that the logs and configuration below are from Node4, which was launched in the same subnet as Node1 and Node2.
Node4 Config:
[mysqld]
bind-address = 0.0.0.0
server-id=4
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_name="DAR_MySQL_cluster"
wsrep_cluster_address=gcomm://172.30.0.39,172.30.0.202,10.9.0.2
#wsrep_sst_donor="dar-mysql-cluster-node-waw3-21"
wsrep_node_name="dar-mysql-cluster-node-waw3-23"
wsrep_node_address=172.30.0.183
pxc_strict_mode=PERMISSIVE
wsrep_sst_method=xtrabackup-v2
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_provider_options="socket.ssl_key=/etc/mysql/certs/galera-node-key.pem;socket.ssl_cert=/etc/mysql/certs/galera-node-cert.pem;socket.ssl_ca=/etc/mysql/certs/new-ca.pem"
ssl-key=/etc/mysql/certs/galera-node-key.pem
ssl-ca=/etc/mysql/certs/new-ca.pem
ssl-cert=/etc/mysql/certs/galera-node-cert.pem
[sst]
encrypt=4
Here, 172.30.0.39 and 172.30.0.202 are the private IPs of Node1 and Node2, and 10.9.0.2 is the VPN address of Node3 (currently inactive).
Now, Node4 fails with:
[...]
2024-02-03T11:19:35.264492Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2024-02-03T11:19:35.264498Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> PRIMARY (TO: 65857)
2024-02-03T11:19:35.264554Z 1 [Note] [MY-000000] [Galera] ####### processing CC 65857, local, ordered
2024-02-03T11:19:35.264569Z 1 [Note] [MY-000000] [Galera] Maybe drain monitors from -1 upto current CC event 65857 upto:-1
2024-02-03T11:19:35.264575Z 1 [Note] [MY-000000] [Galera] Drain monitors from -1 up to -1
2024-02-03T11:19:35.264583Z 1 [Note] [MY-000000] [Galera] Process first view: c5e2d4d1-b610-11ee-98f3-83c308702156 my uuid: 1bc8d4de-c286-11ee-9fca-e31ba5e261cf
2024-02-03T11:19:35.264594Z 1 [Note] [MY-000000] [Galera] Server dar-mysql-cluster-node-waw3-23 connected to cluster at position c5e2d4d1-b610-11ee-98f3-83c308702156:65857 with ID 1bc8d4de-c286-11ee-9fca-e31ba5e261cf
2024-02-03T11:19:35.264600Z 1 [Note] [MY-000000] [WSREP] Server status change disconnected -> connected
2024-02-03T11:19:35.264617Z 1 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.264638Z 1 [Note] [MY-000000] [Galera] ####### My UUID: 1bc8d4de-c286-11ee-9fca-e31ba5e261cf
2024-02-03T11:19:35.264648Z 1 [Note] [MY-000000] [Galera] Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2024-02-03T11:19:35.264684Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-02-03T11:19:35.264741Z 1 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
2024-02-03T11:19:35.264760Z 1 [Note] [MY-000000] [Galera] State transfer required:
Group state: c5e2d4d1-b610-11ee-98f3-83c308702156:65857
Local state: 00000000-0000-0000-0000-000000000000:-1
2024-02-03T11:19:35.264766Z 1 [Note] [MY-000000] [WSREP] Server status change connected -> joiner
2024-02-03T11:19:35.264771Z 1 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.264843Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.0.183' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '12347' --mysqld-version '8.0.35-27.1' '' )
2024-02-03T11:19:35.473090Z 0 [Warning] [MY-000000] [WSREP-SST] Found a stale sst_in_progress file: /var/lib/mysql//sst_in_progress
2024-02-03T11:19:35.757956Z 1 [Note] [MY-000000] [WSREP] Prepared SST request: xtrabackup-v2|172.30.0.183:4444/xtrabackup_sst//1
2024-02-03T11:19:35.758007Z 1 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST
2024-02-03T11:19:35.758034Z 1 [Note] [MY-000000] [Galera] Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: c5e2d4d1-b610-11ee-98f3-83c308702156
2024-02-03T11:19:35.758046Z 1 [Note] [MY-000000] [Galera] ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 65857, STRv: 3
2024-02-03T11:19:35.758119Z 1 [Note] [MY-000000] [Galera] IST receiver addr using ssl://172.30.0.183:4568
2024-02-03T11:19:35.758150Z 1 [Note] [MY-000000] [Galera] IST receiver using ssl
2024-02-03T11:19:35.758256Z 1 [Note] [MY-000000] [Galera] Prepared IST receiver for 0-65857, listening at: ssl://172.30.0.183:4568
2024-02-03T11:19:35.759657Z 0 [Note] [MY-000000] [Galera] Member 0.0 (dar-mysql-cluster-node-waw3-23) requested state transfer from '*any*'. Selected 1.0 (dar-mysql-cluster-node-waw3-22)(SYNCED) as donor.
2024-02-03T11:19:35.759672Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY -> JOINER (TO: 65857)
2024-02-03T11:19:35.759686Z 1 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 1
2024-02-03T11:19:35.759693Z 1 [Note] [MY-000000] [Galera] Resetting GCache seqno map due to different histories.
2024-02-03T11:19:35.759701Z 1 [Note] [MY-000000] [Galera] GCache history reset: c5e2d4d1-b610-11ee-98f3-83c308702156:0 -> c5e2d4d1-b610-11ee-98f3-83c308702156:65857
2024-02-03T11:19:35.772116Z 0 [Warning] [MY-000000] [Galera] 1.0 (dar-mysql-cluster-node-waw3-22): State transfer to 0.0 (dar-mysql-cluster-node-waw3-23) failed: -125 (Operation canceled)
2024-02-03T11:19:35.772142Z 0 [ERROR] [MY-000000] [Galera] /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.35-27/percona-xtradb-cluster-galera/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1285: Will never receive state. Need to abort.
2024-02-03T11:19:35.772151Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
[...]
On the donor side, Node2 in this case (dar-mysql-cluster-node-waw3-22), I can see:
[...]
2024-02-03T11:19:35.265408Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.271112Z 10 [Note] [MY-000000] [Galera] Recording CC from group: 65857
2024-02-03T11:19:35.271129Z 10 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 57217
2024-02-03T11:19:35.271135Z 10 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 10341
2024-02-03T11:19:35.752085Z 0 [Note] [MY-000000] [Galera] Member 0.0 (dar-mysql-cluster-node-waw3-23) requested state transfer from '*any*'. Selected 1.0 (dar-mysql-cluster-node-waw3-22)(SYNCED) as donor.
2024-02-03T11:19:35.752117Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> DONOR/DESYNCED (TO: 65857)
2024-02-03T11:19:35.752201Z 10 [Note] [MY-000000] [Galera] Detected STR version: 1, req_len: 134, req: STRv1
2024-02-03T11:19:35.752243Z 10 [Note] [MY-000000] [Galera] Cert index preload: 57217 -> 65857
2024-02-03T11:19:35.763147Z 10 [Note] [MY-000000] [WSREP] Server status change synced -> donor
2024-02-03T11:19:35.763160Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.763231Z 0 [Note] [MY-000000] [Galera] async IST sender starting to serve ssl://172.30.0.183:4568 sending 57217-65857, preload starts from 57217
2024-02-03T11:19:35.763285Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.0.183:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '8.0.35-27.1' '' --gtid 'c5e2d4d1-b610-11ee-98f3-83c308702156:65857' )
2024-02-03T11:19:35.763507Z 0 [ERROR] [MY-000000] [WSREP] Error accessing server with user:mysql.pxc.internal.session@localhost
2024-02-03T11:19:35.763551Z 10 [Note] [MY-000000] [WSREP] DONOR thread signaled with -125
2024-02-03T11:19:35.763573Z 10 [Warning] [MY-000000] [Galera] SST preparation failed
2024-02-03T11:19:35.763579Z 10 [Note] [MY-000000] [WSREP] Server status change donor -> joined
2024-02-03T11:19:35.763584Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.763590Z 10 [ERROR] [MY-000000] [Galera] SST failed: 1
2024-02-03T11:19:35.764412Z 0 [Warning] [MY-000000] [Galera] 1.0 (dar-mysql-cluster-node-waw3-22): State transfer to 0.0 (dar-mysql-cluster-node-waw3-23) failed: -125 (Operation canceled)
2024-02-03T11:19:35.764427Z 0 [Note] [MY-000000] [Galera] Shifting DONOR/DESYNCED -> JOINED (TO: 65857)
2024-02-03T11:19:35.764460Z 0 [Note] [MY-000000] [Galera] Processing event queue:... -nan% (0/0 events) complete.
2024-02-03T11:19:35.765224Z 0 [Note] [MY-000000] [Galera] Member 1.0 (dar-mysql-cluster-node-waw3-22) synced with group.
2024-02-03T11:19:35.765236Z 0 [Note] [MY-000000] [Galera] Processing event queue:...100.0% (1/1 events) complete.
2024-02-03T11:19:35.765242Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 65857)
2024-02-03T11:19:35.765235Z 0 [Note] [MY-000000] [Galera] declaring 71d3d731-b2bc at ssl://172.30.0.39:4567 stable
2024-02-03T11:19:35.765279Z 0 [Note] [MY-000000] [Galera] forgetting 1bc8d4de-9fca (ssl://172.30.0.183:4567)
2024-02-03T11:19:35.765289Z 11 [Note] [MY-000000] [Galera] Server dar-mysql-cluster-node-waw3-22 synced with group
2024-02-03T11:19:35.765311Z 11 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2024-02-03T11:19:35.765318Z 11 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2024-02-03T11:19:35.765323Z 11 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.765625Z 0 [Note] [MY-000000] [Galera] Node 2a716e62-afaa state primary
2024-02-03T11:19:35.765924Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,2a716e62-afaa,44)
memb {
2a716e62-afaa,0
71d3d731-b2bc,0
}
joined {
}
left {
}
partitioned {
1bc8d4de-9fca,0
}
)
2024-02-03T11:19:35.765936Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-02-03T11:19:35.769138Z 0 [Note] [MY-000000] [Galera] forgetting 1bc8d4de-9fca (ssl://172.30.0.183:4567)
2024-02-03T11:19:35.769157Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2024-02-03T11:19:35.769215Z 0 [Note] [MY-000000] [Galera] STATE_EXCHANGE: sent state UUID: 1caedb53-c286-11ee-a659-1254012a57c6
2024-02-03T11:19:35.772813Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: 1caedb53-c286-11ee-a659-1254012a57c6
2024-02-03T11:19:35.773069Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: 1caedb53-c286-11ee-a659-1254012a57c6 from 0 (dar-mysql-cluster-node-waw3-22)
2024-02-03T11:19:35.773092Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: 1caedb53-c286-11ee-a659-1254012a57c6 from 1 (dar-mysql-cluster-node-waw3-21)
2024-02-03T11:19:35.773107Z 0 [Note] [MY-000000] [Galera] Quorum results:
version = 6,
component = PRIMARY,
conf_id = 43,
members = 2/2 (primary/total),
act_id = 65857,
last_appl. = 57216,
protocols = 2/10/4 (gcs/repl/appl),
vote policy= 0,
group UUID = c5e2d4d1-b610-11ee-98f3-83c308702156
2024-02-03T11:19:35.773149Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-02-03T11:19:35.773305Z 1 [Note] [MY-000000] [Galera] ####### processing CC 65858, local, ordered
2024-02-03T11:19:35.773321Z 1 [Note] [MY-000000] [Galera] Maybe drain monitors from 65857 upto current CC event 65858 upto:65857
2024-02-03T11:19:35.773329Z 1 [Note] [MY-000000] [Galera] Drain monitors from 65857 up to 65857
2024-02-03T11:19:35.773337Z 1 [Note] [MY-000000] [Galera] ####### My UUID: 2a716e62-bece-11ee-afaa-3b6138179185
2024-02-03T11:19:35.773342Z 1 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-02-03T11:19:35.773348Z 1 [Note] [MY-000000] [Galera] REPL Protocols: 10 (5)
2024-02-03T11:19:35.773356Z 1 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 65857 -> 65858
2024-02-03T11:19:35.773394Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-02-03T11:19:35.778714Z 1 [Note] [MY-000000] [Galera] ================================================
View:
id: c5e2d4d1-b610-11ee-98f3-83c308702156:65858
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 0
members(2):
0: 2a716e62-bece-11ee-afaa-3b6138179185, dar-mysql-cluster-node-waw3-22
1: 71d3d731-be19-11ee-b2bc-baa7e0e6a780, dar-mysql-cluster-node-waw3-21
=================================================
2024-02-03T11:19:35.778735Z 1 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-03T11:19:35.784796Z 0 [ERROR] [MY-000000] [Galera] async IST sender failed to serve ssl://172.30.0.183:4568: ist send failed: ', asio error 'error receiving handshake: 71 (Protocol error)
at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.35-27/percona-xtradb-cluster-galera/galera/src/ist_proto.hpp:recv_handshake():319': 71 (Protocol error)
at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/percona-xtradb-cluster-8.0.35-27/percona-xtradb-cluster-galera/galera/src/ist.cpp:send():862
2024-02-03T11:19:35.784993Z 0 [Note] [MY-000000] [Galera] async IST sender served
2024-02-03T11:19:35.786756Z 1 [Note] [MY-000000] [Galera] Recording CC from group: 65858
2024-02-03T11:19:35.786771Z 1 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 57217
2024-02-03T11:19:35.786778Z 1 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 10341
2024-02-03T11:19:37.311010Z 0 [Note] [MY-000000] [Galera] (2a716e62-afaa, 'ssl://0.0.0.0:4567') turning message relay requesting off
2024-02-03T11:19:40.811224Z 0 [Note] [MY-000000] [Galera] cleaning up 1bc8d4de-9fca (ssl://172.30.0.183:4567)
[...]
I have included only the parts of the logs I find important, but can provide full logs if needed. The consistent SST failure, regardless of the joiner’s location (local or across the VPN) and of the donor node (Node1 or Node2), has us puzzled. Both synced nodes are viable donors, yet both Node3 and the newly added Node4 fail in exactly the same way during SST.
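One detail that stands out to me is the donor-side error "Error accessing server with user:mysql.pxc.internal.session@localhost". If it helps, I can run a read-only check like the one below on the donor to inspect the internal PXC accounts (just a sketch of what I would run; nothing has been modified yet):

# Read-only check of the internal PXC accounts on the donor (Node2),
# including mysql.pxc.internal.session named in the error above.
mysql -u root -p -e "
  SELECT user, host, account_locked, password_expired
    FROM mysql.user
   WHERE user LIKE 'mysql.pxc%';"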
I appreciate any advice, insights, or recommendations the community can provide to help navigate and resolve this issue. Thank you for your time and support.