Avoid SST blocking when adding a new node?

Hello,

My upgrade from 5.7 to 8.0 failed and I had to “bootstrap” a node on 8.0 to make my production available again.

But now, if I try to add a new node, it kills my primary (and only) node: the donor goes into “desync / donor mode” and becomes non-primary, so it breaks my production again…

My database is around 300 GB, so a resync takes about 25 minutes.

I’m using the default configuration with wsrep_sst_method=xtrabackup-v2.

Is there a less aggressive way to add a node without stopping the donor node?

Thanks

Yathus

This should not happen. Donor mode should not break PRIMARY status. There might be something else happening here.

In any case, you can use xtrabackup to take a backup of node1. Add --galera-info to the command. Copy the backup to node2. Prepare the backup as usual. Look at the grastate.dat file on node1 for the format, and recreate this file on node2 using the contents of xtrabackup_galera_info. Then you can start node2 and it should IST.

This should not happen. Donor mode should not break PRIMARY status. There might be something else happening here.

I have always had this problem, even when my cluster was still on 5.7. At first I thought the xinetd check was returning the wrong information, so I disabled the check in my HAProxy configuration, but client connections were still being rejected: “Communication link failure: 1047 WSREP has not yet prepared node for application use”

In any case, you can use xtrabackup to take a backup of node1. Add --galera-info to the command. Copy the backup to node2. Prepare the backup as usual. Look at the grastate.dat file on node1 for the format, and recreate this file on node2 using the contents of xtrabackup_galera_info. Then you can start node2 and it should IST.

Something like this:

#on both nodes
rm -rf /mysql-backup/*

#on primary node (the one bootstrapped)
xtrabackup --backup --target-dir=/mysql-backup/base/ --galera-info
xtrabackup --backup --target-dir=/mysql-backup/inc1/ --incremental-basedir=/mysql-backup/base/ --galera-info

#sending backup to secondary one
rsync -a /mysql-backup/ root@IP_NODE_SECONDARY:/mysql-backup

#on secondary node (the one I want to add)
xtrabackup --prepare --apply-log-only --target-dir=/mysql-backup/base
xtrabackup --prepare --target-dir=/mysql-backup/base --incremental-dir=/mysql-backup/inc1

rm -rf /var/lib/mysql/*
rsync -avrP /mysql-backup/base/ /var/lib/mysql/

#edit grastate.dat and put data from xtrabackup_galera_info
nano /var/lib/mysql/grastate.dat

chown -Rf mysql:mysql /var/lib/mysql
systemctl start mysql

:crossed_fingers:

Skip the incremental. No need for that at all.

The xtrabackup_galera_info will only have the bare-bones info. You need to recreate the entire template/format by looking at the file on node1.
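For anyone following along, here is a minimal sketch of that step. It assumes xtrabackup_galera_info holds a single `uuid:seqno` line and that the grastate.dat layout matches the 2.1 format used by this PXC 8.0 cluster; check both against your own files. The function name `make_grastate` is made up for illustration. Note that a running node's grastate.dat normally shows seqno: -1, so the seqno has to come from the backup, not from node1's file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print grastate.dat content built from a "uuid:seqno"
# string (the assumed format of xtrabackup_galera_info).
make_grastate() {
    local info="$1"
    local uuid="${info%%:*}"    # everything before the first colon
    local seqno="${info##*:}"   # everything after the last colon
    printf '# GALERA saved state\n'
    printf 'version: 2.1\n'
    printf 'uuid:    %s\n' "$uuid"
    printf 'seqno:   %s\n' "$seqno"
    # A joining node must not be marked as safe to bootstrap.
    printf 'safe_to_bootstrap: 0\n'
}

# Example with placeholder values (use the values from your own backup):
make_grastate "00000000-1111-2222-3333-444444444444:12345"

# On the new node this could be used as (paths taken from this thread):
#   make_grastate "$(head -n1 /mysql-backup/base/xtrabackup_galera_info)" \
#       > /var/lib/mysql/grastate.dat
```

Run the real command only while mysqld is stopped on the new node, and fix ownership of the file afterwards.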

Also, add this to my.cnf: wsrep_provider_options="gcache.recover=yes"
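One thing to watch: wsrep_provider_options is a single semicolon-separated string, so if my.cnf already sets it, gcache.recover=yes has to be appended to the existing value rather than put on a second line. A rough way to confirm the option took effect after a restart is to pull it out of the running value. The helper name `get_provider_option` below is hypothetical, and the parsing assumes the usual `key = value; key = value` layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one key from a wsrep_provider_options string.
get_provider_option() {
    local key="$1" options="$2"
    printf '%s\n' "$options" \
        | tr ';' '\n' \
        | sed 's/^ *//; s/ *$//' \
        | awk -F ' *= *' -v k="$key" '$1 == k { print $2 }'
}

# Example with a shortened, made-up options string:
get_provider_option gcache.recover "base_port = 4567; gcache.recover = yes; gcache.size = 128M"

# Against a live server (assumes the mysql client can log in):
#   get_provider_option gcache.recover \
#       "$(mysql -NBe 'SELECT @@GLOBAL.wsrep_provider_options')"
```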

Otherwise, yes.

Content of my “grastate.dat”:

# GALERA saved state
version: 2.1
uuid:    83673f76-d563-11ee-8507-0eb97f4c45f4
seqno:   -1
safe_to_bootstrap: 1

I have no xtrabackup_galera_info file in my /mysql-backup/inc1 or /mysql-backup/base folder after the backup, and I don’t know why. Maybe the “-1” value is a problem?

After the prepare step, I do have an xtrabackup_galera_info file in /mysql-backup/base.

It would appear that everything worked fine:

2024-02-27T16:46:59.357040Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
2024-02-27T16:46:59.357081Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.35-27.1'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel27, Revision 84d9464, WSREP version 26.1.4.3.
2024-02-27T16:46:59.358477Z 3 [Note] [MY-000000] [Galera] Recovered view from SST:
  id: 83673f76-d563-11ee-8507-0eb97f4c45f4:624904
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: -1
  members(1):
	0: 8366f356-d563-11ee-b495-8eb3cb2b9cf2, pxc3-xxx

2024-02-27T16:46:59.358494Z 3 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-27T16:46:59.358574Z 18 [Note] [MY-000000] [WSREP] Recovered cluster id 83673f76-d563-11ee-8507-0eb97f4c45f4
2024-02-27T16:46:59.360505Z 3 [Note] [MY-000000] [Galera] SST received: 83673f76-d563-11ee-8507-0eb97f4c45f4:713480
2024-02-27T16:46:59.360523Z 3 [System] [MY-000000] [WSREP] SST completed
2024-02-27T16:46:59.360556Z 1 [Note] [MY-000000] [Galera]  str_proto_ver_: 3 sst_seqno_: 713480 cc_seqno: 848198 req->ist_len(): 72
2024-02-27T16:46:59.360574Z 1 [Note] [MY-000000] [Galera] Installed new state from SST: 83673f76-d563-11ee-8507-0eb97f4c45f4:713480
2024-02-27T16:46:59.360798Z 1 [Note] [MY-000000] [Galera] Receiving IST: 134718 writesets, seqnos 713481-848198
2024-02-27T16:46:59.360829Z 0 [Note] [MY-000000] [Galera] ####### IST applying starts with 713481
2024-02-27T16:46:59.360869Z 0 [Note] [MY-000000] [Galera] ####### IST current seqno initialized to 713481
2024-02-27T16:46:59.360927Z 0 [Note] [MY-000000] [Galera] Receiving IST...  0.0% (     0/134718 events) complete.
2024-02-27T16:46:59.360955Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-02-27T16:46:59.360982Z 0 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:713480, protocol version: 5
2024-02-27T16:46:59.989636Z 0 [Note] [MY-000000] [Galera] IST preload starting at 831745
2024-02-27T16:47:00.143395Z 0 [Note] [MY-000000] [Galera] ####### Passing IST CC 848198, must_apply: 1, preload: true
2024-02-27T16:47:39.681915Z 0 [Warning] [MY-013865] [InnoDB] Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity.
2024-02-27T16:47:48.680255Z 0 [Warning] [MY-013865] [InnoDB] Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity.
2024-02-27T16:47:56.854525Z 0 [Warning] [MY-013865] [InnoDB] Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity.
2024-02-27T16:48:11.695892Z 0 [Warning] [MY-013865] [InnoDB] Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity.
2024-02-27T16:48:14.436473Z 0 [Note] [MY-000000] [Galera] REPL Protocols: 10 (5)
2024-02-27T16:48:14.437164Z 0 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 848197 -> 848198
2024-02-27T16:48:14.437208Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-02-27T16:48:14.437889Z 0 [Note] [MY-000000] [Galera] Recording CC from ist: 848198
2024-02-27T16:48:14.438555Z 0 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from ist: 831745
2024-02-27T16:48:14.438742Z 0 [Note] [MY-000000] [Galera] Min available from gcache for CC from ist: 713481
2024-02-27T16:48:14.438767Z 0 [Note] [MY-000000] [Galera] Receiving IST...100.0% (134718/134718 events) complete.
2024-02-27T16:48:14.438773Z 10 [Note] [MY-000000] [Galera] ================================================
View:
  id: 83673f76-d563-11ee-8507-0eb97f4c45f4:848198
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
	0: 8366f356-d563-11ee-b495-8eb3cb2b9cf2, pxc3-xxx
	1: cfc798a3-d58f-11ee-a2ff-12bb64b1f4ce, pxc6-xxx
=================================================
2024-02-27T16:48:14.438803Z 10 [Note] [MY-000000] [WSREP] Server status change initialized -> joined
2024-02-27T16:48:14.438817Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-27T16:48:14.438834Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-27T16:48:14.438957Z 1 [Note] [MY-000000] [Galera] Draining apply monitors after IST up to 848198
2024-02-27T16:48:14.441463Z 1 [Note] [MY-000000] [Galera] IST received: 83673f76-d563-11ee-8507-0eb97f4c45f4:848198
2024-02-27T16:48:14.441523Z 1 [Note] [MY-000000] [Galera] Recording CC from sst: 848198
2024-02-27T16:48:14.441534Z 1 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from sst: 831745
2024-02-27T16:48:14.441540Z 1 [Note] [MY-000000] [Galera] Min available from gcache for CC from sst: 713481
2024-02-27T16:48:14.441804Z 0 [Note] [MY-000000] [Galera] 1.0 (pxc6-xxx): State transfer from 0.0 (pxc3-xxx) complete.
2024-02-27T16:48:14.441838Z 0 [Note] [MY-000000] [Galera] SST leaving flow control
2024-02-27T16:48:14.441847Z 0 [Note] [MY-000000] [Galera] Shifting JOINER -> JOINED (TO: 851982)
2024-02-27T16:48:14.441889Z 0 [Note] [MY-000000] [Galera] Processing event queue:...  0.0% (   0/3811 events) complete.
2024-02-27T16:48:16.080930Z 0 [Note] [MY-000000] [Galera] Member 1.0 (pxc6-xxx) synced with group.
2024-02-27T16:48:16.080963Z 0 [Note] [MY-000000] [Galera] Processing event queue:...100.0% (3910/3910 events) complete.
2024-02-27T16:48:16.080977Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 852079)
2024-02-27T16:48:16.142616Z 12 [Note] [MY-000000] [Galera] Server pxc6-xxx synced with group
2024-02-27T16:48:16.142636Z 12 [Note] [MY-000000] [WSREP] Server status change joined -> synced
2024-02-27T16:48:16.142641Z 12 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2024-02-27T16:48:16.142654Z 12 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
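Reading the log above: the restored backup put the joiner at seqno 713480, and the donor's gcache still held everything from 713481 onward, which is why the node caught up by IST (713481 to 848198) instead of a full SST. As a rough rule, IST is possible when the donor's oldest cached seqno (the wsrep_local_cached_downto status variable) is no higher than the joiner's seqno plus one. A small sketch of that check, with a made-up function name `can_ist`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the donor's gcache still covers the
# first writeset the joiner needs (joiner_seqno + 1).
can_ist() {
    local joiner_seqno="$1" donor_cached_downto="$2"
    [ "$donor_cached_downto" -le $((joiner_seqno + 1)) ]
}

# Values taken from the log above: joiner at 713480, gcache starts at 713481.
if can_ist 713480 713481; then
    echo "IST should be possible"
else
    echo "gcache does not reach back far enough, expect a full SST"
fi

# Number of writesets transferred in that IST (first and last seqno inclusive):
echo $((848198 - 713481 + 1))
```

If the backup is old or gcache.size is small relative to the write rate, the check fails and the joiner falls back to SST.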

So I added a third node that uses the second one as its donor, and here is the log from the second node:

2024-02-27T17:01:44.704259Z 0 [Note] [MY-000000] [Galera] (cfc798a3-a2ff, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.0.0.67:4567 
2024-02-27T17:01:44.704972Z 0 [Note] [MY-000000] [Galera] (cfc798a3-a2ff, 'tcp://0.0.0.0:4567') connection established to 006bd2d3-839b tcp://10.0.0.67:4567
2024-02-27T17:01:44.722724Z 0 [Note] [MY-000000] [Galera] declaring 006bd2d3-839b at tcp://10.0.0.67:4567 stable
2024-02-27T17:01:44.722833Z 0 [Note] [MY-000000] [Galera] declaring 8366f356-b495 at tcp://10.0.0.63:4567 stable
2024-02-27T17:01:44.723229Z 0 [Note] [MY-000000] [Galera] Node 8366f356-b495 state primary
2024-02-27T17:01:44.723808Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,006bd2d3-839b,5)
memb {
	006bd2d3-839b,0
	8366f356-b495,0
	cfc798a3-a2ff,0
	}
joined {
	}
left {
	}
partitioned {
	}
)
2024-02-27T17:01:44.723887Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-02-27T17:01:44.724872Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2024-02-27T17:01:44.724964Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-02-27T17:01:45.205993Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: 00b861d6-d592-11ee-af89-9a8cdc11d91a
2024-02-27T17:01:45.206644Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: 00b861d6-d592-11ee-af89-9a8cdc11d91a from 0 (pxc7-xxx)
2024-02-27T17:01:45.206826Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: 00b861d6-d592-11ee-af89-9a8cdc11d91a from 1 (pxc3-xxx)
2024-02-27T17:01:45.206923Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: 00b861d6-d592-11ee-af89-9a8cdc11d91a from 2 (pxc6-xxx)
2024-02-27T17:01:45.206963Z 0 [Note] [MY-000000] [Galera] Quorum results:
	version    = 6,
	component  = PRIMARY,
	conf_id    = 4,
	members    = 2/3 (primary/total),
	act_id     = 888618,
	last_appl. = 888560,
	protocols  = 2/10/4 (gcs/repl/appl),
	vote policy= 0,
	group UUID = 83673f76-d563-11ee-8507-0eb97f4c45f4
2024-02-27T17:01:45.207083Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2024-02-27T17:01:45.207413Z 1 [Note] [MY-000000] [Galera] ####### processing CC 888619, local, ordered
2024-02-27T17:01:45.207488Z 1 [Note] [MY-000000] [Galera] Maybe drain monitors from 888618 upto current CC event 888619 upto:888618
2024-02-27T17:01:45.207523Z 1 [Note] [MY-000000] [Galera] Drain monitors from 888618 up to 888618
2024-02-27T17:01:45.207560Z 1 [Note] [MY-000000] [Galera] ####### My UUID: cfc798a3-d58f-11ee-a2ff-12bb64b1f4ce
2024-02-27T17:01:45.207591Z 1 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-02-27T17:01:45.207619Z 1 [Note] [MY-000000] [Galera] REPL Protocols: 10 (5)
2024-02-27T17:01:45.207650Z 1 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 888618 -> 888619
2024-02-27T17:01:45.207772Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-02-27T17:01:45.208875Z 1 [Note] [MY-000000] [Galera] ================================================
View:
  id: 83673f76-d563-11ee-8507-0eb97f4c45f4:888619
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(3):
	0: 006bd2d3-d592-11ee-839b-6a6bcf58eff1, pxc7-xxx
	1: 8366f356-d563-11ee-b495-8eb3cb2b9cf2, pxc3-xxx
	2: cfc798a3-d58f-11ee-a2ff-12bb64b1f4ce, pxc6-xxx
=================================================
2024-02-27T17:01:45.208953Z 1 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-27T17:01:45.212288Z 1 [Note] [MY-000000] [Galera] Recording CC from group: 888619
2024-02-27T17:01:45.212381Z 1 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 888561
2024-02-27T17:01:45.212437Z 1 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 713481
2024-02-27T17:01:45.814852Z 0 [Note] [MY-000000] [Galera] Member 0.0 (pxc7-xxx) requested state transfer from 'pxc6-xxx'. Selected 2.0 (pxc6-xxx)(SYNCED) as donor.
2024-02-27T17:01:45.814955Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> DONOR/DESYNCED (TO: 888656)
2024-02-27T17:01:45.815066Z 11 [Note] [MY-000000] [Galera] Detected STR version: 1, req_len: 129, req: STRv1
2024-02-27T17:01:45.815207Z 11 [Note] [MY-000000] [Galera] Cert index preload: 888561 -> 888619
2024-02-27T17:01:45.815916Z 11 [Note] [MY-000000] [WSREP] Server status change synced -> donor
2024-02-27T17:01:45.816036Z 0 [Note] [MY-000000] [Galera] async IST sender starting to serve tcp://10.0.0.67:4568 sending 888561-888619, preload starts from 888561
2024-02-27T17:01:45.816010Z 11 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-02-27T17:01:45.816469Z 0 [Note] [MY-000000] [Galera] IST sender 888561 -> 888619
2024-02-27T17:01:45.817336Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on DONOR side (wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.0.67:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '8.0.35-27.1'   '' --gtid '83673f76-d563-11ee-8507-0eb97f4c45f4:888656' )
2024-02-27T17:01:45.829867Z 11 [Note] [MY-000000] [WSREP] DONOR thread signaled with 0
2024-02-27T17:01:46.214409Z 1799 [Warning] [MY-013712] [Server] No suitable 'keyring_component_metadata_query' service implementation found to fulfill the request.
2024-02-27T17:01:47.928694Z 0 [Note] [MY-000000] [Galera] (cfc798a3-a2ff, 'tcp://0.0.0.0:4567') turning message relay requesting off
2024-02-27T17:01:57.233185Z 0 [Note] [MY-000000] [WSREP-SST] Streaming the backup to joiner at 10.0.0.67 4444
2024-02-27T17:01:57.284459Z 1818 [Warning] [MY-013712] [Server] No suitable 'keyring_component_metadata_query' service implementation found to fulfill the request.

My pxc6-xxx is now in “Donor / Desynced”, confirmed by show status like 'wsrep%';
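When checking this, it helps to look at two status variables separately: wsrep_cluster_status says whether the node belongs to the Primary component, while wsrep_local_state_comment says what the node itself is doing (Synced, Donor/Desynced, Joined, ...). A donor is expected to stay Primary while being Donor/Desynced. A small sketch that combines the two; the function name `classify_node` and the wording of its messages are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise node health from two wsrep status values.
classify_node() {
    local cluster_status="$1" local_state="$2"
    if [ "$cluster_status" != "Primary" ]; then
        # Outside the Primary component a node typically rejects queries
        # (error 1047) unless dirty reads are enabled.
        echo "NOT PRIMARY ($cluster_status)"
    elif [ "$local_state" = "Synced" ]; then
        echo "OK (Primary, Synced)"
    else
        echo "PRIMARY but $local_state"
    fi
}

# Example with the state reported in this thread:
classify_node "Primary" "Donor/Desynced"

# Against a live server (assumes the mysql client can log in):
#   cs=$(mysql -NBe "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'" | cut -f2)
#   ls=$(mysql -NBe "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | cut -f2)
#   classify_node "$cs" "$ls"
```

Health-check scripts used behind HAProxy often treat anything other than Synced as down, so a node can be Primary and still be taken out of rotation while it is a donor.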

Looks like you might be missing a keyring plugin. Do you have encrypted tables? Make sure the configs are all correct.

I have no encrypted tables.

It looks like a known bug: Warning message about keyring component metadata query service in error.log

It is still present in 8.0.32 too: https://bugs.mysql.com/bug.php?id=103684

I can reproduce the warning with this simple select:

SELECT * FROM  performance_schema.keyring_component_status;

I suggest you load the basic keyring_file plugin and see if that solves the issue.

Do you think that plugin warning can affect the SST process?

My cluster started on 5.6, was upgraded to 5.7, and now to 8.0…

It’s possible. Nothing wrong with trying what I suggested.