Hi ,
I have faced issue the new setup xtradb cluster joiner keeps failing.
I checked joiner “.sst” directory
[root@mysql02 .sst]# du -sh
982G .
It’s syncing data at the beginnnig , but failed on the way to cluster. ( donor has 1.3TB data )
There’s the error :
Joiner error :
2016-04-15 14:49:00 3688 [Note] WSREP: Created page /xfs/mysql/gcache.page.000141 of size 134217728 bytes
2016-04-15 14:49:37 3688 [Note] WSREP: Created page /xfs/mysql/gcache.page.000142 of size 134217728 bytes
WSREP_SST: [ERROR] Removing /xfs/mysql//.sst/xtrabackup_galera_info file due to signal (20160415 14:49:59.564)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160415 14:49:59.570)
2016-04-15 14:49:59 3688 [ERROR] WSREP: Process was aborted.
2016-04-15 14:49:59 3688 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.5.19' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '3688' '' : 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2016-04-15 14:49:59 3688 [ERROR] WSREP: SST script aborted with error 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] WSREP: SST failed: 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] Aborting
2016-04-15 14:49:59 3688 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -22 (Invalid argument)
2016-04-15 14:49:59 3688 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: terminating thread
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: joining thread
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: closing backend
2016-04-15 14:50:00 3688 [Note] WSREP: view(view_id(NON_PRIM,66a7dc34,12) memb {
66a7dc34,0
} joined {
} left {
} partitioned {
a6827774,0
})
2016-04-15 14:50:00 3688 [Note] WSREP: view((empty))
2016-04-15 14:50:00 3688 [Note] WSREP: gcomm: closed
2016-04-15 14:50:00 3688 [Note] WSREP: /usr/sbin/mysqld: Terminated.
160415 14:50:01 mysqld_safe mysqld from pid file /xfs/mysql/mysql02.pid ended
Donor error:
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | pigz | socat -u stdio TCP:10.0.5.19:4444 ; RC=( ${PIPESTATUS[@]} ) (20160415 12:21:19.695)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160415 12:21:19.718)
2016-04-15 12:21:20 51913 [Note] WSREP: (a6827774, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.5.19 4444 (20160415 12:21:29.735)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | pigz | socat -u stdio TCP:10.0.5.19:4444; RC=( ${PIPESTATUS[@]} ) (20160415 12:21:29.747)
2016-04-15 13:45:22 51913 [Note] WSREP: Created page /xfs/mysql/gcache.page.000059 of size 134217728 bytes
2016-04-15 13:45:25 51913 [Note] WSREP: Deleted page /xfs/mysql/gcache.page.000059
2016-04-15 14:45:22 51913 [Note] WSREP: Created page /xfs/mysql/gcache.page.000060 of size 134217728 bytes
2016-04-15 14:45:25 51913 [Note] WSREP: Deleted page /xfs/mysql/gcache.page.000060
2016/04/15 14:50:41 socat[2700] E write(3, 0x7fac95c3a000, 8192): Broken pipe
WSREP_SST: [ERROR] innobackupex finished with error: 1. Check /xfs/mysql//innobackup.backup.log (20160415 14:50:41.218)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20160415 14:50:41.228)
WSREP_SST: [INFO] Cleaning up temporary directories (20160415 14:50:41.248)
2016-04-15 14:50:41 51913 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927'
2016-04-15 14:50:41 51913 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927': 22 (Invalid argument)
2016-04-15 14:50:41 51913 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927'
2016-04-15 14:50:41 51913 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -22 (Invalid argument)
2016-04-15 14:50:41 51913 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 8123961)
Donnor innobackup.backup.log
160415 14:50:34 >> log scanned up to (6403744762043)
160415 14:50:35 >> log scanned up to (6403759388831)
160415 14:50:36 >> log scanned up to (6403769515297)
160415 14:50:37 >> log scanned up to (6403773073546)
160415 14:50:38 >> log scanned up to (6403776031409)
160415 14:50:39 >> log scanned up to (6403785481586)
160415 14:50:40 >> log scanned up to (6403791489704)
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[01] xtrabackup: Error: failed to copy datafile.
My Environment
OS: CentOS 7.2.1511
Package: ( all the same with donor and joiner )
[root@mysql02 ~]# rpm -qa |grep -i percona
percona-xtrabackup-2.3.4-1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.14-1.rhel7.x86_64
percona-release-0.1-3.noarch
Percona-XtraDB-Cluster-shared-56-5.6.28-25.14.1.el7.x86_64
I’m having exactly the same issue at the moment, but what I found is that if I use “rsync” as my SST for initial DB replication and once the node is synced, I can flip to “xtrabackup-v2” and things are fine after that…
Got hit with exact same problem. It seems the issue appears if databases are larger than a 1TB. I have 1.6TB on the donor trying to SST to joiner.
The process exits with the same error as topic starter:
JOINER:
Removed directory: ���/var/lib/mysql/performance_schema���
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20160918 17:19:57.001)
2016-09-18 17:19:58 26196 [Note] WSREP: (a4fbcade, 'tcp://0.0.0.0:4567') turning message relay requesting off
2016-09-18 18:55:03 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.12864S), skipping check
2016-09-18 18:55:34 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.63051S), skipping check
2016-09-18 19:15:56 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.57793S), skipping check
2016-09-18 19:16:26 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.99439S), skipping check
WSREP_SST: [ERROR] Removing /var/lib/mysql//.sst/xtrabackup_galera_info file due to signal (20160918 19:49:28.648)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160918 19:49:28.650)
DONOR:
2016/09/18 19:49:28 socat[29871] E write(3, 0x9107f0, 8192): Broken pipe
WSREP_SST: [ERROR] innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log (20160918 19:49:28.663)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20160918 19:49:28.665)
To whoever is interested in this i found the problem.
/usr/bin/mysql-systemd hardcodes timeout parameters. On line 96:
while [[ $i -lt $service_startup_timeout ]]; do
The control process waits for pid file which gets created only when mysqld successfully starts. service_startup_timeout is set to 900 seconds elsewhere in the script, unless overriden by some my.cnf option. Supposedly in [mysql.service] section, but mysqld-systemd never parses that, only [mysqld].
If you look in /usr/lib/systemd/system/mysql.service you will also see some indication of what the problem might be:
# Timeout is handled elsewhere
# service-startup-timeout in my.cnf
# Default is 900 seconds
TimeoutStartSec=0
Percona tries to provision for long sst transfer, but its rather dumb:
if [[ $verb = 'created' ]];then
if ([[ -e $sst_progress_file ]] || grep -q -- '--wsrep-new-cluster' <<< "$env_args" ) \
&& [[ $startup_sleep -ne 10 ]];then
echo "State transfer in progress, setting sleep higher"
startup_sleep=10
fi
fi
The above means if there’s SST transfer, the script will just sleep for 10 seconds every second, effectively increasing timeout to 9000 which is 2hr and 30 minutes.
On 1Gbps connection this will transfer about 1TB of data before it forcefully kills the SST.
This explains rellik’s and my problems. We both had over 1TB of data.
For me it worked when I ensured there is enough free space for innobackup move step
It was failing at this step.
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check --datadir=/var/lib/mysql/data/ --move-back --force-non-empty-directories ${DATA} &>${DATA}/innobackup.move.log (20191211 08:19:00.673)
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20191211 08:19:00.702)
Same error occurred when I tried to join another node. All nodes have same disk space of 1.5 TB for mysql data. The mysql data size was around 848G + 10G UNDO files. That left
nearly 640GB of free space. Seems like move fails due to insufficient space. I increased it to 2TB on all nodes and SST was successful.