XtraDB Cluster node keeps failing to join the cluster

Hi,
I've run into an issue where the joiner in a newly set up XtraDB Cluster keeps failing.

I checked the joiner's “.sst” directory:
[root@mysql02 .sst]# du -sh
982G .
It syncs data at the beginning, but fails partway through joining the cluster. (The donor has 1.3TB of data.)
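
For anyone following along: a simple way to tell whether the transfer is still making progress is to poll the .sst directory size on the joiner (a minimal sketch using the paths from my setup):

# Re-check the SST staging directory size every 60 seconds; if growth
# stops well short of the donor's ~1.3TB, the stream has stalled.
watch -n 60 'du -sh /xfs/mysql/.sst'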

Here are the errors:

Joiner error:
2016-04-15 14:49:00 3688 [Note] WSREP: Created page /xfs/mysql/gcache.page.000141 of size 134217728 bytes
2016-04-15 14:49:37 3688 [Note] WSREP: Created page /xfs/mysql/gcache.page.000142 of size 134217728 bytes
WSREP_SST: [ERROR] Removing /xfs/mysql//.sst/xtrabackup_galera_info file due to signal (20160415 14:49:59.564)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160415 14:49:59.570)
2016-04-15 14:49:59 3688 [ERROR] WSREP: Process was aborted.
2016-04-15 14:49:59 3688 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.5.19' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '3688' '' : 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2016-04-15 14:49:59 3688 [ERROR] WSREP: SST script aborted with error 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] WSREP: SST failed: 10 (No child processes)
2016-04-15 14:49:59 3688 [ERROR] Aborting

2016-04-15 14:49:59 3688 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -22 (Invalid argument)
2016-04-15 14:49:59 3688 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: terminating thread
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: joining thread
2016-04-15 14:49:59 3688 [Note] WSREP: gcomm: closing backend
2016-04-15 14:50:00 3688 [Note] WSREP: view(view_id(NON_PRIM,66a7dc34,12) memb {
66a7dc34,0
} joined {
} left {
} partitioned {
a6827774,0
})
2016-04-15 14:50:00 3688 [Note] WSREP: view((empty))
2016-04-15 14:50:00 3688 [Note] WSREP: gcomm: closed
2016-04-15 14:50:00 3688 [Note] WSREP: /usr/sbin/mysqld: Terminated.
160415 14:50:01 mysqld_safe mysqld from pid file /xfs/mysql/mysql02.pid ended

Donor error:

WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | pigz | socat -u stdio TCP:10.0.5.19:4444 ; RC=( ${PIPESTATUS[@]} ) (20160415 12:21:19.695)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160415 12:21:19.718)
2016-04-15 12:21:20 51913 [Note] WSREP: (a6827774, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.5.19 4444 (20160415 12:21:29.735)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | pigz | socat -u stdio TCP:10.0.5.19:4444; RC=( ${PIPESTATUS[@]} ) (20160415 12:21:29.747)
2016-04-15 13:45:22 51913 [Note] WSREP: Created page /xfs/mysql/gcache.page.000059 of size 134217728 bytes
2016-04-15 13:45:25 51913 [Note] WSREP: Deleted page /xfs/mysql/gcache.page.000059
2016-04-15 14:45:22 51913 [Note] WSREP: Created page /xfs/mysql/gcache.page.000060 of size 134217728 bytes
2016-04-15 14:45:25 51913 [Note] WSREP: Deleted page /xfs/mysql/gcache.page.000060
2016/04/15 14:50:41 socat[2700] E write(3, 0x7fac95c3a000, 8192): Broken pipe
WSREP_SST: [ERROR] innobackupex finished with error: 1. Check /xfs/mysql//innobackup.backup.log (20160415 14:50:41.218)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20160415 14:50:41.228)
WSREP_SST: [INFO] Cleaning up temporary directories (20160415 14:50:41.248)
2016-04-15 14:50:41 51913 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927'
2016-04-15 14:50:41 51913 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927': 22 (Invalid argument)
2016-04-15 14:50:41 51913 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.5.19:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'd4031c0e-f4de-11e5-b0bb-dbe7abf57cb5:7984927'
2016-04-15 14:50:41 51913 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -22 (Invalid argument)
2016-04-15 14:50:41 51913 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 8123961)

Donor innobackup.backup.log:


160415 14:50:34 >> log scanned up to (6403744762043)
160415 14:50:35 >> log scanned up to (6403759388831)
160415 14:50:36 >> log scanned up to (6403769515297)
160415 14:50:37 >> log scanned up to (6403773073546)
160415 14:50:38 >> log scanned up to (6403776031409)
160415 14:50:39 >> log scanned up to (6403785481586)
160415 14:50:40 >> log scanned up to (6403791489704)
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[01] xtrabackup: Error: failed to copy datafile.
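
Since the donor side dies with a broken pipe in socat, it may be worth testing the pigz/socat transport on its own, outside of SST, to rule out the network path. A rough sketch using the IPs from this setup (start the listener on the joiner first; adjust the count to approximate the dataset size):

# On the joiner: accept one connection on the SST port and discard the data.
socat -u TCP-LISTEN:4444,reuseaddr STDOUT > /dev/null

# On the donor: push a large synthetic stream through the same pipeline the
# SST script uses (/dev/urandom so pigz cannot compress it away). If this
# also dies mid-transfer, the problem is below MySQL.
dd if=/dev/urandom bs=1M count=1048576 | pigz | socat -u stdio TCP:10.0.5.19:4444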

My environment
OS: CentOS 7.2.1511
Packages (identical on donor and joiner):
[root@mysql02 ~]# rpm -qa |grep -i percona
percona-xtrabackup-2.3.4-1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.14-1.rhel7.x86_64
percona-release-0.1-3.noarch
Percona-XtraDB-Cluster-shared-56-5.6.28-25.14.1.el7.x86_64

Any suggestions?

Donor node my.cnf


# Template my.cnf for PXC
[mysqld]
back_log=65535
binlog_format=ROW
bulk_insert_buffer_size=8M
character_set_server=utf8
collation_server=utf8_unicode_ci
connect_timeout=15
datadir=/xfs/mysql
default_storage_engine=InnoDB
expire_logs_days=5
interactive_timeout=900
join_buffer_size=1M
key_buffer_size=16M
log_bin=db-bin
log_error=/var/log/mysql/mysql.log
long_query_time=1
max_allowed_packet=16M
max_connect_errors=4294967295
max_connections=5000
max_heap_table_size=16M
max_prepared_stmt_count=1048576
myisam_recover_options
myisam_repair_threads=1
myisam_sort_buffer_size=8M
open_files_limit=96000
query_cache_limit=2M
query_cache_min_res_unit=6144
query_cache_size=0
query_cache_type=0
read_buffer_size=2M
read_rnd_buffer_size=1M
server_id=35
skip-federated
skip_name_resolve=ON
slow_query_log=1
slow_query_log_file=/var/log/mysql/mysql-slow.log
sort_buffer_size=1M
sql_mode=STRICT_ALL_TABLES
sync_binlog=0
table_open_cache=96000
table_open_cache_instances=16
thread_cache_size=16384
thread_handling=pool-of-threads
tmp_table_size=16M
transaction_isolation=READ-COMMITTED
wait_timeout=900
# InnoDB
innodb_adaptive_hash_index=OFF
innodb_additional_mem_pool_size=8M
innodb_autoinc_lock_mode=2
innodb_buffer_pool_dump_at_shutdown=ON
innodb_buffer_pool_instances=8
innodb_buffer_pool_load_at_startup=ON
innodb_buffer_pool_populate=1
innodb_buffer_pool_size=40G
innodb_checksum_algorithm=crc32
innodb_data_file_path=ibdata1:10M:autoextend
innodb_file_format=Barracuda
innodb_file_format_max=Barracuda
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
innodb_io_capacity=800
innodb_large_prefix=1
innodb_lock_wait_timeout=120
innodb_log_buffer_size=16M
innodb_log_file_size=1G
innodb_log_files_in_group=2
innodb_max_dirty_pages_pct=90
innodb_print_all_deadlocks=ON
innodb_purge_threads=2
innodb_read_io_threads=16
innodb_support_xa=0
innodb_thread_concurrency=0
innodb_write_io_threads=8
# WSREP
bind-address=0.0.0.0
wsrep_cluster_address="gcomm://10.0.5.30,10.0.5.35,10.0.5.19"
wsrep_cluster_name="DBreporting"
wsrep_max_ws_rows=393216
wsrep_max_ws_size=3221225472
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_slave_threads=8
wsrep_sst_auth=mysqluser:mysqlpass
wsrep_sst_method=xtrabackup-v2
# SST
[sst]
compressor="pigz"
decompressor="pigz -d"
inno-apply-opts="--use-memory=10G"

Joiner my.cnf

# Template my.cnf for PXC
[mysqld]
back_log=65535
binlog_format=ROW
bulk_insert_buffer_size=8M
character_set_server=utf8
collation_server=utf8_unicode_ci
connect_timeout=15
datadir=/xfs/mysql
default_storage_engine=InnoDB
expire_logs_days=5
interactive_timeout=900
join_buffer_size=1M
key_buffer_size=16M
log_bin=db-bin
log_error=/var/log/mysql/mysql.log
long_query_time=1
max_allowed_packet=16M
max_connect_errors=4294967295
max_connections=5000
max_heap_table_size=16M
max_prepared_stmt_count=1048576
myisam_recover_options
myisam_repair_threads=1
myisam_sort_buffer_size=8M
open_files_limit=96000
query_cache_limit=2M
query_cache_min_res_unit=6144
query_cache_size=0
query_cache_type=0
read_buffer_size=2M
read_rnd_buffer_size=1M
server_id=19
skip-federated
skip_name_resolve=ON
slow_query_log=1
slow_query_log_file=/var/log/mysql/mysql-slow.log
sort_buffer_size=1M
sql_mode=STRICT_ALL_TABLES
sync_binlog=0
table_open_cache=96000
table_open_cache_instances=16
thread_cache_size=16384
thread_handling=pool-of-threads
tmp_table_size=16M
transaction_isolation=READ-COMMITTED
wait_timeout=900
# InnoDB
innodb_adaptive_hash_index=OFF
innodb_additional_mem_pool_size=8M
innodb_autoinc_lock_mode=2
innodb_buffer_pool_dump_at_shutdown=ON
innodb_buffer_pool_instances=8
innodb_buffer_pool_load_at_startup=ON
innodb_buffer_pool_populate=1
innodb_buffer_pool_size=40G
innodb_checksum_algorithm=crc32
innodb_data_file_path=ibdata1:10M:autoextend
innodb_file_format=Barracuda
innodb_file_format_max=Barracuda
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=2
innodb_flush_method=O_DIRECT
innodb_io_capacity=800
innodb_large_prefix=1
innodb_lock_wait_timeout=120
innodb_log_buffer_size=16M
innodb_log_file_size=1G
innodb_log_files_in_group=2
innodb_max_dirty_pages_pct=90
innodb_print_all_deadlocks=ON
innodb_purge_threads=2
innodb_read_io_threads=16
innodb_support_xa=0
innodb_thread_concurrency=0
innodb_write_io_threads=8
# WSREP
bind-address=0.0.0.0
wsrep_cluster_address="gcomm://10.0.5.19,10.0.5.30,10.0.5.35"
wsrep_cluster_name="DBreporting"
wsrep_max_ws_rows=393216
wsrep_max_ws_size=3221225472
wsrep_node_address="10.0.5.19"
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_slave_threads=8
wsrep_sst_auth=mysqluser:mysqlpass
wsrep_sst_method=xtrabackup-v2
# SST
[sst]
compressor="pigz"
decompressor="pigz -dc"
inno-apply-opts="--use-memory=10G"

Have you opened ports 3306, 4444, 4567, and 4568 on both hosts?

Yes, all ports are open. Actually, all the servers sit behind NAT on the same network, with no firewall between them.
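
For anyone else verifying the same thing, a quick reachability check that needs nothing beyond bash and coreutils (run it from each node against each other node's address; 4444 in particular must be reachable on the joiner, since the donor streams to it):

# Probe each Galera/SST port with a 3-second timeout.
for p in 3306 4444 4567 4568; do
  timeout 3 bash -c "</dev/tcp/10.0.5.19/$p" && echo "port $p open" || echo "port $p closed or filtered"
done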

Update:
I tried setting up a clean machine on AWS but got the same result: it syncs at the beginning and then fails midway.

Joiner error:

removed ‘/xfs/mysql/performance_schema/events_statements_summary_global_by_event_name.frm’
removed ‘/xfs/mysql/performance_schema/hosts.frm’
removed ‘/xfs/mysql/performance_schema/users.frm’
removed ‘/xfs/mysql/performance_schema/accounts.frm’
removed ‘/xfs/mysql/performance_schema/events_statements_summary_by_digest.frm’
removed ‘/xfs/mysql/performance_schema/session_connect_attrs.frm’
removed ‘/xfs/mysql/performance_schema/session_account_connect_attrs.frm’
removed directory: ‘/xfs/mysql/performance_schema’
removed ‘/xfs/mysql/db-bin.000002’
removed ‘/xfs/mysql/db-bin.index’
removed ‘/xfs/mysql/ib_buffer_pool’
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20160418 10:42:46.904)
2016-04-18 10:42:48 377 [Note] WSREP: (3ae9a43e, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [ERROR] Removing /xfs/mysql//.sst/xtrabackup_galera_info file due to signal (20160418 13:12:17.843)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160418 13:12:17.849)
2016-04-18 13:12:17 377 [ERROR] WSREP: Process was aborted.
2016-04-18 13:12:17 377 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.5.1.64' --datadir '/xfs/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '377' '' : 10 (No child processes)
2016-04-18 13:12:17 377 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2016-04-18 13:12:17 377 [ERROR] WSREP: SST script aborted with error 10 (No child processes)
2016-04-18 13:12:17 377 [ERROR] WSREP: SST failed: 10 (No child processes)
2016-04-18 13:12:17 377 [ERROR] Aborting

2016-04-18 13:12:18 377 [Warning] WSREP: 1.0 (mysql-reporting01): State transfer to 0.0 (mysql-reporting02) failed: -22 (Invalid argument)
2016-04-18 13:12:18 377 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
2016-04-18 13:12:18 377 [Note] WSREP: gcomm: terminating thread
2016-04-18 13:12:18 377 [Note] WSREP: gcomm: joining thread
2016-04-18 13:12:18 377 [Note] WSREP: gcomm: closing backend
2016-04-18 13:12:19 377 [Note] WSREP: view(view_id(NON_PRIM,3ae9a43e,4) memb {
3ae9a43e,0
} joined {
} left {
} partitioned {
fd77376e,0
})

Did you get this fixed? If so, how?

I’m having exactly the same issue at the moment, but I found that if I use “rsync” as my SST method for the initial replication, then once the node is synced I can flip back to “xtrabackup-v2” and things are fine after that…

Ideas, suggestions?
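
In case it helps anyone trying this workaround, the flip is just a one-line change in my.cnf between restarts (a sketch; option names as in the configs above):

# 1) On the joiner, before the first start, use rsync for the initial SST:
#      [mysqld]
#      wsrep_sst_method=rsync
# 2) Start the node and wait until it reports Synced:
#      mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
# 3) Switch back so later transfers use xtrabackup-v2:
#      wsrep_sst_method=xtrabackup-v2
#    then restart the node.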

BTW, if I change sst_method to

Unfortunately, it's not fixed. I've given up on this cluster for now and am running a traditional master-slave setup instead.
I also tried rsync and it still failed.

Got hit with exactly the same problem. It seems the issue appears when databases are larger than about 1TB; I have 1.6TB on the donor trying to SST to the joiner.
The process exits with the same error as the topic starter's:
JOINER:


Removed directory: ‘/var/lib/mysql/performance_schema’
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20160918 17:19:57.001)
2016-09-18 17:19:58 26196 [Note] WSREP: (a4fbcade, 'tcp://0.0.0.0:4567') turning message relay requesting off
2016-09-18 18:55:03 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.12864S), skipping check
2016-09-18 18:55:34 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.63051S), skipping check
2016-09-18 19:15:56 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT2.57793S), skipping check
2016-09-18 19:16:26 26196 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.99439S), skipping check
WSREP_SST: [ERROR] Removing /var/lib/mysql//.sst/xtrabackup_galera_info file due to signal (20160918 19:49:28.648)
WSREP_SST: [ERROR] Cleanup after exit with status:143 (20160918 19:49:28.650)


DONOR:


2016/09/18 19:49:28 socat[29871] E write(3, 0x9107f0, 8192): Broken pipe
WSREP_SST: [ERROR] innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log (20160918 19:49:28.663)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20160918 19:49:28.665)

innobackup.backup.log:


160918 19:49:27 >> log scanned up to (2649199442061)
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[01] xtrabackup: Error: failed to copy datafile.


To whoever is interested in this: I found the problem.
/usr/bin/mysql-systemd hardcodes the timeout parameters. On line 96:

while [[ $i -lt $service_startup_timeout ]]; do

The control process waits for the pid file, which is created only once mysqld has started successfully. service_startup_timeout is set to 900 seconds elsewhere in the script, unless overridden by a my.cnf option. That option supposedly belongs in the [mysql.service] section, but mysql-systemd never parses that section, only [mysqld].

If you look in /usr/lib/systemd/system/mysql.service you will also see some indication of what the problem might be:

# Timeout is handled elsewhere
# service-startup-timeout in my.cnf
# Default is 900 seconds
TimeoutStartSec=0

Percona tries to provision for a long SST transfer, but the logic is rather crude:


if [[ $verb = 'created' ]];then
if ([[ -e $sst_progress_file ]] || grep -q -- '--wsrep-new-cluster' <<< "$env_args" ) \
&& [[ $startup_sleep -ne 10 ]];then
echo "State transfer in progress, setting sleep higher"
startup_sleep=10
fi
fi

The above means that while an SST is in progress, the script sleeps 10 seconds per loop iteration instead of 1, effectively raising the timeout to 9000 seconds, i.e. 2 hours and 30 minutes.
On a 1Gbps connection (roughly 125MB/s) that is only enough to transfer about 1TB of data before the control process forcefully kills the SST.
This explains rellik’s problem and mine: we both had over 1TB of data.
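
If you need a stopgap before a proper fix, one option is to raise the hardcoded value in the wrapper itself. A sketch, assuming your copy assigns the default as service_startup_timeout=900 as described above; the file is package-owned, so the edit will be reverted by the next upgrade:

# Raise the control-process timeout from 900s to 8 hours, enough headroom
# for a multi-TB SST on a 1Gbps link. Re-apply after package upgrades.
sudo sed -i 's/^service_startup_timeout=900/service_startup_timeout=28800/' /usr/bin/mysql-systemd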

You can also look into the DONOR server's logs; there might be some more info in there.
In my case it was:

All I had to do was increase the limit on the number of open files on the DONOR.
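
For reference, on CentOS 7 the limit that matters for a systemd-started mysqld is the unit's LimitNOFILE, not /etc/security/limits.conf, so a drop-in is the reliable way to raise it (a sketch; size the number to your table count):

# Create a systemd override raising the open-files limit for mysql.service.
sudo mkdir -p /etc/systemd/system/mysql.service.d
printf '[Service]\nLimitNOFILE=1000000\n' | sudo tee /etc/systemd/system/mysql.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart mysql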

Hi dcherniv - I would highly recommend filing a bug for this.

For me, it worked once I ensured there was enough free space for the innobackupex move-back step.

It was failing at this step:
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check --datadir=/var/lib/mysql/data/ --move-back --force-non-empty-directories ${DATA} &>${DATA}/innobackup.move.log (20191211 08:19:00.673)
WSREP_SST: [ERROR] Cleanup after exit with status:1 (20191211 08:19:00.702)

The same error occurred when I tried to join another node. All nodes have the same 1.5TB of disk space for MySQL data. The MySQL data size was around 848G plus 10G of UNDO files, which left nearly 640GB free. It seems the move fails due to insufficient space. I increased the disk to 2TB on all nodes and the SST was successful.
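
A quick pre-flight check along these lines would have caught it. The joiner briefly needs room for both the streamed backup in .sst and the datadir it is moved into, so, going by the experience above, budget roughly twice the donor's data size (paths as in this thread):

# On the donor: measure the dataset that will be streamed.
du -sh /var/lib/mysql/data
# On the joiner: confirm free space on the data volume comfortably exceeds it.
df -h /var/lib/mysql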