Donor replication queue overflow during SST

We recently upgraded from Percona XtraDB Cluster 5.6 to 5.7. After the upgrade, we noticed an issue when bringing up a node with SST: the donor node's replication queue backs up during an SST. Watching the process list on the donor node, it slowly fills up with stuck entries, and eventually the donor reaches max_connections and things really start to fall apart. We did not see this behavior with 5.6.
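
For context, the backlog on the donor can be watched while the SST is running with something like the loop below. This is only a sketch (the status counters are standard Galera/MySQL variables; the 60-second interval and the grep filter are arbitrary):

# Run on the donor while it is acting as SST donor (sketch)
while true; do
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
              ('wsrep_local_recv_queue','wsrep_flow_control_paused','Threads_connected');"
    # count sessions sitting in a wsrep state in the process list
    mysql -e "SHOW FULL PROCESSLIST;" | grep -c wsrep
    sleep 60
done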

Here is a screenshot from ‘PXC/Galera Graphs’ on the donor node:

Here is a screenshot from ‘PXC/Galera Graphs’ on the second database node (not the donor):

Here is a screenshot from ‘PXC/Galera Cluster Overview’ :

Here are the contents of innobackup.backup.log from the donor:

xtrabackup: recognized server arguments: --server-id=2 --datadir=/var/lib/mysql --log_bin=mysql-bin --innodb_flush_log_at_trx_commit=2 --innodb_file_per_table=1 --innodb_flush_method=O_DIRECT --innodb_log_file_size=50G --innodb_log_files_in_group=2 --innodb_log_buffer_size=512M --innodb_buffer_pool_size=433182M --innodb_read_io_threads=32 --innodb_write_io_threads=32 --innodb_io_capacity=5000 --tmpdir=/dev/shm --defaults_group=mysqld --parallel=4
xtrabackup: recognized client arguments: --server-id=2 --datadir=/var/lib/mysql --log_bin=mysql-bin --innodb_flush_log_at_trx_commit=2 --innodb_file_per_table=1 --innodb_flush_method=O_DIRECT --innodb_log_file_size=50G --innodb_log_files_in_group=2 --innodb_log_buffer_size=512M --innodb_buffer_pool_size=433182M --innodb_read_io_threads=32 --innodb_write_io_threads=32 --innodb_io_capacity=5000 --tmpdir=/dev/shm --defaults_group=mysqld --parallel=4 --target-dir=/backup --socket=/var/lib/mysql/mysql.sock --binlog-info=LOCKLESS --user=cluster_username --password=* --socket=/var/lib/mysql/mysql.sock --lock-ddl=1 --backup=1 --galera-info=1 --binlog-info=ON --stream=xbstream --target-dir=/dev/shm/pxc_sst_BGX1/donor_xb_FzlZ
181026 10:04:33 version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;mysql_socket=/var/lib/mysql/mysql.sock' as 'cluster_username' (using password: YES).
181026 10:04:33 version_check Connected to MySQL server
181026 10:04:33 version_check Executing a version check against the server...
181026 10:04:33 version_check Done.
181026 10:04:33 Connecting to MySQL server host: localhost, user: cluster_username, password: set, port: not set, socket: /var/lib/mysql/mysql.sock
Using server version 5.7.23-23-57-log
181026 10:04:33 Executing LOCK TABLES FOR BACKUP...
xtrabackup version 2.4.12 based on MySQL server 5.7.19 Linux (x86_64) (revision id: 170eb8c)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 0, set to 32011
xtrabackup: using the following InnoDB configuration:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 53687091200
xtrabackup: using O_DIRECT
InnoDB: Number of pools: 1
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
xtrabackup: Error: write to logfile failed
xtrabackup: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xtrabackup: Error: xtrabackup_copy_logfile() failed.

Here are the contents of our main MySQL configuration file:

[client]
socket=/var/lib/mysql/mysql.sock

[mysqld]
server-id=2
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
log-bin=mysql-bin
log_slave_updates
expire_logs_days=2
symbolic-links=0
user=mysql
sql_mode = ONLY_FULL_GROUP_BY,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
max_allowed_packet=512M
innodb_flush_log_at_trx_commit=2
max_connections=2000
table_open_cache=15000
query_cache_size=0
query_cache_type=0
innodb_file_per_table
innodb_file_format=barracuda
innodb_flush_method=O_DIRECT
innodb_log_file_size=50G
innodb_log_files_in_group=2
innodb_log_buffer_size=512M
innodb_buffer_pool_size=433182M
innodb_buffer_pool_instances=64
innodb_read_io_threads=32
innodb_write_io_threads=32
innodb_io_capacity = 5000
innodb_thread_concurrency=0
max_connect_errors=4294967295
max_heap_table_size=67108864
tmp_table_size=67108864
max_heap_table_size=67108864
key_buffer_size=7735M
tmpdir=/dev/shm
slow_query_log_file=/var/log/mysql-slow.log
log-queries-not-using-indexes
log_output=file
slow_query_log=ON
long_query_time=0
log_slow_rate_limit=100
log_slow_rate_type=query
log_slow_verbosity=full
log_slow_admin_statements=ON
log_slow_slave_statements=ON
slow_query_log_always_write_time=1
slow_query_log_use_global_control=all
innodb_monitor_enable=all
userstat=1
performance_schema=ON
skip_name_resolve

Here are the contents of our mysqld_safe configuration:

[mysqld_safe]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/lib/mysql/mysql.sock
nice = 0

Here are the contents of our wsrep/Galera configuration:

[mysqld]
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.100.236,192.168.100.237,192.168.100.238
binlog_format=ROW
default_storage_engine=InnoDB
wsrep_slave_threads=128
wsrep_provider_options="gcs.fc_limit=640;"
innodb_autoinc_lock_mode=2
wsrep_node_address=192.168.100.236
wsrep_cluster_name=prod
wsrep_node_name=d1
pxc_strict_mode=ENFORCING
wsrep_sst_method=xtrabackup-v2
wsrep_sst_auth="cluster_username:password"

[sst]
rlimit=80m
inno-backup-opts="--binlog-info=LOCKLESS"

We are a bit stumped as to why the donor node was fine during SST in 5.6 but is having problems now in 5.7. Any ideas?

Hello Victor, thanks for your question … are you still struggling with this one?

Could you let me have a few pieces of info and then I will see if I can get someone to take a look.
- What OS are you running on (version etc.)?
- Are there any error logs being written (apart from the errors in innobackup.backup.log)?
- What sub-versions of Percona XtraDB Cluster 5.6 and 5.7 were you upgrading from and to, please?

Although these are different versions, I found a past post that seems to be of some relevance… don’t act on it, though, unless you are sure that it matches your scenario. Perhaps just check it out first and report back? https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/39187-nodes-don-t-join-after-upgrade-to-xtradb-cluster-5-6-22-72-0-and-xtrabackup-2-2-10

Also, please be aware that on the open source forum we’re only really able to help with ‘generic’ type issues. If something’s really peculiar to your install and environment, then it might need our professional engagement to resolve the issue.

But let’s see where we get to with the generalist review first. Thanks!

PS: Apologies if I ask for info that’s obvious from what you’ve sent; I’m kind of “mildly” tech, so I might have overlooked something :)

- We are running CentOS 7.4.
- There are no errors in any logs that I can see. The errors in innobackup.backup.log are only there because we killed the SST process before it finished.
- I’m not entirely sure which version of 5.6 we were on before (how can we check? see the sketch after this list). We upgraded to 5.7.23-23.
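
For reference, a couple of ways to dig out the previous version on CentOS 7, assuming the packages were installed through yum (a sketch only):

# Current server version
mysql -e "SELECT @@version, @@version_comment;"

# Previous package versions from the yum transaction history
yum history list 'Percona-XtraDB-Cluster*'

# The server also prints its version in the error log at every startup
grep -i 'ready for connections' /var/log/mysqld.log*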

Hello Victor, the team have taken a look and have a couple of things for you to feed back on, if you could, please.

The first of these is that the ‘broken pipe’ message would suggest a network issue. That wouldn’t explain why it was working on 5.6 and not on 5.7 but it’s a first query from here. That might be just as you say because you interrupted the process.

The next line of enquiry is that Percona XtraBackup may have died on the receiver side. If you could maybe batch up all logs including the XtraBackup logs from the receiver node, that might shine a light on the issue.

Could you check these things out please?
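
For gathering those, something along these lines on the joiner should cover the usual suspects. A sketch only: the paths assume the error log and datadir locations from the configuration posted above, and the innobackup.*.log glob is just meant to pick up whatever SST/xtrabackup logs exist there:

# On the joiner: bundle the error log plus any xtrabackup/SST logs from the datadir
tar czf joiner_sst_logs.tgz /var/log/mysqld.log /var/lib/mysql/innobackup.*.log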

The ‘Broken pipe’ error is from when we manually killed the SST process.

XtraBackup is fine on the receiver side as well I believe.

The SST process works. After about 1.5 hours our 3rd database did come online – it’s just that the donor node could not do anything while it was being a donor.
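
For what it’s worth, the donor/desynced state during that window is visible in the wsrep status, e.g. (a sketch; wsrep_local_state_comment is a standard Galera status variable):

# On the donor while the SST is in progress; it should report Donor/Desynced
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"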

Hello Victor,

Did you capture more info from the donor while it was stuck during SST? In particular, I am interested in SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS\G.
Can you upload the error log from both donor and joiner covering the SST time?
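
For reference, that output can be captured periodically while the node is stuck with a small loop along these lines (a sketch; the file names and the 30-second interval are arbitrary):

# Run on the donor while it is stuck; collects the requested output every 30 s
while true; do
    ts=$(date +%Y%m%d_%H%M%S)
    mysql -e "SHOW FULL PROCESSLIST\G"     > /tmp/processlist_$ts.txt
    mysql -e "SHOW ENGINE INNODB STATUS\G" > /tmp/innodb_status_$ts.txt
    sleep 30
done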

Przemysław,

Thank you for your response, here is the information you requested:

Donor

The majority of the processes are like this:

Id: 23
User: system user
Host: 
db: NULL
Command: Sleep
Time: 456
State: wsrep: committing write set (1979129617)
Info: NULL
Rows_sent: 0
Rows_examined: 0

One of the more interesting ones is this:

Id: 118
User: system user
Host: 
db: database_main
Command: Sleep
Time: 457
State: Waiting for backup lock
Info: CREATE TABLE IF NOT EXISTS tmp_table_1541710320 (
item_tid INT UNSIGNED NOT NULL DEFAULT '0',
num INT UNSIGNED NOT NULL DEFAULT '0',
PRIMARY KEY (item_id)
) ENGINE = InnoDB
Rows_sent: 0
Rows_examined: 0
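
That “Waiting for backup lock” state can be reproduced outside of SST: xtrabackup takes LOCK TABLES FOR BACKUP (the “Executing LOCK TABLES FOR BACKUP...” line in the log above, driven by --lock-ddl=1), and any DDL issued while that lock is held blocks in the same way. A minimal two-session sketch (the table name is made up):

# Terminal 1: hold the backup lock, as xtrabackup does with --lock-ddl=1;
# SELECT SLEEP keeps the session, and therefore the lock, open for 10 minutes
mysql -e "LOCK TABLES FOR BACKUP; SELECT SLEEP(600);"

# Terminal 2: any DDL now sits in State 'Waiting for backup lock' until terminal 1 finishes
mysql -e "CREATE TABLE test.t_demo (id INT UNSIGNED NOT NULL, PRIMARY KEY (id)) ENGINE=InnoDB;"

In our case the blocked CREATE TABLE is being applied by a “system user” (replication) thread, so everything queued behind it stalls too, which would explain the “wsrep: committing write set” pile-up.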

I’ve attached the full process list in the zip file.

The SHOW ENGINE INNODB STATUS output is attached in the zip file as well.

The donor error log is also attached in the zip file.

Joiner

The error log is attached in the zip file.

logs.zip (76.8 KB)

This turned out to be a bug in 5.7. The issue is being tracked as PXC-2365 (“DDL during SST blocks Donor”) in the Percona JIRA.
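
For anyone hitting the same thing, the DDL that is doing the blocking can be spotted from the process list while the node is acting as donor, e.g. (a sketch; it just filters on the same state shown earlier in the thread):

# Show statements stuck behind the backup lock on the donor
mysql -e "SELECT id, user, time, state, LEFT(info, 80) AS stmt
          FROM information_schema.processlist
          WHERE state = 'Waiting for backup lock';"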

Thanks for that update Victor, appreciated.

Have you had any success in solving this? We are experiencing exactly the same thing when creating a temp table during SST. It’s impossible to get the nodes back up. Will have to change the application to prevent temp table creation while this is going on. The strange thing is we have not upgraded any of the packages, but this is happening now for the first time.

tucj7 said: Have you had any success in solving this? We are experiencing exactly the same thing when creating a temp table during SST. It's impossible to get the nodes back up. Will have to change the application to prevent temp table creation while this is going on. The strange thing is we have not upgraded any of the packages, but this is happening now for the first time.
Nope, no luck. We gave up and moved to MySQL InnoDB Cluster after we paid Percona a few grand to diagnose this issue.
victor said: Nope, no luck. We gave up and moved to MySQL InnoDB Cluster after we paid Percona a few grand to diagnose this issue.
Hi Victor,
I wanted to follow up on this message. We expect any customer who engages our services to be completely satisfied (by way of introduction, I am responsible for Service Delivery here at Percona, so this is my direct responsibility!).
I'll reach out to you directly to get more information on the experience you had. Needless to say, I apologize that it appears we did not meet your expectations.
Cheers,
Bennie