Replicating into cluster is slow

I am migrating from a system running master-slave replication with a master and 5 slave servers. I have set up a cluster with 6 nodes and am replicating from the original master into the new cluster in preparation for switching over to the new cluster. The problem I have is the replication into the cluster seems to be slow and it is difficult to catch up when there are more than 4 nodes running. When I had the 5th node it gets behind when the server is added and then doesn’t seem to catch up. Is there something that can be done to improve this performance so that I can actually contemplate adding the 6th node?

The master-slave system is running MySQL 5.7 and the new cluster is running XtraDB 5.7.

Hello @Tom-sl , I would advise against 6 nodes due to split-brain issues. Keep the number of nodes ‘odd’, either 3 or 5. Understand that the fastest transaction speed of PXC is limited by the slowest member of your cluster. If one of your nodes is in a different data center, all other nodes must wait for that node to acknowledge replication events.

Make sure all PXC nodes are the same specification (CPU/RAM/DISK). Have you set up Percona Monitoring? You should do this so you can identify where the slowdown is happening.

Are you using ROW-based replication? Do you have parallel replication enabled?

All the servers are in the same datacenter and are the same spec. I made a few config changes which seemed to improve things a bit, but when I try switching applications to read from the servers, the replication slows down again. PMM is set up, though I’m not an expert so am not 100% sure what to look for with this particular issue. It doesn’t seem to indicate a problem with resources as everything looks fine and doesn’t look like the servers are breaking a sweat. CPU usage increases when I make the applications read from the cluster, but no higher than on the current servers.

The servers are virtual servers and spec is as follows:

CPU 4 vCPUs

RAM 15 GB

System Disk 40 GB

Data Disk 150 GB (data and logs are all on this disk)

MySQL settings are:

[mysqld]

server-id=1

datadir=/data/var/lib/mysql

socket=/var/run/mysqld/mysqld.sock

log-error=/data/var/log/mysql/mysqld.log

pid-file=/var/run/mysqld/mysqld.pid

log_bin= /data/var/log/mysql/mysql-bin.log

log_slave_updates

expire_logs_days=2

bind-address = 0.0.0.0

sql-mode = ‘STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION’

ssl_ca=/data/var/lib/mysql/ca.pem

ssl_cert=/data/var/lib/mysql/server-cert.pem

ssl_key=/data/var/lib/mysql/server-key.pem

lower_case_table_names = 1

max_connections = 800

skip-name-resolve

innodb_file_per_table

innodb_buffer_pool_size = 10G

innodb_buffer_pool_instances = 16

innodb_io_capacity_max=8000

innodb_io_capacity=4000

innodb_log_file_size=256M

innodb_flush_log_at_trx_commit = 2

sync_binlog = 0

wsrep_slave_threads = 32

slave-skip-errors = 1062,1047

max_binlog_size = 100M

#PMM Logging

log_output=file

slow_query_log=ON

long_query_time=0

log_slow_rate_limit=100

log_slow_rate_type=query

log_slow_verbosity=full

log_slow_admin_statements=ON

log_slow_slave_statements=ON

slow_query_log_always_write_time=1

slow_query_log_use_global_control=all

innodb_monitor_enable=all

userstat=1

pxc-encrypt-cluster-traffic=ON

early-plugin-load=keyring_file.so

keyring-file-data = /var/lib/mysql-keyring/keyring

@Tom-sl Your innodb_io_capacity is tuned too high. Please lower that to 400. Also, userstat=1 can CPU intensive, you should disable that too. You only have 4 vCPUs so wsrep_slave_threads should not be larger than 4. Additionally, 16 buffer pool instances for only 10GB of buffer pool is too many. I would reduce that down to 2 instances. Lastly, I would increase innodb_log_file_size to 1G You can make those changes and simply restart MySQL to take affect.

Are all of your tables using the InnoDB engine?

Check PMM and look for various things like Disk IO maxing out, or CPU saturation maxing a single core. Look at the replication graph, are you seeing large spikes in lag? You can send a few screenshots here.

It’s very difficult to diagnose an issue like this over forums, but I’ll try to help out.

I made those changes but it seems to be slightly worse now. I will leave it running for a bit and then take some screenshots of PMM. Yes, all InnoDB tables.

Looking at the PMM graph for replication lag, you are clearly seeing lag go up and down?

Yes, here is the graph:

I have spent most of the day adding things reading from the cluster and then removing them again (the long line of nothing was me going out for lunch).

That graph is from the async master correct? It is receiving replication events from PXC? And you confirm you are running with ROW-based replication?

This graph is from the PXC server that is connected as a slave from the current MySQL master - We are migrating from standard mysql master-slave to xtradb. The replication on the master is set to mixed format, the new cluster is set to row.

Here are a few PMM screenshots from one of the pxc nodes.

The replication on the master is set to mixed format

This is probably your issue. You need to change the current master in your master/slave to use ROW. Mixed mode always uses STATEMENT unless MySQL determines the txn is non-deterministic then switches to ROW.

Change this on your master and restart mysql.

Also, on your PXC slave (the one receiving from master) set slave_parallel_workers = 4 and slave_parallel_type = LOGICAL_CLOCK

I’m not making it replicate from a slave which is using row based logs (I’m unable to change the actual master at the moment). Would this be expected to produce the same result? Due to using up all the disk it stopped replicating for a while, and is taking a very long time to catch up, far longer than I expect.

I cannot set the slave_parallel_type as it says this is unsupported.

@Tom-sl Are you on 5.7? This parameter is absolutely supported. https://dev.mysql.com/doc/refman/5.7/en/replication-options-replica.html#sysvar_slave_parallel_type

ROW-based replication is far more efficient than STATEMENT/MIXED and should help with the lag.

That should have said “I’m making it replicate from a slave which is using row based logs” - not sure why I put the not in the there. I just was unable to get it to go over around 550 writes per second.

It might be supported in MySQL 5.7, but XtraDB says it is unsupported:

ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits enabling slave_preserve_commit_order as it conflicts with galera multi-master commit order semantics

Interesting limitation from PXC. I was unaware.

Just so I can understand, is this the setup? M – MIXED → S — ROW —> PXC1

And PXC has 3 nodes? And you are watching slave lag on PXC1? What happens if you stop PXC2 and PXC3? Does lag go down?

Yes, that is the setup.

5 Nodes, removing a node does show an improvement in the replication speed.

Are you seeing any flow control issues on your PXC? Are all 5 nodes (that’s unusual, 3 is typical) on the same network? PXC cannot replicate faster than the slowest link between any two nodes. That’s why removing one sped things up. PXC synchronously replicates incoming traffic. All 5 nodes must ACK the replicated txn before anything else can happen.

I would stop all but 1 PXC. Wait for replication to catch up, then add 1 node. Monitor replication. If good, add another.

I imagine there’s something else fundamentally misconfigured in your setup, but it’s hard to diagnose over forum chat. I migrated a client in GCP last month from M/SS to a 3-node PXC. They were doing 1000’s of txns per second replicating into PXC without any lag at all.