Resolving IST Utilization Issues in a Percona XtraDB Cluster and Adding a New Node

Hello All,

In my current setup, I’m managing a two-node Percona XtraDB Cluster on Ubuntu 16.04.6, running MySQL 5.7.23-23 (mysql Ver 14.14) with xtrabackup 2.4.12 (based on MySQL server 5.7.19). However, I’ve encountered a persistent problem: whenever MySQL fails or shuts down unexpectedly, the affected node does not use IST (Incremental State Transfer) and falls back to SST (State Snapshot Transfer). The SST wipes the node’s data directory and rebuilds it from scratch, which is very time-consuming given our 3 TB of data. I’m looking for ways to recover a node via IST, or to force IST, and for any manual steps that could help a node rejoin with IST after a crash or ungraceful MySQL shutdown.
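For context on why a crashed node falls back to SST: after an unclean shutdown, Galera typically records `seqno: -1` in the node’s `grastate.dat`, and without a recoverable position the joiner cannot request IST. The sketch below (my own illustrative helpers, not part of any Percona tooling, with a made-up example file body) shows the check involved:

```python
# Sketch: decide whether a crashed node can even attempt IST, based on
# its grastate.dat. Assumes the standard Galera grastate.dat format;
# the helper names are illustrative, not part of any Percona tooling.

def parse_grastate(text):
    """Parse the key: value pairs from a grastate.dat body."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state

def ist_possible(state):
    """A node with seqno -1 (unclean shutdown) cannot request IST
    unless its position is recovered first (e.g. via gcache.recover=yes
    on the donor plus wsrep position recovery on the joiner)."""
    return state.get("seqno", "-1") != "-1"

# Example grastate.dat after a *clean* shutdown (uuid is made up):
clean = """# GALERA saved state
version: 2.1
uuid: 6e6fa2ee-xxxx
seqno: 1234567
safe_to_bootstrap: 0
"""
print(ist_possible(parse_grastate(clean)))  # True: IST may be attempted
```

If `seqno` is `-1`, the donor’s gcache contents are irrelevant; the position has to be recovered before IST is even on the table.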

To address this issue, I’m in the process of adding a third node to the cluster, in line with Percona best practices. This will let me experiment with various recovery mechanisms without jeopardizing the two nodes carrying the main production workload. However, adding the node caused a significant spike in MySQL memory usage on the master node, even though the master was not the donor for the new node. I had to shut the extra node down due to the memory pressure on the master. My primary goal is to find a way to add a new node without risking the current workloads.
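One option worth considering (my suggestion, not something stated in this thread) is to pin the donor explicitly on the joining node so it never picks the busy node; `wsrep_sst_donor` is a standard Galera option for this. A minimal sketch, with hypothetical node names:

```ini
# my.cnf on the joining (third) node -- "node2" is a placeholder hostname
[mysqld]
# Prefer node2 as donor; the trailing comma allows fallback to any
# other node if node2 is unavailable.
wsrep_sst_donor = node2,
```

This only controls donor selection; it does not by itself reduce the donor’s load while the transfer runs.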

I would appreciate your insights and assistance in resolving these challenges I’m encountering.

Just so you know, 5.7 reaches end of life in 8 days (Oct 31 2023). There will be no more support for this version, so you need to upgrade as soon as you can; otherwise you won’t be able to get any help.

That is dangerous. You should always have a minimum of 3 nodes: with only 2 nodes there is no quorum majority, so losing either node (or the link between them) means split brain.

I’ve never understood why this isn’t enabled by default. You should add this to all my.cnf files.

wsrep_provider_options = "gcache.size=2G;gcache.recover=yes"
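Whether IST is possible after downtime depends on the donor’s gcache still holding every write-set the joiner missed, so the gcache should be sized against your write rate and worst-case downtime. A rough sizing sketch (the rate figure is a placeholder you would measure on your own cluster, e.g. from `wsrep_replicated_bytes` deltas):

```python
# Rough gcache sizing: IST works only if the donor's gcache still holds
# all write-sets generated while the joiner was down.

def min_gcache_bytes(writeset_bytes_per_sec, max_downtime_sec, safety_factor=2.0):
    """Smallest gcache that still permits IST after the given downtime,
    padded by a safety factor for bursts."""
    return int(writeset_bytes_per_sec * max_downtime_sec * safety_factor)

# Example: ~1 MB/s of write-sets, tolerate 30 minutes of downtime.
needed = min_gcache_bytes(1_000_000, 30 * 60)
print(needed // (1024 ** 2), "MiB")  # 3433 MiB -> a 2G gcache would be too small
```

For a 3 TB dataset with any meaningful write rate, it may be worth going well above 2G if disk space allows, since an SST is so expensive to redo.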

First of all, thank you for the fast response. Regarding your reply:

I’m aware of this; it’s the main reason I’m trying to set up an additional node. Yet I’m encountering memory spikes when it starts syncing: the 2nd node’s status in ProxySQL changes to OFFLINE_SOFT and, as the DONOR, it stops serving requests. That leaves the master node as the only functioning node, and I don’t want to risk the current workload with an OOM on the master. Do you have any suggestions to lower this risk or overcome this obstacle?

My servers are running on hosts with 16 CPUs and 32 GB of memory each.

Is it possible to set these variables while MySQL is running, or does it require a restart of the MySQL server?

I’m not sure if those specific gcache parameters can be set dynamically. I would recommend a restart.

Yes, that is expected.

I’m unsure as to why node1 would see increased memory while node2 is serving as DONOR.