Adding Nodes With Minimal Downtime?

Hi, I’ve been using Percona XtraDB Cluster as the clustering solution for my online service for over two years now and have been pleased with it’s ease-of-use and reliability; however, I have an issue currently that I cannot seem to resolve. I have a large multi-tenant database and I am unable to bring additional nodes online without significant disruption to my online service. This is only a problem when I have single node online that must act as the donor. I have read the articles already about how to avoid SST when bringing a new node online; however, this solution still creates a locking problem that disrupts my service.

All my databases are Innodb, which to the best of my knowledge are not locked during an SST / XtraBackup copy. However, because my database is multi-tenant, whenever a new user creates an account a new database is added to the system and the MySQL maintained databases that contain the meta-data for the databases are updated. Those databases are not Innodb and this is where the locking problem occurs.

Because I have 1000’s of customers that are located all over the world that are constantly accessing the service, I need to minimize downtime of the service to an hour or two, on the weekend, in the wee hours of the morning. I’ve been unsuccessful at achieving this thus far.

Here is what I have tried so far without success.

  1. Brought new node online and allowed it to perform SST using XtraBackup v2. The is process has run for at least 5 hours each time without disruption for most of the time, but always ends up locking up the system at some point due to a new customer creating an account. When a new customer creates an account during this process, the database blocks with WAITING FOR BACKUP LOCK and then no users can access the database at all.

  2. Tried the suggested trick of avoiding SST by performing XtraBackup directly, but XtraBackup runs into the same problem as the SST process when a new customer creates an account.

Since the SST / XtraBackup process runs for most of the time without issue (It appears that the issue occurs in the small window when the MySQL maintained, non-Innodb tables are being copied), the ideal situation would be to shut my service down for the very small window where those databases are being copied to avoid the locking problem. However, there doesn’t seem to be a way to control the order in which databases are copied. Having the problematic databases copied last would be ideal, as SST could run for hours without any disruption and then for the small amount of time needed to copy the non-Innodb databases, the service could be taken offline.

Because I have been unsuccessful at completing the SST process without service disruption, I have no idea how long the actual process takes. The problem always occurs about 5+ hours in. I don’t know how much longer the process needs to take to complete at that point, so I need to kill it so customers can continue to use the system.

I’ve tried to be as concise as possible describing the exact issue. Do you have any solutions? Can the order of database copies be controlled either by SST or direct XtraBackup? Is there a way to copy the non-Innodb databases separately from the Innodb databases, so the service can function normally as the Innodb databases are being copied and then the service can be shutdown for short time while the non-Innodb datbases are copied?

The Avoiding SST trick seems to require a full backup. If I could copy the Innodb and non-Innodb databases separately, then build a new node, that would work also.

I would really appreciate some advice here. I’ve been running on a single node all week and am very nervous. I really need to get the cluster back online as soon as possible.

After further experimentation, I have found that performing an SST using rysnc is substantially faster than using Xtrabackup. In my case, since both methods cause locking which disrupts my service, rsync is the preferred method. Using rsync I can complete SST in an hour, which is manageable for something that doesn’t need to occur very often. Using Xtrabackup as the method takes at least 5 hours and could be much longer as I’ve never fully completed a Xtrabackup SST due to time constraints.