I’m struggling to work out how to perform ALTER TABLES without down time in a cluster, even with the nice pt-online-schema-change tool.
If there are many other tables which have foreign key references to the table that you are altering, then my experience is that you’ll get a lot of downtime at the end (presumably even if it selects drop_swap). This also doesn’t seem to play nicely when there are multiple nodes running.
So for large tables with many key references to them, I’ve ended up reducing the cluster to a single node, doing an ALTER TABLE and not using pt-online-schema-change, then forcing SSTs to the other nodes to bring them back in by deleting grastate.dat.
This isn’t great - significant down time. Am I missing something? Yes, maybe I’m missing rolling schema upgrades (RSU).
I’ve not tried this yet - but it looks like the process would be:
- take a node out of the load balancer that sends requests to it
- set global wsrep_OSU_method=RSU;
- do the ALTER TABLE
- set global wsrep_OSU_method=TOI;
- put the node back into the load balancer
- repeat for the other nodes
Is that broadly what people do with RSU?
One special case here is the use of ALTER TABLE to reclaim InnoDB space. Is what I outline above a way to reclaim space without down time?
General hints and tips welcome.