Since creating backups from a slave db is bad for a few reasons, which db should be used in a master-master-slave scheme? The primary master, right? For now, the secondary master is for automatic failover so I won’t have to promote a slave to master, etc. Or is there a better solution for the whole thing (failover, replication, and backup)?
Why is taking a backup from a slave (or standby master in your case) bad? That is common practice at most companies. As long as you monitor for slave lag and check the slave for consistency (mainly if you use statement based replication), then there should generally be no problem with it.
I thought that relying on the slave for backups is bad because it could be out of sync, corrupt, or in some other bad state and we may not know it. I do use statement-based replication. I will look around for tips to check for consistency. I tried a few pt tools, but I got results that would lead me to believe that the tables I checked were drastically out of sync, when a select * into outfile on both and a diff of the file revealed that they were the same.
Checking for slave lag and consistency is something you should do either way, so definitely a good thing to get that going. Especially if you are using master-master. If you think about it, basically what you are saying is that you cannot rely on the data from your standby master being correct. If that is the case, what use is having the standby master? If you failed over and the data was out of sync, then you’d be in trouble. So once you gain confidence in your standby master being accurate, there is no reason you cannot take a backup from it.
You do bring up another good point, which is that you should always check your backups, including loading the backup on another server on a consistent basis to make sure the backup will actually load (and how long it will take) when you really need it.
The Percona toolkit is pretty solid in both cases.
If you are having trouble with the pt-table-checksum, I’d post up your questions in that subforum and someone from the company should be able to help with specifics.
For slave lag, their pt-heartbeat script works well as it is pretty simplistic ([url]http://www.percona.com/doc/percona-toolkit/2.1/pt-heartbeat.html[/url]). Basically it just inserts a timestamp into a table on the master which replicates to the slaves. It then just compares the timestamp values between the master and slave(s) and determines slave lag. This is preferable to the “seconds behind master” indicator, as that is not very reliable. You could even write a simple bash script to do this yourself if you prefer.
Thanks, Scott. I’ll look into the full toolkit. If the checksums issue repeats, I’ll post in the appropriate place. I was relying on “seconds behind”, but will try the heartbeat script. It’s good a cool name and I’ll be able to say that I’m checking the heartbeat of my db servers.