Just ran into a problem in our production environment: we ran pt-online-schema-change to add an index to a table. The tool wasn't able to connect to the slave DBs, so it never monitored replication lag, and lag on the slaves went crazy.
Here are the logs (I’ve edited out the sensitive information):
[eliot@ ~]$ sudo pt-online-schema-change --execute --alter "ADD INDEX colname (colname)" D=dbname,t=tablename
1> Cannot connect to h=<SLAVE_IP_1>
1> Cannot connect to h=<SLAVE_IP_2>
No slaves found. See --recursion-method if host has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Clearly it knew there were slaves, but because it couldn't connect to them, it looks like it treated them as if they didn't exist (rather than as slaves that exist but are unreachable).
Is there a flag I’m missing that could get the tool to abort this change if it can’t connect to its slaves? The --dry-run version also makes no mention of these slave connections, which is another reason why it slipped my mind/didn’t ring any alarm bells.
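Independent of any flag the tool may offer, one possible stopgap is a pre-flight wrapper that refuses to start the change unless every known slave accepts a connection. This is only a rough sketch under assumptions: the function names and the MYSQL_BIN variable are hypothetical, the slave host list has to be supplied by hand, and credentials are assumed to come from something like ~/.my.cnf. It only checks reachability at start time; it does not monitor lag during the change.

```shell
#!/usr/bin/env bash
# Pre-flight sketch (hypothetical helper, not part of Percona Toolkit):
# run a command only if every listed slave answers a trivial query.

# Client binary to use; overridable so the check can be exercised with a stub.
MYSQL_BIN="${MYSQL_BIN:-mysql}"

# Return 0 only if at least one host is given and all hosts are reachable.
check_replicas() {
    local host failed=0
    if [ "$#" -eq 0 ]; then
        echo "No slave hosts given; refusing to continue." >&2
        return 1
    fi
    for host in "$@"; do
        # Credentials are assumed to come from the client's own config file.
        if ! "$MYSQL_BIN" --host="$host" --connect-timeout=5 \
                --execute='SELECT 1' >/dev/null 2>&1; then
            echo "Cannot connect to slave: $host" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Usage: run_if_replicas_reachable "host1 host2" command [args...]
run_if_replicas_reachable() {
    local hosts="$1"
    shift
    # Intentionally unquoted: split the host list on whitespace.
    if ! check_replicas $hosts; then
        echo "Aborting: at least one slave is unreachable." >&2
        return 1
    fi
    "$@"
}
```

With this sourced, the original command would be invoked as run_if_replicas_reachable "<SLAVE_IP_1> <SLAVE_IP_2>" followed by the full sudo pt-online-schema-change command line, so that an unreachable slave stops the run before any table is touched.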