We have a slave server on which, while a large report is running, queries replicated from the master start to time out (error 1205).
I want to use pt-slave-restart to keep attempting “START SLAVE” until the error clears, without skipping anything. It is not clear to me from the documentation how pt-slave-restart behaves.
Going by the documentation excerpt below, what exactly happens when there is an error?
pt-slave-restart sleeps intelligently between polling the slave. The current sleep time varies.
The initial sleep time is given by --sleep.
If it checks and finds an error, it halves the previous sleep time.
If it finds no error, it doubles the previous sleep time.
The sleep time is bounded below by --min-sleep and above by --max-sleep.
Immediately after finding an error, pt-slave-restart assumes another error is very likely to happen next, so it sleeps the current sleep time or the initial sleep time, whichever is less.
[This says it keeps sleeping, but it doesn’t say either “Upon finding an error, pt-slave-restart attempts to skip 1 error and then tries again” or “pt-slave-restart attempts to start the slave again, and if that fails X times, it attempts a skip.”]
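To check my reading of the sleep logic quoted above, here is a minimal Python sketch of it. It is only my interpretation of the documentation text, not the tool's actual source. The function names (next_sleep, simulate) and the numeric values are made up for illustration; the parameters initial, min_sleep and max_sleep stand in for --sleep, --min-sleep and --max-sleep. I am assuming that "whichever is less" is applied after the halving and that the bounds are applied last.

```python
def next_sleep(current, found_error, initial, min_sleep, max_sleep):
    """Return the next sleep time, following the quoted description."""
    if found_error:
        # Error found: halve the previous sleep time, then take the
        # smaller of that and the initial sleep time (assumed ordering).
        candidate = min(current / 2, initial)
    else:
        # No error found: double the previous sleep time.
        candidate = current * 2
    # Keep the result between the lower and upper bounds.
    return max(min_sleep, min(candidate, max_sleep))


def simulate(checks, initial=1.0, min_sleep=0.25, max_sleep=8.0):
    """Apply next_sleep over a sequence of check results (True = error)."""
    sleep = initial
    history = []
    for found_error in checks:
        sleep = next_sleep(sleep, found_error, initial, min_sleep, max_sleep)
        history.append(sleep)
    return history


if __name__ == "__main__":
    # Four clean checks followed by four checks that each find an error.
    print(simulate([False] * 4 + [True] * 4))
```

This models only how long the tool waits between checks. It says nothing about what the tool does when a check finds an error (skip or restart), which is exactly the part I cannot work out from the documentation.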