odd replication lag situation


I have 4 servers - 1 master and 3 slaves. The 3 slaves are identical in their hardware specs and setup.

slave1 and slave2 are doing more work than slave3, yet I still see frequent (every 10 seconds) replication lag on slave3, where I see none on slave1 and slave2.

Even if all the work that slave3 does is pointed at slave1 and slave2 on top of their normal workload, they still show zero lag.

So this isn’t a question about how to fight replication lag in general, which has been covered a lot. It’s a question about how to find out what on this particular server is causing the lag.

The ONLY difference I know of is that the first two are running 5.0.51a, and slave3 is running 5.0.67. I checked the changelogs to see if there has been a change in the way lag is calculated and reported, but found nothing. Also, it’s real lag (the app sees it), as opposed to just lag being reported differently.

Any help much appreciated

How do you measure the lag?
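
For reference, the usual starting point is the slave status output (a sketch; run on each slave):

```sql
-- Seconds_Behind_Master is the standard lag figure. It is derived
-- from the timestamp inside the replication event the SQL thread is
-- currently executing, so clock skew between servers can distort it.
SHOW SLAVE STATUS\G
```

If the app sees real staleness, comparing a freshly written timestamp row on the master against what the slave returns is a more direct check than trusting this counter alone.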

Do all servers have correct date/time settings? We had this kind of problem on some of our servers…


So I spent a little more time analyzing the work this box was doing vs the other two. It turns out it’s used by the reporting application, which by nature runs a lot of queries that cover far more rows.

This was causing table locks on tables that the slave SQL thread was trying to update. The slave thread, having a lower priority, effectively backgrounded itself until those queries finished, so Seconds_Behind_Master climbed periodically while the reporting queries were running.
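
In case it helps anyone hitting the same thing, this is roughly how the blocking shows up (a sketch; the table involved will be whatever your long-running reports read from):

```sql
-- On the lagging slave: the replication SQL thread appears in the
-- process list in state "Waiting for table lock" while a long
-- reporting SELECT holds a read lock on the same MyISAM table.
SHOW PROCESSLIST;

-- Lists open tables; an In_use count above zero means sessions are
-- currently holding or waiting on that table.
SHOW OPEN TABLES;
```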

Now looking to convert the affected tables to InnoDB to get row-level locks instead of table locks.
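
For anyone following along, the conversion itself is a one-liner per table (table name here is just an example), though note it rebuilds the whole table:

```sql
-- Convert a reporting table from MyISAM to InnoDB so that the
-- reporting SELECTs and the replication updates take row-level
-- locks instead of full table locks. The ALTER copies the table,
-- so on a big table run it during a quiet window.
ALTER TABLE report_data ENGINE=InnoDB;
```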

Many thanks