OS: CentOS 6.2
MySQL: Percona Server 5.5.25a
All our servers are 64-bit.
We have our replication set up as follows:
prod_A -> prod_B -> prod_C --> internal_D -> internal_E and internal_F
The one double-dash between prod_C and internal_D represents our only slow network connection. All other connections are Gigabit.
All our prod servers have 32G RAM and high-speed disks. Each internal VM has 2G RAM and shares disk space (and thus disk IO) with other VMs on “normal” server hardware.
Our replication is simple: replicate everything, no filters.
Last February, we switched from Statement-based logging to Row-based logging because of problems with Statement-based. Our replication lagged a little bit on occasion after doing this, but not terribly. We were running on 5.5.20 at the time. We upgraded to 5.5.25a on all servers (each slave up the line) a couple of weeks ago and switched to Mixed-based logging. Now we are experiencing terrible lag times on all our internal slaves, starting with the first one. The worst (internal_E) is now 73,000 seconds behind. My first thought was that we were network-bound, but my SysAdmins tell me that we are disk-bound, not network, CPU or anything else bound.
I admit that we do have a lot of write activity on the prod servers, and we also upgrade our homegrown software on a regular basis. We may be doing significantly more writes now than we were last March…I don’t know.
Is there a way I can tell what is causing the lag? Is it just the volume of writes? If yes, how do I prove this? If it is just volume, I can understand why internal_D would get behind, but why would the others get even more behind?
Is there anything I can do to speed up our internal servers?
Do I need to upload or post anything to provide more information?
UPDATE #1: 8-23-2012
A better way to phrase my “volume” question? Is there a way I can prove/disprove that the rate of relay_log entry receipt is greater than the ability of the slave to apply those changes?
I tried to answer this for myself by looking at log and file I/O’s per second, but this has not helped.
Might there be a MySQL bug that is causing a delay? This bug report – http://bugs.mysql.com/bug.php?id=60964 – speaks to the mysqlbinlog tool, but I have found what appears to be the same issue in our relay logs. Is it possible that a relay log switch is causing MySQL to lose the table_mapping information?
UPDATE #2: 8-23-2012
I have set innodb_flush_log_at_trx_commit on all our internal servers to 0. Our production servers still have their default value of 1. This seems to have helped, as it “seems” our internal slaves are getting behind more slowly now. Perhaps this will help further given enough time.