Any ideas on how to identify the source of a replication delay problem?
Here’s the scenario:
We have a master-slave setup with MySQL 5.0.45 on 64-bit RHEL. I am experiencing serious replication lag that seems to occur for no reason I can determine. There is very little activity on the slave, if any, aside from replication itself.
InnoDB on the master is configured with 8 GB of memory and the machine itself has two quad-core processors. On the slave, InnoDB has 12 GB and that box also has two quad-core processors. Both have 15K RPM SCSI drives; they are beefy machines, to say the least.
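For reference, the relevant InnoDB memory setting looks roughly like this in my.cnf (a sketch only; I am assuming the memory figures above correspond to innodb_buffer_pool_size):

[mysqld]
# slave config sketch; on the master the buffer pool is 8G
innodb_buffer_pool_size = 12G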
A Java application inserts a large number of records on the master; the data replicates to the slave, where another Java application reads it.
Replication lag appears and keeps increasing unless I stop the application that adds data on the master. Replication then eventually catches up, but very slowly.
The main table being updated has around 200M records, so it is not small, but the hardware should be more than sufficient.
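For completeness, I am also watching the usual things on the slave (nothing toolkit-specific):

SHOW SLAVE STATUS\G      -- watching Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running and Relay_Log_Space
SHOW FULL PROCESSLIST;   -- to see which statement the replication SQL thread is currently executing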
I have changed innodb_flush_log_at_trx_commit to various settings, but none seems to make a difference.
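For reference, I have been changing it at runtime roughly like this (it is a dynamic global variable, so no restart is needed; these are just the standard 0/1/2 values):

SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- default: write and flush the log to disk at every commit
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- write at every commit, flush to disk about once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- write and flush roughly once per second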
I installed mysqltoolkit to use its heartbeat monitor, and it reports some oddities. The reported lag goes from 1200 seconds behind to zero instantly when the Java application that is adding data on the master is stopped.
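For context, my understanding is that the heartbeat monitor measures lag by replicating a timestamp rather than relying on Seconds_Behind_Master; conceptually it does something like the following (the table and column names here are illustrative, not the tool's actual schema):

UPDATE heartbeat SET ts = NOW() WHERE id = 1;       -- run on the master every second or so
SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) AS lag_seconds
  FROM heartbeat WHERE id = 1;                      -- run on the slave; the age of the replicated row is the lag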
Here’s an example of the output from the mysql-heartbeat application that measures replication lag:
1234s [ 1237.02s, 1246.21s, 1258.56s
1235s [ 1236.93s, 1246.15s, 1258.51s
1233s [ 1236.80s, 1246.08s, 1258.45s
1234s [ 1236.67s, 1246.02s, 1258.40s
1235s [ 1236.60s, 1245.96s, 1258.35s
1236s [ 1236.53s, 1245.90s, 1258.30s
0s [ 1215.85s, 1241.73s, 1256.87s
0s [ 1195.22s, 1237.55s, 1255.45s
0s [ 1174.57s, 1233.38s, 1254.03s
0s [ 1153.90s, 1229.21s, 1252.60s
0s [ 1009.38s, 1199.99s, 1242.64s
0s [ 968.13s, 1191.64s, 1239.79s
0s [ 947.48s, 1187.47s, 1238.37s
0s [ 926.87s, 1183.30s, 1236.94s
0s [ 906.23s, 1179.13s, 1235.52s
Any thoughts are sincerely appreciated.