replication lag

Any ideas on how to identify the source of a problem of replication delay?

Here’s the scenario:

We have a master-slave setup with MySQL 5.0.45 on RHEL64. I am experiencing some serious replication lag that seems to occur for undetermined reasons. There’s very little, if any, activity on the slave aside from replication itself.

InnoDB on the master has 8GB of RAM and the machine itself has dual-processors with quad-cores. On the slave, InnoDB has 12GB of RAM and it also has dual-procs with quad-cores. Both have 15K SCSI drives — they’re beefy machines to say the least.

A Java application adds a large number of records on the master and it gets replicated to the slave from where another Java application reads.

Replication lag occurs and it only increases unless I stop the application on the Master that adds data. Replication eventually catches up, but very slowly.

The main table being updated has around 200M records, so it’s not small, but the hardware should be very sufficient.

innodb_flush_log_at_trx_commit has been changed to various settings but none seems to make a difference.

I installed the mysqltoolkit for the heartbeat monitor application and it reports some oddities. It goes from being 1200 seconds behind to zero instantly when the Java application that’s adding data on the master is stopped.

Here’s an example of the output from the mysql-heartbeat application that measures replication lag:

1234s [ 1237.02s, 1246.21s, 1258.56s
1235s [ 1236.93s, 1246.15s, 1258.51s
1233s [ 1236.80s, 1246.08s, 1258.45s
1234s [ 1236.67s, 1246.02s, 1258.40s
1235s [ 1236.60s, 1245.96s, 1258.35s
1236s [ 1236.53s, 1245.90s, 1258.30s
0s [ 1215.85s, 1241.73s, 1256.87s
0s [ 1195.22s, 1237.55s, 1255.45s
0s [ 1174.57s, 1233.38s, 1254.03s
0s [ 1153.90s, 1229.21s, 1252.60s
0s [ 1009.38s, 1199.99s, 1242.64s
0s [ 968.13s, 1191.64s, 1239.79s
0s [ 947.48s, 1187.47s, 1238.37s
0s [ 926.87s, 1183.30s, 1236.94s
0s [ 906.23s, 1179.13s, 1235.52s

Any thoughts are sincerely appreciated.

What is the CPU utilization on the master and the slave?

If you use replication and you have multiprocessor/core machines and you have a very high insert/update rate you could have hit a bottleneck in MySQLs replication.

The problem is that MySQL’s replication is only performed within one thread so it can basically only use the performance of one core.
While on the master, inserts are usually performed through a lot of threads, so it can use the performance of all the core’s and hence achieve a higher performance.

So as long as the insert/update load is higher than what one core on the slave can handle the replication will fall behind.

Unfortunately there isn’t any configuration change you can do to MySQL to solve this problem.
The only way to circumvent it is to put the replication into your application so that it is spread out over several threads and uses more than one core.