Transaction records missing on master but present on slave? Race condition?

My company has been running Percona Server 5.6.13-61.0 for the last year or so. The database handles order transactions and is replicated to a slave server (the exact same version) offsite.

Recently one of our clients reported that an order had seemingly disappeared. We would get reports like this sporadically – maybe one a month – so we looked into it more deeply. By chance, we discovered that the records for the order existed on the slave but were missing on the master! We then looked for other records on the slave that were missing on the master, and indeed there were several over the last year.

Digging deeper still, we examined what might be special about the orders missing on the master. What we discovered was that each one of them had another, unrelated transaction (which does show up on the master) timestamped at the exact same time. For example, two users on the site hit checkout at the very same second – that kind of scenario. Both get committed to the slave, but only one makes it to the master.

How is it possible for this to happen? It looks to me like some sort of race-condition bug. Is this a known issue in our version of Percona Server?

With MySQL replication, slave servers receive changes by reading and replaying the master server’s binlog, which records the changes that have been committed on the master.
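For instance, each server’s view of the binlog stream can be inspected directly from the client (a minimal sketch; the output fields will vary with your setup):

```sql
-- On the master: the binlog file and position currently being written.
SHOW MASTER STATUS;

-- On the slave: which master binlog events have been read and replayed
-- (see Master_Log_File, Read_Master_Log_Pos, Exec_Master_Log_Pos).
SHOW SLAVE STATUS\G
```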

  1. Please check whether any program writes directly to the slave. If log_bin is turned on on the slaves, check the server_id in the binlog. Or set read_only on the slave and see if any application reports an error. (A sketch of these checks follows the list.)
  2. If using statement-based replication, check the master’s error log for “Unsafe statement”. Those are DMLs that are not deterministic and could potentially result in different changes on master and slaves.
  3. Compare the table schemas between master and slaves, especially PKs, UKs, FK cascade options, and triggers.
  4. Check for any jobs on the master that set session sql_log_bin = 0, which could potentially change/delete order records.
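A minimal sketch of checks 1 and 4, run on the relevant server:

```sql
-- Check 1, on the slave: reject direct writes from ordinary clients
-- (replication threads and SUPER users are exempt from read_only).
SET GLOBAL read_only = ON;

-- Check 1, continued: every binlog event carries the server_id of the
-- server that originally executed it; note each server's id to compare.
SHOW VARIABLES LIKE 'server_id';

-- Check 4: confirm binary logging is not being skipped in this session.
SHOW VARIABLES LIKE 'sql_log_bin';
```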

1.) I have verified that no program is writing directly to the slave. The master is the only server connected to the webserver that customers order from, and there is no way the slave could have been given that transaction directly.

2.) We already scanned the MySQL error log for any entry at the time of the transaction, and there were none at all.

3.) I compared the schemas and they are identical. The only thing I noted is that one table involved in the transaction has no keys defined on it, and a DELETE is done against it. But that table is identical between master and slave.
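(For completeness: row data, not just schema, can be compared between master and slave. A minimal sketch using the built-in CHECKSUM TABLE statement – the table name is a placeholder, and the statement reads the whole table, so run it at a quiet moment:)

```sql
-- Run on both master and slave and compare the two checksums.
CHECKSUM TABLE orders;
```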

4.) There are no sql_log_bin = 0 statements.

The fact that every missing order we found on the slave that is absent on the master has another transaction at the exact same hh:mm:ss is, I believe, extremely significant. That can’t just be chance, can it?

Sorry I don’t have much help to offer, but I would be very interested in learning more about the issue you are describing.

Disclaimer: I’m not with Percona, and nowhere near as good as they are.

I would look first at the following:
Is the slave server set with log_slave_updates? Would you be able to compare the binary logs of the master and the slave to examine what was written to the binary log on both for a given orderID? If it’s on the slave, it must have been on the master at some point.
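For example, a quick way to line the two servers up from the client – the binlog file name below is a placeholder:

```sql
-- Confirm the slave logs replicated changes to its own binlog.
SHOW VARIABLES LIKE 'log_slave_updates';

-- List the available binlogs, then inspect the events around the
-- order's timestamp on both master and slave.
SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'mysql-bin.000042' FROM 4 LIMIT 100;
```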

Are you using statement-based or row-based replication?

What type of unique identifier does your orders table use? If it’s autoincremented, is there a gap on the master for the missing order?
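A minimal gap-finding sketch, assuming a table named orders with an auto-increment column id (both names are placeholders). Note that InnoDB also burns auto-increment values on rolled-back inserts, so a gap alone proves nothing – it is a gap lining up with a slave-only row that would matter:

```sql
-- List each id whose immediate successor is missing (candidate gaps).
SELECT t1.id + 1 AS gap_starts_at
FROM orders t1
LEFT JOIN orders t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL
  AND t1.id < (SELECT MAX(id) FROM orders);
```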

Maybe some more details about what’s happening in that transaction when an order is written would help.

Best of luck figuring this one out!

The slave is set with log_slave_updates.

I am using MIXED replication.

The unique identifier is a non-auto-increment integer.

It just happened again this weekend, with the exact same fingerprint – two transactions submitted to the master at the exact same hh:mm:ss, only one recorded on the master, and the other making its way to the slave. I am scrambling to figure it out, but the point is THIS SHOULD NOT BE POSSIBLE!

The thing is, I don’t think this has any real relationship to replication at all. It’s just that replication is what helped us verify something is very wrong. These transactions simply vanish on the master.

This continues to happen – the exact same pattern of transactions disappearing on the master while appearing on the slave! It only happens when two transactions against the same database commit at the exact same time, usually during peak periods when the server is under load.

I am willing to try anything, including updating to the latest version of Percona Server. If I updated from 5.6.13-61.0 to 5.6.21-70.0, would I need to dump and reload the databases, or is it as simple as updating the binaries?

I just did the last things you asked. I used mysqlbinlog to analyze the log on the master. The INSERT statements (and the COMMIT) for the transaction are all there, just as on the slave. But there is no trace of the records that should have been inserted in the master database.

I did more analysis, and one of the tables in the transaction does have an auto-increment primary key. And THERE IS A GAP on the master. The record that is missing on the master but exists on the slave would have had exactly that missing number in the sequence.

Ok, I think I have found another clue.

I ran SHOW ENGINE INNODB STATUS, and the output has a “LATEST DETECTED DEADLOCK” section listing a couple of statements I can tell are from a transaction missing on the master, marked “WAITING FOR THIS LOCK TO BE GRANTED”.

Then the last line of that section says “WE ROLL BACK TRANSACTION”.

There are two huge problems here:

1.) Records for that transaction, which I presume was rolled back on the master, exist on the slave but not on the master. THIS SHOULD NOT BE POSSIBLE??
2.) No error was sent to the application layer – the master acted as if the COMMIT worked.
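In case it helps anyone chasing this, InnoDB can be told to write every detected deadlock to the error log (the variable below exists in MySQL/Percona Server 5.6), rather than keeping only the latest one in SHOW ENGINE INNODB STATUS:

```sql
-- Record every deadlock in the error log, not just the most recent
-- one, so each occurrence can be matched against a missing order.
SET GLOBAL innodb_print_all_deadlocks = ON;
```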

Hello toon1,
I have contacted some of the tech team here; they don’t think you have encountered a bug on this occasion.
There are a lot of reasons you could be getting this behaviour, and it’s very difficult to diagnose from here.
However, their suggestion is that perhaps your app isn’t checking return codes on a lock failure/deadlock, or it might be misinterpreting an error’s meaning.
Perhaps you can investigate that first?
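For illustration: a deadlock victim is rolled back by InnoDB with error 1213 (SQLSTATE '40001'), and that failure has to be caught and retried, never treated as a successful COMMIT. A minimal sketch on the database side – the procedure and table names are hypothetical, and the same check belongs in the application’s driver layer:

```sql
DELIMITER //
CREATE PROCEDURE place_order(IN p_order_id INT)
BEGIN
  -- If any statement below is chosen as a deadlock victim, InnoDB
  -- raises SQLSTATE '40001'; catch it and re-raise so the caller
  -- knows the order was NOT committed and must retry.
  DECLARE EXIT HANDLER FOR SQLSTATE '40001'
  BEGIN
    ROLLBACK;
    RESIGNAL;  -- surface the deadlock instead of swallowing it
  END;

  START TRANSACTION;
  INSERT INTO orders (id) VALUES (p_order_id);  -- placeholder schema
  -- ... the rest of the order's statements ...
  COMMIT;
END //
DELIMITER ;
```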