My company has been running Percona Server 5.6.13-61.0 for the last year or so. The database handles order transactions and is replicated to an offsite slave server running the exact same version.
Recently one of our clients reported that an order had seemingly disappeared. We had been getting reports like this sporadically, maybe one a month, so we looked into it more closely. By chance, we discovered that the records for the order existed on the slave but were missing on the master! We then looked for other records that were on the slave but missing on the master, and indeed there were several over the last year.
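A minimal sketch of that kind of comparison, assuming the order IDs have already been exported from each server (for example with a `SELECT id` on the orders table on each side). The function name and the sample IDs are hypothetical and only illustrate the idea:

```python
def missing_on_master(master_ids, slave_ids):
    """Return the sorted IDs that exist on the slave but not on the master."""
    return sorted(set(slave_ids) - set(master_ids))


# Hypothetical sample data standing in for IDs exported from each server.
master_ids = [101, 102, 104, 105]
slave_ids = [101, 102, 103, 104, 105]

print(missing_on_master(master_ids, slave_ids))
```

With the sample data above, the only ID reported is 103, the one present on the slave and absent on the master.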
Digging further, we looked at what might be special about the orders missing on the master. What we discovered was that each of them had another, unrelated transaction (which does show up on the master) timestamped at exactly the same time. For example, say two users on a site are ordering and they both hit checkout at the very same moment. Both orders get committed to the slave, but only one makes it to the master.
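The timestamp pattern can be checked along these lines. This is only a sketch, assuming each order has been exported as an (order ID, timestamp) pair; the function name, row layout, and sample values are all hypothetical:

```python
from collections import defaultdict


def colliding_orders(master_rows, missing_rows):
    """Map each missing order ID to the master order IDs sharing its timestamp.

    Each row is an (order_id, timestamp) pair; timestamps are compared as-is.
    Missing orders with no matching timestamp on the master are left out.
    """
    by_timestamp = defaultdict(list)
    for order_id, ts in master_rows:
        by_timestamp[ts].append(order_id)
    return {
        order_id: by_timestamp[ts]
        for order_id, ts in missing_rows
        if ts in by_timestamp
    }


# Hypothetical sample rows: order 103 is missing on the master and shares
# its timestamp with order 102, which is present there.
master_rows = [(101, "12:00:05"), (102, "12:07:41")]
missing_rows = [(103, "12:07:41")]

print(colliding_orders(master_rows, missing_rows))
```

If every missing order appears in the result, that is consistent with the pattern described above, though it does not by itself prove a race condition.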
How is it possible for this to happen? It looks to me like some sort of race condition bug. Is this a known issue in our version of Percona Server?