Master-master replication & failover issue


Sorry if this isn’t the right place to ask this, but the Percona name keeps popping up on Google.

So I have 2 servers dedicated to run to MySQL server 5.5 on QNX, configured with Master-Master replication (asynchronous). We’re using just one server to run queries, you could say that we have an Active / Passive mode.

I have been told the server suddenly can’t perform queries and the connection to the database dies. The client program is written in C and we’re thinking of doing some kind of fail over when the connection dies, maybe on application level. But I’m sure a load balancer would be a better solution.

In any case, this switch to a new master node could have problems due to replication lag, there could be data loss and maybe un-synced tables.

I have read that MySQL 5.5 has semi-sync replication, would this be a good solution to have after switching the database node when it fails??

I would check why the primary server is having such problems in the first place. Does it crash or just won’t allow more connections due to max_connections limit reached?
In the first case, depending on crash nature you may want to switch to the second master and restore the first one later.
To deal with lag you can stop the application, wait until second master has applied all events it managed to get from first master, and then switch application to new master.
But in case the first master is overloaded and some bad queries piled up (or relatively fast queries amount no longer can be processed by server in time), you most likely will end up with the same situation on the second master. The situation may be even worse as the second master has cold buffers.
The general proper way of scaling your reads is to:
1 - optimize queries
2 - add read-only slaves for reads traffic

If you want multi-master setup, with (almost) no lag possible and no problem with master - slave switchovers - there are ready solutions based on Galera replication. Percona XtraDBCluster for instance. It’s synchronous replication where each node is equal, so you can load-balance easily. With semi-sync replication you only get more data safety but still promoting slave is the same difficult as in normal replication.