Hi All,
We are trying to switch from traditional replication to GTID-based replication using the online GTID deployment method that Percona Server 5.6 offers.
Before I describe the issue in more detail, here is some information about our setup.
We have 4 MySQL instances in a multi-tier replication setup, running Debian 6 or 8.
Server 1: Master on Debian 6 with Percona Server 5.6.24-72.2-1.squeeze
Server 2: Slave to Server 1 on Debian 6 with Percona Server 5.6.24-72.2-1.squeeze
Server 3: Slave to Server 1 and reporting master on Debian 8 with Percona Server 5.6.27-75.0-1.jessie
Server 4: Slave to Server 3 on Debian 8 with Percona Server 5.6.27-75.0-1.jessie
On Monday we enabled gtid_mode and gtid_deployment_step on Server 4, and a day later on Server 3. Servers 1 and 2 are planned to follow next week with a master switch to Server 2; we also plan to upgrade them to Debian 8.
Everything ran smoothly until yesterday, when Server 4 began to stop processing queries from time to time. Server 3 started showing the same behaviour today.
strace revealed that Percona Server scans the entire binlog during these periods. New connections are possible, but as I said, no queries are processed until the scan is complete, and that can take a few minutes.
Sometimes, but only on Server 3, Percona Server logs the following at around the same time as the hang occurs:
[ERROR] Error in Log_event::read_log_event(): read error, data_len: 8191, event_type: 30
[Warning] Error reading GTIDs from binary log: -1
This is a little strange, since Server 1 does not have gtid_mode enabled.
One detail that just struck me: our binlog retention is 2 days, so it can hardly be a coincidence that the problems appeared 2 days after making the change.
I suspect that gtid_deployment_step is just not built for “long term” use and that proceeding with the deployment could solve the problem.
But I cannot be sure, and thus I’m a bit reluctant to proceed.
The only way back seems to be purging the binlogs on Server 3 and 4, because when they are restarted with gtid_mode disabled they complain about GTIDs in the binlog and refuse to replicate …
Can someone confirm my suspicion or help me in any other way?
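To check whether these messages really line up with the hang periods, a rough script like the one below could pull their timestamps out of the error log. This is only a sketch: it assumes the usual MySQL 5.6 error log layout of "YYYY-MM-DD HH:MM:SS <thread id> [LEVEL] message", and the function name find_gtid_read_errors is made up for illustration.

```python
import re
from datetime import datetime

# Assumed error-log layout (not taken from our actual log files):
#   "YYYY-MM-DD HH:MM:SS <thread id> [LEVEL] message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ "
    r"\[(?P<level>\w+)\] (?P<msg>.*)$"
)

# Message prefixes of the two log entries quoted above.
PATTERNS = (
    "Error in Log_event::read_log_event()",
    "Error reading GTIDs from binary log",
)


def find_gtid_read_errors(lines):
    """Return (timestamp, level, message) for each matching log line."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # line does not follow the assumed layout
        message = match.group("msg")
        if any(pattern in message for pattern in PATTERNS):
            timestamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            hits.append((timestamp, match.group("level"), message))
    return hits
```

Feeding it the lines of the error log (for example via open(path) on whatever file log_error points to) gives a list of timestamps that can be compared against the times the hangs were observed.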
Thanks in advance!