GTID Migration

A little background:
masters running version 5.6.17-65.0-rel65.0-log no GTID enabled
replicas running version 5.6.29-76.2-log

We are looking to migrate all our servers to GTID, and I have set up many of them already in preparation. Basically, I take an old master and set up a new slave pointing to it (the master does not have GTID enabled). I run SET GLOBAL gtid_deployment_step = ON on the new replica before syncing it up from an xtrabackup backup. This all works fine: the database syncs up and catches up with no errors, and I can see the GTID information in both SHOW MASTER STATUS and SHOW SLAVE STATUS.
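For reference, the replica-side sequence looks roughly like this (a sketch only; the host name is a placeholder, the file/position come from the backup's xtrabackup_binlog_info, and gtid_deployment_step is the Percona Server 5.6 variable):

```sql
-- On the new replica, before starting replication
-- (gtid_deployment_step is Percona Server-specific):
SET GLOBAL gtid_deployment_step = ON;

-- After restoring the xtrabackup copy, point the replica at the old
-- (non-GTID) master using the coordinates recorded by the backup:
CHANGE MASTER TO
  MASTER_HOST = 'old-master.example.com',   -- placeholder host
  MASTER_USER = 'repl',
  MASTER_LOG_FILE = 'mysqld-bin.001440',    -- from xtrabackup_binlog_info
  MASTER_LOG_POS  = 629057388;              -- from xtrabackup_binlog_info
START SLAVE;
```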

What we found was that after somewhere between one and three weeks (it is hard to tell exactly when), the GTID information seems to stop being used. SHOW MASTER STATUS and SHOW SLAVE STATUS both stop showing GTID information. When I check the GTID variables with SHOW VARIABLES LIKE '%gtid%', everything looks OK, except that gtid_executed is blank. That variable is blank even on a working server, so I am not sure it is actually a problem; on a functioning server, however, gtid_purged would have a GTID value. That is, I get:

mysql> show variables like '%gtid%';
| Variable_name | Value |
| binlog_gtid_simple_recovery | OFF |
| enforce_gtid_consistency | ON |
| gtid_deployment_step | ON |
| gtid_executed | |
| gtid_mode | ON |
| gtid_next | AUTOMATIC |
| gtid_owned | |
| gtid_purged | |
| simplified_binlog_gtid_recovery | OFF |

We have rerun the same process on 7 different clusters. The oldest five seem to have lost GTID usage, while the latest two we synced up are still functioning as expected. I have also looked at the MySQL logs and there is no indication of issues. Replication itself seems to work just fine (the master binlog position on the replica matches the binlog position on the master).

I have looked all over for a bug of this type but so far no hits.
Any ideas or suggestions from the members here?
Many thanks,

Can you paste the full SHOW SLAVE STATUS\G output from the replica in question?

Thanks for your help, Jrivera. As requested, here is the slave status; normally I would have values for Retrieved_Gtid_Set and Executed_Gtid_Set:

mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysqld-bin.001440
Read_Master_Log_Pos: 629057388
Relay_Log_File: mysqld-relay-bin.000294
Relay_Log_Pos: 629057583
Relay_Master_Log_File: mysqld-bin.001440
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Skip_Counter: 0
Exec_Master_Log_Pos: 629057388
Relay_Log_Space: 629057836
Until_Condition: None
Until_Log_Pos: 0
Master_SSL_Allowed: No
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_SQL_Errno: 0
Master_Server_Id: 168160
Master_UUID: f34d3ed9-eba6-11e3-8382-22000b2b010f
Master_Info_File: /data/mysql/
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Auto_Position: 0
1 row in set (0.00 sec)

I thought I would try something with the replica that showed symptoms of the issue, and chained another replica off it. The backup from the first-tier replica did not include GTID information (taken with innobackup) even though GTID is turned on, so I ran CHANGE MASTER TO with MASTER_AUTO_POSITION=1, giving the binlog name from the backup but leaving the position out. I started the slave and, to my surprise, the slave threads connected: SHOW SLAVE STATUS shows Slave_IO_Running: Yes and Slave_SQL_Running: Yes, no errors, the Master_Log_File is correct, and Read_Master_Log_Pos keeps up with the first-tier replica. It looked completely clean to me, except I expected Retrieved_Gtid_Set and Executed_Gtid_Set to have values. The other unexpected behaviour was that this backup was from 3 days ago on a busy server, so I expected Seconds_Behind_Master to be in the days while it caught up, but it connected at 0 seconds behind.
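For clarity, the chained replica was set up roughly like this (a sketch; the host name is a placeholder):

```sql
-- Second-tier replica pointed at the problem first-tier replica.
-- With MASTER_AUTO_POSITION = 1 the server is supposed to negotiate
-- the start point via GTIDs rather than a file/position pair:
CHANGE MASTER TO
  MASTER_HOST = 'first-tier-replica.example.com',  -- placeholder host
  MASTER_USER = 'repl',
  MASTER_AUTO_POSITION = 1;
START SLAVE;
```

(Note that MySQL 5.6 normally rejects MASTER_LOG_FILE combined with MASTER_AUTO_POSITION = 1, so in practice only the auto-position setting would take effect here.)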

Expecting that something was still wrong, I decided to count rows in a busy table and found that the table was missing 40K rows. I ran the server for another day: the 40K row gap stayed consistent, while SHOW SLAVE STATUS continued to show the replica following the master's log position. In other words, I was able to connect a replica to another server without any error, but had in fact lost a lot of data…not cool.
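The check that exposed the data loss was nothing more than a row count run on both servers (a sketch; db.busy_table is a placeholder name):

```sql
-- Run on both the first-tier replica and the chained replica and
-- compare; on the chained replica this came back ~40K rows short
-- even though SHOW SLAVE STATUS reported no errors:
SELECT COUNT(*) FROM db.busy_table;
```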