Resync mysql replica in a GTID setup gives error HA_ERR_FOUND_DUPP_KEY with AUTOPOSITION

Hello

I use a procedure based on this page for years to sync a new replica or resync a broken replica without issue (I automated it with ansible).
This uses xtrabackup to save and send the mysql data to the slave, apply the redo log, and restart the slave with MASTER_AUTO_POSITION. Simple and effective.
xtrabackup is convenient for us, due to the size of the databases we have to transfer. we use this way xtrabackup --backup --stream=xbstream --parallel=$(NB_PROC/2)

mysql: 5.7.42
xtrabackup: 2.4.29

lately when I start the slave I get such error

                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table db_2.webhook_message_status; Duplicate entry '3304591' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001615, end_log_pos 5021
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4692
              Relay_Log_Space: 143514642
...
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Write_rows event on table db_2.webhook_message_status; Duplicate entry '3304591' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001615, end_log_pos 5021
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 189087
                  Master_UUID: 144c0164-3223-11ef-8319-74563c5c838d
             Master_Info_File: mysql.slave_master_info

Looking on slave I confirm the entry with id 3304591 is already there

+---------+---------------------+---------+--------+-----------------------------+---------------------+
| id      | message_external_id | site_id | status | context                     | created_at          |
+---------+---------------------+---------+--------+-----------------------------+---------------------+
| 3304591 | xxxxxxxxxxxxxxxxxxx |     483 | read   | {"code":200,"title":"read"} | 2024-09-19 07:52:00 |
+---------+---------------------+---------+--------+-----------------------------+---------------------+

so it seems the slave does not know anymore to properly position itself.

As the workaround I get the content of xtrabackup_binlog_info

mysql-bin.001615        73610932        144c0164-3223-11ef-8319-74563c5c838d:1-14071690

and I did this

mysql> reset master
mysql> set global GTID_PURGED="144c0164-3223-11ef-8319-74563c5c838d:1-14071690"
mysql> start slave

and now it works

             Master_Server_Id: 189087
                  Master_UUID: 144c0164-3223-11ef-8319-74563c5c838d
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
...
           Retrieved_Gtid_Set: 144c0164-3223-11ef-8319-74563c5c838d:14047201-14248907
            Executed_Gtid_Set: 144c0164-3223-11ef-8319-74563c5c838d:1-14248907
                Auto_Position: 1

Do you have an idea about what could be the cause of this problem? It used to work fine, during the last 3 years we use without an issue. We did not changed major version of MySQL, or anytool involved.
is the “workaround” I’m doing is fine ? Do I have all master data on slave ?

best

Hello @Baptiste_Mille-Mathi,
Run through your restore steps manually, one at a time. After prepare/restore on replica, check the gtid_purged before you start replication. Does this match what’s inside the xtrabackup_binlog_info file?

The replication error says end_log_pos 5021 but your binlog_info file says to start at 73610932. There’s something not aligned with your procedure. Take ansible out of the equation and do everything manually and see where it breaks down.

Hi @matthewb

following your advice, I did almost all steps manually (all firewall opening, data sync, and data prepare were done using ansible, considering the amount of tasks it involves)

here the results after each steps

step: after data was sync to slave from xtrabackup.

root@slave:~# cat /var/lib/mysql/xtrabackup_binlog_info                    
mysql-bin.000004        42207931        c4501c2f-2e49-11ef-a280-d05099fde95a:1-721169

step: after starting mysql slave service (but before start slave step)

(root@slave) [(none)]> select @@gtid_purged;
+-----------------------------------------------+       
| @@gtid_purged                                 |       
+-----------------------------------------------+       
| c4501c2f-2e49-11ef-a280-d05099fde95a:1-641168 |       
+-----------------------------------------------+                                 
# so it is the value when I stopped manually the replica few couples of hours ago
                                                                                          
(root@slave) [(none)]> select * from mysql.gtid_executed;    
+--------------------------------------+----------------+--------------+ 
| source_uuid                          | interval_start | interval_end | 
+--------------------------------------+----------------+--------------+ 
| c4501c2f-2e49-11ef-a280-d05099fde95a |              1 |       641168 | 
+--------------------------------------+----------------+--------------+ 

step: change master

CHANGE MASTER TO MASTER_HOST='master', MASTER_USER='xxxxxxxxx', MASTER_SSL=1, MASTER_PASSWORD='xxxxxxxx', MASTER_AUTO_POSITION=1;

*step: show gtid variables again

(root@slave) [(none)]> select @@gtid_purged;
+-----------------------------------------------+       
| @@gtid_purged                                 |       
+-----------------------------------------------+       
| c4501c2f-2e49-11ef-a280-d05099fde95a:1-641168 |       
+-----------------------------------------------+                                 
# so it is the "old" value when I stop manually the replica few couples of hours ago
                                                                                          
(root@slave) [(none)]> select * from mysql.gtid_executed;    
+--------------------------------------+----------------+--------------+ 
| source_uuid                          | interval_start | interval_end | 
+--------------------------------------+----------------+--------------+ 
| c4501c2f-2e49-11ef-a280-d05099fde95a |              1 |       641168 | 
+--------------------------------------+----------------+--------------+ 

(root@slave) [(none)]> show slave status\G 
*************************** 1. row ***************************              
               Slave_IO_State:                                              
                  Master_Host: master                
                  Master_User:                                              
                  Master_Port: 3306                                         
                Connect_Retry: 60                                           
              Master_Log_File:                                              
          Read_Master_Log_Pos: 4                                            
               Relay_Log_File: slave-relay-bin.000001           
                Relay_Log_Pos: 4                                            
        Relay_Master_Log_File:                                              
             Slave_IO_Running: No                                           
            Slave_SQL_Running: No                                           
              Replicate_Do_DB:                                              
          Replicate_Ignore_DB:                                              
           Replicate_Do_Table:                                              
       Replicate_Ignore_Table:                                              
      Replicate_Wild_Do_Table:                                              
  Replicate_Wild_Ignore_Table:                                              
                   Last_Errno: 0                                            
                   Last_Error:                                              
                 Skip_Counter: 0                                            
          Exec_Master_Log_Pos: 0                                            
              Relay_Log_Space: 154                                          
              Until_Condition: None                                         
               Until_Log_File:                                              
                Until_Log_Pos: 0                                            
           Master_SSL_Allowed: Yes                                          
           Master_SSL_CA_File:                                              
           Master_SSL_CA_Path:                                              
              Master_SSL_Cert:                                              
            Master_SSL_Cipher:                                              
               Master_SSL_Key:                                              
        Seconds_Behind_Master: NULL                                         
Master_SSL_Verify_Server_Cert: No                                           
                Last_IO_Errno: 0                                            
                Last_IO_Error:                                              
               Last_SQL_Errno: 0                                            
               Last_SQL_Error:                                              
  Replicate_Ignore_Server_Ids:                                              
             Master_Server_Id: 0                                            
                  Master_UUID:                                              
             Master_Info_File: mysql.slave_master_info                      
                    SQL_Delay: 0                                            
          SQL_Remaining_Delay: NULL                                         
      Slave_SQL_Running_State:                                              
           Master_Retry_Count: 86400                                        
                  Master_Bind:                                              
      Last_IO_Error_Timestamp:                                              
     Last_SQL_Error_Timestamp:                                              
               Master_SSL_Crl:                                              
           Master_SSL_Crlpath:                                              
           Retrieved_Gtid_Set:                                              
            Executed_Gtid_Set: c4501c2f-2e49-11ef-a280-d05099fde95a:1-641168
                Auto_Position: 1                                            
         Replicate_Rewrite_DB:                                              
                 Channel_Name:                                              
           Master_TLS_Version:                                              

step: start slave

*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: master
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 39711620
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 367
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1062
                   Last_Error: Could not execute Write_rows event on table xxxxx.Messages; Duplicate entry '47763668' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 473
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 194
              Relay_Log_Space: 39712012
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1062
               Last_SQL_Error: Could not execute Write_rows event on table xxxxx.Messages; Duplicate entry '47763668' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000004, end_log_pos 473
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 4448098
                  Master_UUID: c4501c2f-2e49-11ef-a280-d05099fde95a
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 240923 12:27:19
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: c4501c2f-2e49-11ef-a280-d05099fde95a:641169-715915
            Executed_Gtid_Set: c4501c2f-2e49-11ef-a280-d05099fde95a:1-641168
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
                    

so if I understand correctly the slave is restarting from gtid 641169 despite having being synched to 721169 at the start of the procedure.

On your side, you referred to binary log file position discrepanct, I thought once transitioning to gtid replication, it was not used anymore. Can you clarify this point ?

as a workaround, I found pt-slave-restart could bypass all duplicated keys issue but I’d like to fix correctly the issue.

Yes, it is strange that your restored backup’s gtid_executed is not aligned with the data from the backup binlog info file. But those should be aligned, and then you won’t have any duplicate data issues.