Percona cluster upgrade from MySQL 5.7 to MySQL 8.0 resulted in data loss

Hey folks, we have a currently running Percona MySQL 5.7.39 cluster which we are attempting to upgrade to Percona MySQL 8.0.

We did the following:

  1. Made an xtrabackup
  2. Restored that backup to the MySQL 8.0 cluster
  3. Set up binlog replication between the two
  4. Checked and verified replication was up to date (or so we thought; see the check sketched just below)
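
For step 4 we mostly relied on SHOW REPLICA STATUS; a stricter check (just a sketch, with the source's gtid_executed value as a placeholder) would have been to compare the executed GTID sets directly:

-- On the 5.7 source:
SELECT @@GLOBAL.gtid_executed;

-- On the 8.0 replica; an empty result means everything executed on the source has been applied here:
SELECT GTID_SUBTRACT('<source gtid_executed value>', @@GLOBAL.gtid_executed) AS missing_gtids;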
During binlog replication we ran into this error (once):
2024-08-09T18:37:11.150622Z 194 [ERROR] [MY-010584] [Repl] Replica SQL for channel '': Worker 1 failed executing transaction 'guid' at source log mysql-bin.000874, end_log_pos 48065054; Could not execute Write_rows event on table <db.table>; Duplicate entry '476162808' for key 'table.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's source log mysql-bin.000874, end_log_pos 48065054, Error_code: MY-001062

2024-08-09T18:37:11.150744Z 193 [Warning] [MY-010584] [Repl] Replica SQL for channel '': ... The replica coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756

But checking the replica status (show replica status\G) showed that it seemed to have caught up:


 mysql: [Warning] Using a password on the command line interface can be insecure.
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: <master host>
                  Master_User: <master_user>
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000874
          Read_Master_Log_Pos: 43351392
               Relay_Log_File: mysql-relay.000007
                Relay_Log_Pos: 43349029
        Relay_Master_Log_File: mysql-bin.000874
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 43348893
              Relay_Log_Space: 43351821
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:  <path to ca>.pem
           Master_SSL_CA_Path:
              Master_SSL_Cert:  <path to key>.pem
            Master_SSL_Cipher: <list of ssl ciphers>
               Master_SSL_Key: <path to key>.pem
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: <server id>
                  Master_UUID: <master uuid>
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: <gtids>
            Executed_Gtid_Set: <list of gtids>
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version: TLSv1.2
       Master_public_key_path:
        Get_master_public_key: 0
            Network_Namespace:

So when cutting over to this new source we found that records had been lost, even though the binlog suggests otherwise. Have you seen issues where this happens?

Are you using GTID-based replication or binlog coordinates? If you are not using GTID, I suggest switching to it, as that removes the possibility of entering the wrong binlog position and causing duplicate transactions.

We are using GTIDs here.

Our command (edited to remove identifiable and sensitive info):
CHANGE MASTER TO MASTER_HOST='mysql57 host', MASTER_PORT=3306, MASTER_USER='user with perms', MASTER_PASSWORD='thepassword', MASTER_AUTO_POSITION=1, MASTER_SSL=1, MASTER_SSL_KEY='ssl file', MASTER_SSL_CERT='mastersslcert', MASTER_SSL_CA='mastersslca', MASTER_TLS_VERSION='TLSv1.2', MASTER_SSL_CIPHER='ciphers';

Additional configs from my.cnf:
log_slave_updates=ON
enforce_gtid_consistency=ON
gtid_mode=ON
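
A quick way to confirm these settings are active on both servers (just a sanity-check sketch):

SELECT @@GLOBAL.gtid_mode, @@GLOBAL.enforce_gtid_consistency, @@GLOBAL.log_slave_updates;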

Any updates here? Using GTIDs and seeing the replica threads running would indicate to me that they were working.

Are there any logs we could take a look at, or debug logging we could enable, so we can see what's missing?
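
A couple of standard MySQL 8.0 places that can show more replication detail (just a sketch using stock performance_schema tables and error-log settings, nothing Percona-specific):

-- Per-worker applier status, including the last error each worker hit:
SELECT CHANNEL_NAME, WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;

-- Make the error log more verbose (3 = errors, warnings and notes):
SET GLOBAL log_error_verbosity = 3;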

You should not restore the 5.7 backup directly on the 8.0 instance; there have been major changes between the versions. I suggest you restore the 5.7 backup on another cluster, run the upgrade checker utility, and make sure the database is compatible for upgrading to 8.0. Once the upgrade checker comes back clean, you can do an in-place upgrade. This is the proper upgrade path to 8.0.
It is difficult to say from the data you provided what went missing.
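
For reference, the upgrade checker mentioned above is part of MySQL Shell; from a mysqlsh session it can be run with something like the following (connection details are placeholders):

util.checkForServerUpgrade('<user>@<mysql57 host>:3306')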

I understand there have been major changes. Perhaps it wasn’t clear, but here are the precise steps involved.

  1. We have a currently running Percona MySQL 5.7 cluster.
  2. We took a fresh backup from that cluster and dropped it onto a new cluster that was to run MySQL 8.0.
  3. We ran the upgrade checker and everything seemed OK.
  4. We upgraded the new cluster in place to MySQL 8.0.
  5. The restore worked fine, and the cluster started.
  6. We set up binlog replication using GTIDs between the two clusters (5.7 and 8.0).
  7. At this point we saw missing data: the restored data was fine, but binlog updates seemed to go missing.
  8. The only warning we saw is the log snippet I posted above.

We followed the proper path for upgrading; I am more asking why the binlog replication went awry, or whether there is some increased logging we can add to make our lives easier.
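
One narrowing check (a sketch; '<failing gtid>' stands for the GTID that was redacted as 'guid' in the worker error above) is to ask the replica whether it considers that transaction applied:

-- On the 8.0 replica; returns 1 if the transaction is in the executed set, 0 if not:
SELECT GTID_SUBSET('<failing gtid>', @@GLOBAL.gtid_executed);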

It is difficult to say without checking the backup and restore logs; it could be anything. It is possible that the backup was not consistent and that some data was missing, or that there were errors during the backup or restore.
The steps you mentioned are fine. It would have been worth checking what that duplicate entry was when replication broke, and whether the same data exists on the replica.
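
For example (a sketch, using the id from the duplicate-key error; the db/table and key column were redacted in the post, so they are placeholders here):

-- Run on both the 5.7 source and the 8.0 replica and compare the results:
SELECT * FROM <db>.<table> WHERE <primary key column> = 476162808;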