Strange replication's stuck

Hello, i’ve got strange situation with gtid replication.
I’ve single master and 3 replicas, all’s went well for years. But suddenly, i notice, that Replication lag grows on all my replicas, while Slave_IO_Running and Slave_SQL_Running both YES and no errors in logs. Well, i research and found: all replicas successfuly get changes from master (i mean Retrieved_Gtid_Set constantly growing) but not applyes it (Executed_Gtid_Set stuck). I tried to stop / start slave - no changes. No error logs.
What else i could try to start replica?

Welcome back Deniska80,

Was is increased traffic?
Do you have PMM to observe if it s only happening at times or it is just not progressing?
Would you like to provide with os / db version, show replica status output, additional configuration options like parallelization?


Was is increased traffic?

No. Normal work. On master about 10-12K queries per second. My replication lag allways was about 0 second

Do you have PMM to observe if it s only happening at times or it is >just not progressing?

I have PMM and nagios, and monitor replication lag. It’s first time for years

Would you like to provide with os / db version, show replica status >output, additional configuration options like parallelization?

Mysql Percona 5.7.42-46 on Debian 11 on all hosts.

On MASTER (i leave just one replica now)

mysql> show slave hosts;
| Server_id | Host          | Port | Master_id | Slave_UUID                           |
|         5 | db5           | 3306 |         7 | ef021b00-0de0-11ed-996a-ac1f6bc56cd8 |

On replica

               Slave_IO_State: Waiting for master to send event
                  Master_Host: db7
                  Master_User: replicant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.005774
          Read_Master_Log_Pos: 54243412
               Relay_Log_File: db5-relay-bin.013740
                Relay_Log_Pos: 104199984
        Relay_Master_Log_File: mysql-bin.005085
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 104199811
              Relay_Log_Space: 72318318314
              Until_Condition: None
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
        Seconds_Behind_Master: 81947
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
               Last_SQL_Errno: 0
             Master_Server_Id: 7
                  Master_UUID: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba
             Master_Info_File: /mnt/nvmedisk/mysql/
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
           Retrieved_Gtid_Set: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba:51494973-301052241
            Executed_Gtid_Set: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba:1-268467117,
                Auto_Position: 1
1 row in set (0,00 sec)

you should inspect binary logs, what is likely happening is a long running transaction happened on the primary (for example ALTER of a large table)

Well, but if master allready comlete this transaction, i think that slaves could complete it too (replication lag is allready 18 hours)? Hardware is identical on all servers.

Some operations are online on the primary but block the replication thread. Again, try inspecting the binary log position to identify what is going on

Well, i try to inspect first relay log to find transaction next to last executed

SET @@SESSION.GTID_NEXT= '61257650-3280-11e8-b9ad-ac1f6b1eb1ba:268467118'/*!*/;
# at 104200049
#230704 15:44:22 server id 7  end_log_pos 104199963 CRC32 0xe662dc30    Query   thread_id=225265222     exec_time=0     error_code=0
SET TIMESTAMP=1688474662/*!*/;
# at 104200136
#230704 15:44:22 server id 7  end_log_pos 104200034 CRC32 0x12151ffa    Table_map: `bea`.`companies_statistics` mapped to number 580
# at 104200207
#230704 15:44:22 server id 7  end_log_pos 104208238 CRC32 0x35bc4f2c    Delete_rows: table id 580
# at 104208411
#230704 15:44:22 server id 7  end_log_pos 104216442 CRC32 0x5fccedd3    Delete_rows: table id 580
# at 104216615
#230704 15:44:22 server id 7  end_log_pos 104224646 CRC32 0xd54df9e7    Delete_rows: table id 580
a lot of Delete_rows
#230704 15:44:22 server id 7  end_log_pos 110771438 CRC32 0x4fa1d116    Delete_rows: table id 580
# at 110771611
#230704 15:44:22 server id 7  end_log_pos 110772502 CRC32 0xd207d93d    Delete_rows: table id 580 flags: STMT_END_F

Nothing suspicoious, as i can see. It’s the last transaction in this relay file. Next files start with next trannsaction…

Hi @Deniska80

Based on the output shared from your relaylog, I would guess the following:

  • Your table bea.companies_statistics has (or had) a lot of records.
  • A DELETE statement was issued in your primary node (Maybe deleted all the records).
  • Your binlog_format is configured as ROW in the primary.

Let’s say you had 100 million records on your table, you deleted with a single statement in the primary the whole table; this translates into 100 million deletes in the binary (and thus, the same number of deletes has to be executed in the replica).

You could also check the binlog size in the primary, I’m guessing this one may be bigger than the other binlogs.

If the assumptions are correct, then waiting is the best option here; nothing much to speed up your replication.

There could be other approaches, but with added risk.


1 Like

I checked binlog on master, all files with the same size (just a little different, some kilobytes).

This table not huge, about 11M records, and that transaction deleted about 300K records.

I allready restore replication on all of my slaves (with full backup with xtrabackup and copy dump to slave).

Hello @Deniska80,

I’d leave here with few references that may be relevant for slow replicas:

for setting up multi threaded replication.

for doing data archival (purge) we recommend using pt-archiver


… As i mentioned before, I leaved one of my replicas in this “stuck” situation, just for case… and ooops, after 41h of waiting it starts to execute relayed operations

It’s realy strange for me, i just can’t understand, how opeartion, that takes few seconds on master can take almost 2 days on slaves.

Try to follow links above.

Thanks to all