Hello, I’ve got a strange situation with GTID replication.
I have a single master and 3 replicas, and everything went well for years. But suddenly I noticed that replication lag is growing on all my replicas, while Slave_IO_Running and Slave_SQL_Running are both YES and there are no errors in the logs. After some research I found that all replicas successfully receive changes from the master (Retrieved_Gtid_Set is constantly growing) but do not apply them (Executed_Gtid_Set is stuck). I tried STOP SLAVE / START SLAVE, but nothing changed. No error logs.
What else could I try to get the replica applying again?
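(For context, the set of transactions that has been received but not yet applied can be computed on the replica roughly like this; just a sketch using the built-in GTID functions:)

mysql> SELECT GTID_SUBTRACT(received_transaction_set, @@GLOBAL.gtid_executed)
    ->          AS not_yet_applied
    ->   FROM performance_schema.replication_connection_status;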
Welcome back Deniska80,
Was there increased traffic?
Do you have PMM to observe whether it only happens at times or is just not progressing at all?
Could you provide the OS / DB version, the SHOW REPLICA STATUS output, and any additional configuration options such as parallelization?
Thanks,
K
> Was there increased traffic?
No, a normal workload: about 10-12K queries per second on the master. My replication lag has always been around 0 seconds.
> Do you have PMM to observe whether it only happens at times or is just not progressing at all?
I have PMM and Nagios and I monitor replication lag. This is the first time in years.
> Could you provide the OS / DB version, the SHOW REPLICA STATUS output, and any additional configuration options such as parallelization?
Percona Server for MySQL 5.7.42-46 on Debian 11 on all hosts.
On the MASTER (I have left just one replica connected for now):
mysql> show slave hosts;
+-----------+------+------+-----------+--------------------------------------+
| Server_id | Host | Port | Master_id | Slave_UUID                           |
+-----------+------+------+-----------+--------------------------------------+
|         5 | db5  | 3306 |         7 | ef021b00-0de0-11ed-996a-ac1f6bc56cd8 |
+-----------+------+------+-----------+--------------------------------------+
On the replica:
Slave_IO_State: Waiting for master to send event
Master_Host: db7
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.005774
Read_Master_Log_Pos: 54243412
Relay_Log_File: db5-relay-bin.013740
Relay_Log_Pos: 104199984
Relay_Master_Log_File: mysql-bin.005085
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 104199811
Relay_Log_Space: 72318318314
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 81947
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 7
Master_UUID: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba
Master_Info_File: /mnt/nvmedisk/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Reading event from the relay log
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba:51494973-301052241
Executed_Gtid_Set: 61257650-3280-11e8-b9ad-ac1f6b1eb1ba:1-268467117,
6990c29b-845c-11eb-afba-ac1f6bba0546:1-7384031159,
74f8eae9-8b42-11e9-926e-ac1f6bba0546:1-4998651585,
8a4941c2-4cb2-11ea-bf93-ac1f6b1b7751:1-792724123,
9eb9cd4e-eb7a-11eb-bfcf-3cecef0d25ac:1-52,
c2d3c5b5-6f94-11e8-9239-0cc47a4ce744:1-220428248,
ef021b00-0de0-11ed-996a-ac1f6bc56cd8:1-7348802023
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0,00 sec)
You should inspect the binary logs. What is likely happening is that a long-running transaction occurred on the primary (for example, an ALTER of a large table).
Well, if the master has already completed this transaction, I would think the slaves should be able to complete it too (replication lag is already 18 hours)? Hardware is identical on all servers.
Some operations run online on the primary but block the replication (SQL) thread on the replica. Again, try inspecting the binary log at that position to identify what is going on.
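For example, something along these lines, using the file and position from the replica status above (adjust the paths to wherever your binlogs / relay logs live):

# On the primary: decode events starting at the replica's last executed position
mysqlbinlog --verbose --base64-output=decode-rows \
    --start-position=104199811 mysql-bin.005085 | less

# Or decode the current relay log directly on the replica
mysqlbinlog --verbose --base64-output=decode-rows db5-relay-bin.013740 | less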
Well, I tried to inspect the relay log to find the transaction right after the last executed one:
SET @@SESSION.GTID_NEXT= '61257650-3280-11e8-b9ad-ac1f6b1eb1ba:268467118'/*!*/;
# at 104200049
#230704 15:44:22 server id 7 end_log_pos 104199963 CRC32 0xe662dc30 Query thread_id=225265222 exec_time=0 error_code=0
SET TIMESTAMP=1688474662/*!*/;
BEGIN
/*!*/;
# at 104200136
#230704 15:44:22 server id 7 end_log_pos 104200034 CRC32 0x12151ffa Table_map: `bea`.`companies_statistics` mapped to number 580
# at 104200207
#230704 15:44:22 server id 7 end_log_pos 104208238 CRC32 0x35bc4f2c Delete_rows: table id 580
# at 104208411
#230704 15:44:22 server id 7 end_log_pos 104216442 CRC32 0x5fccedd3 Delete_rows: table id 580
# at 104216615
#230704 15:44:22 server id 7 end_log_pos 104224646 CRC32 0xd54df9e7 Delete_rows: table id 580
....
....
a lot of Delete_rows
#230704 15:44:22 server id 7 end_log_pos 110771438 CRC32 0x4fa1d116 Delete_rows: table id 580
# at 110771611
#230704 15:44:22 server id 7 end_log_pos 110772502 CRC32 0xd207d93d Delete_rows: table id 580 flags: STMT_END_F
Nothing suspicious, as far as I can see. It’s the last transaction in this relay log file; the next files start with the next transaction…
Hi @Deniska80
Based on the output shared from your relay log, I would guess the following:
- Your table bea.companies_statistics has (or had) a lot of records.
- A DELETE statement was issued on your primary node (maybe it deleted all the records).
- Your binlog_format is configured as ROW on the primary.
Let’s say you had 100 million records in your table and you deleted the whole table with a single statement on the primary; this translates into 100 million row deletes in the binary log (and thus the same number of deletes has to be executed on the replica).
You could also check the binlog sizes on the primary; I’m guessing this one may be bigger than the other binlogs.
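If you want to verify that, something like this on the primary gives a rough count of the per-row deletes that single statement produced, and shows whether one binlog stands out in size (the file name below is the one covering the transaction; adjust paths as needed):

# Count decoded row-delete entries for that table in the binlog
mysqlbinlog --verbose --base64-output=decode-rows mysql-bin.005085 \
    | grep -c '### DELETE FROM `bea`.`companies_statistics`'

# Compare binlog file sizes to spot an unusually large one
ls -lhS /path/to/binlogs/mysql-bin.*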
If the assumptions are correct, then waiting is the best option here; there is not much you can do to speed up your replication.
There could be other approaches, but with added risk.
Best,
Mauricio.
I checked the binlogs on the master; all the files are the same size (differing by only a few kilobytes).
The table is not huge, about 11M records, and that transaction deleted about 300K records.
I have already restored replication on all of my slaves (with a full backup taken with xtrabackup and copied to each slave).
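(For reference, rebuilding a replica from an xtrabackup copy with GTID auto-positioning looks roughly like this; paths are placeholders and the CHANGE MASTER also needs the replication password:)

# Take and prepare a full backup on the master (or a healthy replica)
xtrabackup --backup  --target-dir=/backups/full
xtrabackup --prepare --target-dir=/backups/full

# Copy the backup to the broken replica, stop mysqld there,
# empty the datadir and restore
xtrabackup --copy-back --target-dir=/backups/full

# On the replica: reset GTID state to the backup's position
# (the executed GTID set is recorded in xtrabackup_binlog_info)
mysql> RESET MASTER;
mysql> SET GLOBAL gtid_purged = '<gtid set from xtrabackup_binlog_info>';
mysql> CHANGE MASTER TO MASTER_HOST='db7', MASTER_USER='replicant',
    ->     MASTER_AUTO_POSITION = 1;
mysql> START SLAVE;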
Hello @Deniska80,
I’ll leave a few references here that may be relevant for slow replicas:
For setting up multi-threaded replication (a minimal config sketch follows after the links):
- Can MySQL Parallel Replication Help My Slave?
- A Dive Into MySQL Multi-Threaded Replication - Percona Database Performance Blog
For data archival (purge) we recommend using pt-archiver (see the sketch below):
- Want to archive tables? Use Percona Toolkit's pt-archiver
- pt-archiver — Percona Toolkit Documentation
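A minimal sketch of both, with assumed values to adapt (the worker count and the created_at archive condition are placeholders):

# my.cnf on the replica: multi-threaded replication (MySQL 5.7)
slave_parallel_type         = LOGICAL_CLOCK
slave_parallel_workers      = 4
slave_preserve_commit_order = 1   # needs log-bin and log_slave_updates on the replica

# Purge old rows in small chunks instead of one huge DELETE
pt-archiver --source h=db7,D=bea,t=companies_statistics \
    --where "created_at < NOW() - INTERVAL 90 DAY" \
    --limit 1000 --commit-each --purge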
Thanks,
K.
… As I mentioned before, I left one of my replicas in this “stuck” situation, just in case… and oops, after 41 hours of waiting it started to execute the relayed operations.
It’s really strange to me; I just can’t understand how an operation that takes a few seconds on the master can take almost 2 days on the slaves.
I’ll try to follow the links above.
Thanks to all