Got fatal error 1236 from master when reading data from binary log

Hello.

This is the error I see when running 'show slave status':
Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'bin-1.084438' at 48870187, the last event read from '/data2/binlogs/bin-1.084438' at 123, the last byte read from '/directory/binlogs/bin-1.084438' at 48870206.'

I have 1 master and 2 slaves replicating from it.

One slave has hit this error.
The other slave did not experience this error.

All settings regarding max packet size are the same on ALL nodes:
mysql> show global variables like '%max_allowed%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 1073741824 |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set (0.00 sec)

The only immediate difference I see is that the mysql server version is slightly different:
master: Server version: 5.7.36-39-log Percona Server
slave with NO error: Server version: 5.7.36-39-log Percona Server
slave with error: Server version: 5.7.39-42-log Percona Server

I see some people suggesting setting SQL_SLAVE_SKIP_COUNTER on the slave, but I would rather not do this, as I would like the slave to remain consistent.

Any ideas out there?

Most likely this is due to load on your master server: an event sent from the master might be larger than max_allowed_packet.
Solutions:
Check max_allowed_packet in the configuration of both slave and master. If it is low, increase it on both.
Make sure slave and master both have the same value.
Also, the slave has a parameter called slave_max_allowed_packet. Please check its value; if it is low, you can raise it.
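
The checks above could look roughly like the sketch below. It assumes MySQL / Percona Server 5.7 and an account with the SUPER privilege. 1073741824 (1 GB) is the maximum for both variables, and it is the value the original poster already reports, so in that case there is nothing left to raise.

```sql
-- Run on the master and on each slave: compare the current values.
SHOW GLOBAL VARIABLES LIKE '%max_allowed%';

-- If a node is lower than the others, raise it at runtime.
-- 1073741824 bytes (1 GB) is the upper limit for both settings.
SET GLOBAL max_allowed_packet = 1073741824;
SET GLOBAL slave_max_allowed_packet = 1073741824;
```

SET GLOBAL does not survive a restart, so the same values would also need to go into the [mysqld] section of the configuration file. Existing connections keep their old max_allowed_packet, and it is probably safest to restart the replication threads (STOP SLAVE; START SLAVE;) after changing slave_max_allowed_packet.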

Finally, if none of this works, try to skip past that binary log event. You can always use pt-table-checksum and pt-table-sync afterwards to bring the slave back in sync with the master.
pt-table-sync and pt-table-checksum are two fine tools from Percona; they are mature and reliable.
Please act responsibly: always try everything out in a test environment before using it in production.

Did the replica crash recently?

One possibility is that the binary log entry is corrupted.
Write down the file/pos for current replication, then forget the replication settings (STOP SLAVE; RESET SLAVE ALL;), then set up replication again to continue from where it left off.
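
A minimal sketch of that procedure, assuming classic file/position replication (no GTID auto-positioning) and a single-threaded SQL thread. The host, user, and password are hypothetical placeholders. The file and position are the Relay_Master_Log_File / Exec_Master_Log_Pos values reported later in this thread and are used here only as an illustration; substitute whatever your own slave reports.

```sql
-- On the broken slave: record Relay_Master_Log_File and Exec_Master_Log_Pos.
-- These are the master coordinates of the last event the SQL thread applied.
SHOW SLAVE STATUS;

STOP SLAVE;
RESET SLAVE ALL;  -- discards relay logs AND the stored connection settings

-- Point the slave back at the master, starting from the executed position.
CHANGE MASTER TO
  MASTER_HOST     = 'master.example.com',  -- hypothetical
  MASTER_USER     = 'repl',                -- hypothetical
  MASTER_PASSWORD = 'change-me',           -- hypothetical
  MASTER_LOG_FILE = 'bin-1.084438',        -- Relay_Master_Log_File
  MASTER_LOG_POS  = 765952106;             -- Exec_Master_Log_Pos

START SLAVE;
SHOW SLAVE STATUS;  -- check Slave_IO_Running / Slave_SQL_Running
```

Restarting from the executed position rather than the read position makes the IO thread fetch everything that has not been applied yet. With a multi-threaded slave there may be gaps below Exec_Master_Log_Pos, so treat this as a starting point only.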

Thanks.

I ended up setting a new position, which got us moving again. However, I am going to rebuild the slave, as I unfortunately don't trust its consistency any more. I believe this is a Percona bug, as it appears the master wrote an event larger than max_allowed_packet to the binary log.

Regarding the show slave status command: I did see something that appears to be impossible, and I am looking for some feedback.

At the time of the error, the coordinates in the show slave status output seem impossible. Note that both the IO and SQL threads were in the same binary log file, bin-1.084438.

  • Master_Log_File: bin-1.084438
  • Read_Master_Log_Pos: 48870187
  • Relay_Master_Log_File: bin-1.084438
  • Exec_Master_Log_Pos: 765952106

Exec_Master_Log_Pos is higher than Read_Master_Log_Pos within the same binary log file. I would think that Exec_Master_Log_Pos can never be higher than Read_Master_Log_Pos when both are in the same binary log file; at most the two can be equal. Unless my logic is wrong, it would seem that at some point things got really messed up here and the SQL thread somehow jumped ahead.
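
One way to look into this, sketched here as a suggestion and not a diagnosis, is to ask the master what it has at the position the IO thread was reading from. The file name and offset below are taken from the error message and status output above. The statements assume the binary log is still present on the master and that the account has the REPLICATION SLAVE privilege.

```sql
-- On the master: is the file large enough to contain offset 765952106?
SHOW BINARY LOGS;

-- On the master: list a few events starting at the IO thread's read position.
-- If 48870187 is not the start of an event, this is expected to fail with
-- an error instead of listing events.
SHOW BINLOG EVENTS IN 'bin-1.084438' FROM 48870187 LIMIT 5;
```

For what it is worth, the "last byte read" offset in the error (48870206) is exactly 19 bytes past 48870187, and 19 bytes is the size of the common binlog event header. That would fit the master reading a header at that offset and finding an event length above the limit, which can happen if the offset is not a real event boundary or the entry is corrupted.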

Any thoughts/comments would be welcome. Thanks.