Replication ERROR: Missing transaction in replicas

Hi,

I have come across an error with replication.
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction ‘dedc2c0a-de85-11ec-80bb-48df37590a18:2004018060’ at source log ab_binlog.034612, end_log_pos 553967352; Could not execute Update_rows event on table jamfsoftware.mobile_device_app_deployment_queue; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s source log ab_binlog.034612, end_log_pos 553967352

Set up- One Primary and three replicas.

All three replicas failed at exact same position/update statement.

I looked at binary log of primary, event has been replicated and I see it in relay log of all three replicas.
Replicas has log-replica-updates enabled and binary log of replicas has the event (update statement which is missing on replicas that caused replication error)

Binary log event where replication failed:
#231007 23:49:30 server id 2 end_log_pos 553967352 CRC32 0xab6b23c9 Update_rows: table id 680 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722570463 /* LONGINT meta=0 nullable=0 is_null=0 */

at 553967352

Applied on Primary:
root> select * from jamfsoftware.mobile_device_app_deployment_queue where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722570463 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.01 sec)

On replicas:
root> select * from jamfsoftware.mobile_device_app_deployment_queue where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722297552 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.00 sec)

Missing Event on replicas which is in relaylog too:
#231007 23:45:08 server id 2 end_log_pos 510404732 CRC32 0x010ee88c Update_rows: table id 680 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

at 510404732

Replica Binary log:
#231007 23:45:08 server id 2 end_log_pos 792452634 CRC32 0xc43d68b4 Update_rows: table id 166 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

at 792452634

It is quite weird how this particular SQL update statement is not applied on all three replicas.

Replication variables:
root> show variables like ‘%replic%’;
±----------------------------------------------±------------------------+
| Variable_name | Value |
±----------------------------------------------±------------------------+
| group_replication_consistency | EVENTUAL |
| init_replica | |
| innodb_replication_delay | 0 |
| log_replica_updates | ON |
| log_slow_replica_statements | OFF |
| pseudo_replica_mode | OFF |
| replica_allow_batching | ON |
| replica_checkpoint_group | 512 |
| replica_checkpoint_period | 300 |
| replica_compressed_protocol | OFF |
| replica_enable_event | |
| replica_exec_mode | STRICT |
| replica_load_tmpdir | /path/to/data |
| replica_max_allowed_packet | 1073741824 |
| replica_net_timeout | 60 |
| replica_parallel_type | LOGICAL_CLOCK |
| replica_parallel_workers | 1 |
| replica_pending_jobs_size_max | 134217728 |
| replica_preserve_commit_order | ON |
| replica_skip_errors | OFF |
| replica_sql_verify_checksum | ON |
| replica_transaction_retries | 10 |
| replica_type_conversions | |
| replication_optimize_for_static_plugin_config | OFF |
| replication_sender_observe_commit_only | OFF |
| rpl_stop_replica_timeout | 31536000 |
| skip_replica_start | OFF |
| sql_replica_skip_counter | 0 |
±----------------------------------------------±------------------------+
28 rows in set (0.00 sec)

Hello Chanakya,

Please share server_id and UUID of the primary and replicas.
Also, are you using GTID replication or general replication?

Regards,
Denis Subbota.
Managed Services, Percona.

Primary: dedc2c0a-de85-11ec-80bb-48df37590a18 Serverid- 2

Replica-1; 6bb05d8c-65b1-11ee-9ec1-48df37152f58 Serverid- 3
Replica-2 61e3deb7-6590-11ee-acea-48df37559ff8 serverid- 1
Replica-3 5fb9b2b1-64e1-11ee-aa44-48df371e1de0 serverid- 4

GTID replication.

Hi Team,

Please help to check this issue.

Hi Team,

Please help to identify the issue

Hi Chanakya,
As per the binlogs on Primary and records on Replica, it is evident that there is data mismatch in Primary and replicas. MySQL is looking for exact record i.e. every matching column in replicas.

The record on Primary before update was:

device_id =246625
mobile_device_app_id =28
date_deployed_epoch =1675373156420
current_state =‘Prompting’
last_update_epoch =1696722308447

On replica the record is

device_id =246625
mobile_device_app_id =28
date_deployed_epoch =1675373156420
current_state =‘ ManagedButUninstalled’
last_update_epoch = 1696722297552

There could be two causes for this:
1- An update query was executed on Primary some time back with sql_log_bin=0 which caused the data discrepancy.
2- An update query was executed on all replicas to change the record. If the update query was executed without setting sql_log_bin=0 on replicas, there would be an errant gtid. You can find more about finding it and fixing it on this blog - https://percona.community/blog/2021/11/08/the-errant-gtid-pt1/

To make sure that its not repeated again, I would recommend running pt-table-checksum the cluster to fix the discepancies and then pt-table-sync to fix the discrepancies found.

We dont feel there was any transaction that was directly applied on replica nor was there any transaction that was executed with sql_bin_log off. The transaction that got skipped on replica was there on the binlog of primary.
I feel this has to do with no primary key on the table.
Any issue reported by others with similar issue?