Replication ERROR: Missing transaction in replicas

Hi,

I have come across an error with replication.
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction ‘dedc2c0a-de85-11ec-80bb-48df37590a18:2004018060’ at source log ab_binlog.034612, end_log_pos 553967352; Could not execute Update_rows event on table jamfsoftware.mobile_device_app_deployment_queue; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s source log ab_binlog.034612, end_log_pos 553967352

Set up- One Primary and three replicas.

All three replicas failed at exact same position/update statement.

I looked at binary log of primary, event has been replicated and I see it in relay log of all three replicas.
Replicas has log-replica-updates enabled and binary log of replicas has the event (update statement which is missing on replicas that caused replication error)

Binary log event where replication failed:
#231007 23:49:30 server id 2 end_log_pos 553967352 CRC32 0xab6b23c9 Update_rows: table id 680 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722570463 /* LONGINT meta=0 nullable=0 is_null=0 */

at 553967352

Applied on Primary:
root> select * from jamfsoftware.mobile_device_app_deployment_queue where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722570463 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.01 sec)

On replicas:
root> select * from jamfsoftware.mobile_device_app_deployment_queue where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722297552 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.00 sec)

Missing Event on replicas which is in relaylog too:
#231007 23:45:08 server id 2 end_log_pos 510404732 CRC32 0x010ee88c Update_rows: table id 680 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

at 510404732

Replica Binary log:
#231007 23:45:08 server id 2 end_log_pos 792452634 CRC32 0xc43d68b4 Update_rows: table id 166 flags: STMT_END_F

UPDATE jamfsoftware.mobile_device_app_deployment_queue

WHERE

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */

SET

@1=246625 /* INT meta=0 nullable=0 is_null=0 */

@2=28 /* INT meta=0 nullable=0 is_null=0 */

@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */

@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */

@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */

at 792452634

It is quite weird how this particular SQL update statement is not applied on all three replicas.

Replication variables:
root> show variables like ‘%replic%’;
±----------------------------------------------±------------------------+
| Variable_name | Value |
±----------------------------------------------±------------------------+
| group_replication_consistency | EVENTUAL |
| init_replica | |
| innodb_replication_delay | 0 |
| log_replica_updates | ON |
| log_slow_replica_statements | OFF |
| pseudo_replica_mode | OFF |
| replica_allow_batching | ON |
| replica_checkpoint_group | 512 |
| replica_checkpoint_period | 300 |
| replica_compressed_protocol | OFF |
| replica_enable_event | |
| replica_exec_mode | STRICT |
| replica_load_tmpdir | /path/to/data |
| replica_max_allowed_packet | 1073741824 |
| replica_net_timeout | 60 |
| replica_parallel_type | LOGICAL_CLOCK |
| replica_parallel_workers | 1 |
| replica_pending_jobs_size_max | 134217728 |
| replica_preserve_commit_order | ON |
| replica_skip_errors | OFF |
| replica_sql_verify_checksum | ON |
| replica_transaction_retries | 10 |
| replica_type_conversions | |
| replication_optimize_for_static_plugin_config | OFF |
| replication_sender_observe_commit_only | OFF |
| rpl_stop_replica_timeout | 31536000 |
| skip_replica_start | OFF |
| sql_replica_skip_counter | 0 |
±----------------------------------------------±------------------------+
28 rows in set (0.00 sec)

Hello Chanakya,

Please share server_id and UUID of the primary and replicas.
Also, are you using GTID replication or general replication?

Regards,
Denis Subbota.
Managed Services, Percona.

Primary: dedc2c0a-de85-11ec-80bb-48df37590a18 Serverid- 2

Replica-1; 6bb05d8c-65b1-11ee-9ec1-48df37152f58 Serverid- 3
Replica-2 61e3deb7-6590-11ee-acea-48df37559ff8 serverid- 1
Replica-3 5fb9b2b1-64e1-11ee-aa44-48df371e1de0 serverid- 4

GTID replication.

Hi Team,

Please help to check this issue.