Hi,
I have come across an error with replication.
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction ‘dedc2c0a-de85-11ec-80bb-48df37590a18:2004018060’ at source log ab_binlog.034612, end_log_pos 553967352; Could not execute Update_rows event on table jamfsoftware.mobile_device_app_deployment_queue; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; Can’t find record in ‘mobile_device_app_deployment_queue’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s source log ab_binlog.034612, end_log_pos 553967352
Set up- One Primary and three replicas.
All three replicas failed at exact same position/update statement.
I looked at binary log of primary, event has been replicated and I see it in relay log of all three replicas.
Replicas has log-replica-updates enabled and binary log of replicas has the event (update statement which is missing on replicas that caused replication error)
Binary log event where replication failed:
#231007 23:49:30 server id 2 end_log_pos 553967352 CRC32 0xab6b23c9 Update_rows: table id 680 flags: STMT_END_F
UPDATE jamfsoftware
.mobile_device_app_deployment_queue
WHERE
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */
SET
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722570463 /* LONGINT meta=0 nullable=0 is_null=0 */
at 553967352
Applied on Primary:
root> select * from jamfsoftware
.mobile_device_app_deployment_queue
where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722570463 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.01 sec)
On replicas:
root> select * from jamfsoftware
.mobile_device_app_deployment_queue
where device_id=246625 and mobile_device_app_id=28;
±----------±---------------------±--------------------±----------------------±------------------+
| device_id | mobile_device_app_id | date_deployed_epoch | current_state | last_update_epoch |
±----------±---------------------±--------------------±----------------------±------------------+
| 246625 | 28 | 1675373156420 | ManagedButUninstalled | 1696722297552 |
±----------±---------------------±--------------------±----------------------±------------------+
1 row in set (0.00 sec)
Missing Event on replicas which is in relaylog too:
#231007 23:45:08 server id 2 end_log_pos 510404732 CRC32 0x010ee88c Update_rows: table id 680 flags: STMT_END_F
UPDATE jamfsoftware
.mobile_device_app_deployment_queue
WHERE
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */
SET
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */
at 510404732
Replica Binary log:
#231007 23:45:08 server id 2 end_log_pos 792452634 CRC32 0xc43d68b4 Update_rows: table id 166 flags: STMT_END_F
UPDATE jamfsoftware
.mobile_device_app_deployment_queue
WHERE
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘ManagedButUninstalled’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722297552 /* LONGINT meta=0 nullable=0 is_null=0 */
SET
@1=246625 /* INT meta=0 nullable=0 is_null=0 */
@2=28 /* INT meta=0 nullable=0 is_null=0 */
@3=1675373156420 /* LONGINT meta=0 nullable=0 is_null=0 */
@4=‘Prompting’ /* VARSTRING(93) meta=93 nullable=0 is_null=0 */
@5=1696722308447 /* LONGINT meta=0 nullable=0 is_null=0 */
at 792452634
It is quite weird how this particular SQL update statement is not applied on all three replicas.
Replication variables:
root> show variables like ‘%replic%’;
±----------------------------------------------±------------------------+
| Variable_name | Value |
±----------------------------------------------±------------------------+
| group_replication_consistency | EVENTUAL |
| init_replica | |
| innodb_replication_delay | 0 |
| log_replica_updates | ON |
| log_slow_replica_statements | OFF |
| pseudo_replica_mode | OFF |
| replica_allow_batching | ON |
| replica_checkpoint_group | 512 |
| replica_checkpoint_period | 300 |
| replica_compressed_protocol | OFF |
| replica_enable_event | |
| replica_exec_mode | STRICT |
| replica_load_tmpdir | /path/to/data |
| replica_max_allowed_packet | 1073741824 |
| replica_net_timeout | 60 |
| replica_parallel_type | LOGICAL_CLOCK |
| replica_parallel_workers | 1 |
| replica_pending_jobs_size_max | 134217728 |
| replica_preserve_commit_order | ON |
| replica_skip_errors | OFF |
| replica_sql_verify_checksum | ON |
| replica_transaction_retries | 10 |
| replica_type_conversions | |
| replication_optimize_for_static_plugin_config | OFF |
| replication_sender_observe_commit_only | OFF |
| rpl_stop_replica_timeout | 31536000 |
| skip_replica_start | OFF |
| sql_replica_skip_counter | 0 |
±----------------------------------------------±------------------------+
28 rows in set (0.00 sec)