Async replica on PRIMARY server fails with HA_ERR_KEY_NOT_FOUND only when replicating to InnoDB Cluster (Group Replication)

I am trying to migrate from 8.0 with traditional async replication to an InnoDB Cluster.

To facilitate this, I'm doing the following:

  1. restoring a PS 8.0 xtrabackup backup of one of the existing replicas onto a PS 8.4.4 test server
  2. adding users and permissions for InnoDB Cluster across three servers
  3. creating a new InnoDB cluster from node1, then adding the two new instances via "clone"
  4. on the primary server, using CHANGE REPLICATION SOURCE with SOURCE_AUTO_POSITION=1 to connect it to the upstream database server (sketched below)
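For reference, step 4 is roughly the following (the host, user, and password are placeholders, not my real values):

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'upstream.example.com',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;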

Shortly after step 4, the replication SQL (applier) thread fails on a DELETE event with HA_ERR_KEY_NOT_FOUND. It is always the same table.
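This is how I'm checking the failure (standard status commands; the per-worker table gives the failing transaction's GTID):

SHOW REPLICA STATUS\G
SELECT CHANNEL_NAME, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_APPLIED_TRANSACTION
  FROM performance_schema.replication_applier_status_by_worker\G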

If I don’t add the cluster and just set up a replica, no problem. Replication will run for days.

After three days of successful replication to the stand-alone node, I ran RESET REPLICA ALL, set that node up as the seed, built the cluster, and then used CHANGE REPLICATION SOURCE with SOURCE_AUTO_POSITION=1 again. The primary server runs for a bit and then fails on a DELETE event in the same table.
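Concretely, the reset-and-retry sequence on that node is a sketch like this (cluster name and addresses are placeholders):

STOP REPLICA;
RESET REPLICA ALL;
-- then, in MySQL Shell on this node:
--   var cluster = dba.createCluster('testCluster')
--   cluster.addInstance('node2:3306', {recoveryMethod: 'clone'})
--   cluster.addInstance('node3:3306', {recoveryMethod: 'clone'})
-- then reconnect to the upstream:
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'upstream.example.com',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;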

I'm trying to figure out whether I am doing anything wrong with regard to InnoDB Cluster/GR. It is the only variable in the mix.

If it is of any interest, here is the DDL for the table that generates the HA_ERR_KEY_NOT_FOUND:
CREATE TABLE `cne_job` (
  `CNEJOBID` int unsigned NOT NULL AUTO_INCREMENT,
  `CNEBATCHJOBID` int unsigned DEFAULT NULL,
  `SUBID` int unsigned DEFAULT NULL,
  `category` enum('VLB','VKE') CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `service` enum('qm','randal','arkenstone','administer') DEFAULT NULL,
  `service_job_id` int DEFAULT NULL,
  `service_job_name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `action` enum('CREATE','READ','UPDATE','DELETE') CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `state` enum('ENQUEUE','PROCESSING','COMPLETE') CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL DEFAULT 'ENQUEUE',
  `error` tinyint(1) NOT NULL DEFAULT '0',
  `request_payload` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `response_payload` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `group` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `service_method` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `queue_id` int unsigned DEFAULT NULL,
  `trigger` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `queue_resp` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `NODE_SUBID` int unsigned DEFAULT NULL,
  PRIMARY KEY (`CNEJOBID`),
  KEY `CNEBATCHJOBID` (`CNEBATCHJOBID`),
  KEY `SUBID` (`SUBID`),
  KEY `service_method` (`service_method`,`service`,`state`)
) ENGINE=InnoDB AUTO_INCREMENT=2381921 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

I would do this again. And after it's been running for a while, bootstrap this replica with GR (i.e., SET GLOBAL group_replication_bootstrap_group=ON; START GROUP_REPLICATION; and be sure to set the other GR parameters too).

This should start GR with a single member. Is async still running? Any errors? If not, let this run for a while. If everything still works after some time, bring in member #2 using the CLONE plugin. See what happens.
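A minimal sketch of that bootstrap, assuming the GR plugin and the usual group variables (group name, local address, seeds) are already configured:

SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
-- confirm a single ONLINE member and that the async channel is still applying:
SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;
SHOW REPLICA STATUS\G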

I did as you suggested. Everything is OK with a single node, even one configured for group replication. Introducing a second node results in replication failure.

I checked the relay log and found the row being deleted in a multi-row DELETE event. I took the first row, checked the database, and it is indeed missing. I then checked the binary log on the first node (the one running async replication).
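For reference, I decoded the events with mysqlbinlog along these lines (the file name is a placeholder):

mysqlbinlog --verbose --base64-output=DECODE-ROWS mysql-bin.000123 | less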

I found the following:

#250519 17:57:19 server id 4010  end_log_pos 217759868 CRC32 0x40451228         GTID    last_committed=60463    sequence_number=60537   rbr_only=yes    original_committed_timestamp=1747677440770521   immediate_commit_timestamp=1747677440780698     transaction_length=771427
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=1747677440770521 (2025-05-19 17:57:20.770521 UTC)
# immediate_commit_timestamp=1747677440780698 (2025-05-19 17:57:20.780698 UTC)
/*!80001 SET @@session.original_commit_timestamp=1747677440770521*//*!*/;
/*!80014 SET @@session.original_server_version=80404*//*!*/;
/*!80014 SET @@session.immediate_server_version=80404*//*!*/;
SET @@SESSION.GTID_NEXT= 'fe3d8a69-342c-11f0-86a6-5600056de5a1:71'/*!*/;
# at 217759868
#250519 17:57:19 server id 4010  end_log_pos 217759957 CRC32 0x6b121ebb         Query   thread_id=158   exec_time=0     error_code=0
SET TIMESTAMP=1747677439/*!*/;
SET @@session.sql_auto_is_null=1/*!*/;
SET @@session.sql_mode=1149239296/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=2/*!*/;
BEGIN
/*!*/;
# at 217759957
#250519 17:57:19 server id 4010  end_log_pos 217760058 CRC32 0x4f294b4a         Table_map: `vultr`.`cne_job` mapped to number 895
# has_generated_invisible_primary_key=0
# at 217760058
#250519 17:57:19 server id 4010  end_log_pos 217768222 CRC32 0xc69355f7         Delete_rows: table id 895
...
#250519 17:57:19 server id 4010  end_log_pos 218531177 CRC32 0x4cacdf3c         Delete_rows: table id 895 flags: STMT_END_F
### DELETE FROM `vultr`.`cne_job`
### WHERE
###   @1=2291540
...

The DELETE is coming from server 4010 (the GR primary node) and carries a GTID different from the one being executed by the replica: 6ec10a05-f0ff-11ee-a42f-ac1f6b643f90:1249830863.
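This is how I compared the two (standard queries on the replica):

SELECT @@GLOBAL.gtid_executed\G
SELECT CHANNEL_NAME, RECEIVED_TRANSACTION_SET
  FROM performance_schema.replication_connection_status\G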

I know for certain that nothing manually deleted rows on the GR node.

I think this is enough information to file a bug, though I don't know how to reproduce it without a backup from our server, which I can't provide.