Dear All,
We have a 3-node Group Replication cluster (single-primary), running Percona Server 8.0.29-21 on Debian 11 VMs.
A few days ago we performed a routine maintenance operation (resizing the disks). We detached and reattached the slaves one by one, and then promoted one of them to master so that we could resize the third volume as well.
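For clarity, the per-node procedure was roughly the following (commands paraphrased from memory, so please treat this as a sketch rather than an exact transcript):

  -- On each slave, one at a time:
  STOP GROUP_REPLICATION;    -- take the node out of the group
  -- shut down mysqld, detach/resize/reattach the volume, start mysqld again
  START GROUP_REPLICATION;   -- rejoin and let distributed recovery catch the node up

  -- Before touching the last node, promote one of the already-resized slaves:
  SELECT group_replication_set_as_primary('<server_uuid of the new master>');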
When the last node (the former master) rejoined the cluster as a slave, it failed with the error "Slave SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction".
The next log line read "Slave SQL for channel 'group_replication_recovery': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically", so we gave that a try. The restart worked, and we assumed the node was OK.
Fast forward a few days, and the same node hit another storage problem: it ran out of disk space because of a misconfiguration on our side.
Now the dreaded "Slave SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction" error has returned, and no amount of restarts will fix it.
Log excerpt:
Mar 09 16:38:07 mysql-cluster-red-1 mysqld[1555626]: Slave I/O thread for channel 'group_replication_recovery': connected to master 'replicator@mysql-cluster-red-2:3306',replication started in log 'FIRST' at position 4
Mar 09 16:38:29 mysql-cluster-red-1 mysqld[1555626]: Slave SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction '4f1ab3dc-d037-4dc5-a5f5-d66eee3fc61a:76248523' at master log , end_log_pos 338219758; Could not execute Delete_rows event on table zabbix_A.history; Can't find record in 'history', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 338219758, Error_code: MY-001032
Mar 09 16:38:29 mysql-cluster-red-1 mysqld[1555626]: Slave SQL for channel 'group_replication_recovery': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756
Mar 09 16:38:29 mysql-cluster-red-1 mysqld[1555626]: 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='mysql-cluster-red-2', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='mysql-cluster-red-3', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
Mar 09 16:38:29 mysql-cluster-red-1 mysqld[1555626]: Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
Mar 09 16:38:29 mysql-cluster-red-1 mysqld[1555626]: Slave I/O thread for channel 'group_replication_recovery': connected to master 'replicator@mysql-cluster-red-3:3306',replication started in log 'FIRST' at position 4
Mar 09 16:38:33 mysql-cluster-red-1 mysqld[1555626]: Slave SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction '4f1ab3dc-d037-4dc5-a5f5-d66eee3fc61a:76248523' at master log , end_log_pos 64529304; Could not execute Delete_rows event on table zabbix_A.history; Can't find record in 'history', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 64529304, Error_code: MY-001032
Mar 09 16:38:33 mysql-cluster-red-1 mysqld[1555626]: Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-binlog.000107' position 64527523
Mar 09 16:38:33 mysql-cluster-red-1 mysqld[1555626]: Plugin group_replication reported: 'Maximum number of retries when trying to connect to a donor reached. Aborting group replication incremental recovery.'
Mar 09 16:38:33 mysql-cluster-red-1 mysqld[1555626]: Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
Mar 09 16:38:33 mysql-cluster-red-1 mysqld[1555626]: Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
Mar 09 16:38:37 mysql-cluster-red-1 mysqld[1555626]: Plugin group_replication reported: 'Group membership changed: This member has left the group.'
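In case it is useful to anyone answering: to gauge how far this node has drifted, we would compare GTID sets between the failed node and a healthy member, along these lines (a sketch only, with the healthy node's value pasted in by hand):

  -- Run on both mysql-cluster-red-1 and a healthy member, then compare:
  SELECT @@GLOBAL.gtid_executed;

  -- On the failed node: 1 means it is merely behind (no errant transactions),
  -- 0 would mean it has executed transactions the healthy member does not have:
  SELECT GTID_SUBSET(@@GLOBAL.gtid_executed,
                     '<gtid_executed copied from mysql-cluster-red-2>') AS subset_of_healthy;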
When a node in a GR cluster has drifted from the rest like this, what is the recommended best practice? Should we simply delete the data on this node and issue START GROUP_REPLICATION afterwards, along the lines of the sketch below?
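To make the question concrete, this is the kind of rebuild we have in mind for mysql-cluster-red-1 (a sketch only; the recovery user is the 'replicator' account visible in the log above, and we are not sure all of these steps are needed or sufficient):

  STOP GROUP_REPLICATION;
  -- wipe and re-initialize the data directory (or restore a fresh backup taken
  -- from the current master), re-apply the group_replication configuration, then:
  CHANGE MASTER TO MASTER_USER='replicator', MASTER_PASSWORD='***'
    FOR CHANNEL 'group_replication_recovery';
  START GROUP_REPLICATION;

If there is a cleaner or better-supported way to re-provision a diverged member, we would gladly follow that instead.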
Thank you in advance for any pointers.
Regards,
Corrado Fiore