Auto Cloning for Distributed Recovery is not working in a MySQL GR cluster

Hi Team,

I have a 3-node MySQL GR (Group Replication) single-primary cluster in a k8s environment.

Sample:

mysql> select MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------------------------------------------------------+--------------+-------------+
| MEMBER_HOST                                                                          | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------------------------------------------------------+--------------+-------------+
| mysql-gr-auto-test-0.xxxx.svc.cluster.local | ONLINE       | PRIMARY     |
| mysql-gr-auto-test-1.xxxx.svc.cluster.local | ONLINE       | SECONDARY   |
| mysql-gr-auto-test-2.xxxx.svc.cluster.local | ONLINE       | SECONDARY   |
+--------------------------------------------------------------------------------------+--------------+-------------+

Percona Server for MySQL version: 8.0.35-27

mysql> show plugins;
+----------------------------------+----------+--------------------+----------------------+---------+
| Name                             | Status   | Type               | Library              | License |
+----------------------------------+----------+--------------------+----------------------+---------+
| group_replication                | ACTIVE   | GROUP REPLICATION  | group_replication.so | GPL     |
| clone                            | ACTIVE   | CLONE              | mysql_clone.so       | GPL     |
+----------------------------------+----------+--------------------+----------------------+---------+

I have completed the end-to-end setup for the clone plugin on all 3 nodes.
The problem is that automatic distributed recovery (via the clone plugin) is not working, but when I run the clone manually with the same user it works as expected.

Can you please help me identify what is missing here? All the variables and user privileges appear to be set correctly, so why doesn't the automatic clone-based rebuild happen when it is needed?

User Privilege details:
mysql> show grants for 'repl_usr'@'%';
+---------------------------------------------------------------------------------------------------+
| Grants for repl_usr@%                                                                             |
+---------------------------------------------------------------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO `repl_usr`@`%`                                                  |
| GRANT BACKUP_ADMIN,CLONE_ADMIN,CONNECTION_ADMIN,GROUP_REPLICATION_STREAM ON *.* TO `repl_usr`@`%` |
+---------------------------------------------------------------------------------------------------+
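
For completeness, a user with these grants is created along these lines on each node (so any member can act as clone donor or recipient); the password here is only a placeholder:

    -- Sketch only; run on every node so any member can be a clone donor or recipient.
    CREATE USER IF NOT EXISTS 'repl_usr'@'%' IDENTIFIED BY 'xxxx';
    GRANT REPLICATION SLAVE ON *.* TO 'repl_usr'@'%';
    GRANT BACKUP_ADMIN, CLONE_ADMIN, CONNECTION_ADMIN, GROUP_REPLICATION_STREAM ON *.* TO 'repl_usr'@'%';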

Sample manual command that works as expected:
CLONE INSTANCE FROM 'repl_usr'@'mysql-gr-auto-test-0.xxxx.svc.cluster.local':3306 IDENTIFIED BY 'xxxx';

Part of the MySQL conf file:

    plugin-load-add=mysql_clone.so
    clone=FORCE_PLUS_PERMANENT
    clone_delay_after_data_drop=5
    clone_valid_donor_list='mysql-gr-auto-test-0.xxxx.svc.cluster.local:3306,mysql-gr-auto-test-1.xxxx.svc.cluster.local:3306,mysql-gr-auto-test-2.mysql-gr-auto-test.xxxx.svc.cluster.local:3306'
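
A quick sketch of how these settings and any clone activity can be double-checked at runtime on a node (standard variables and the clone plugin's performance_schema tables):

    -- Clone-related settings actually in effect on the node
    SHOW GLOBAL VARIABLES LIKE 'clone%';
    SHOW GLOBAL VARIABLES LIKE 'group_replication_clone_threshold';

    -- What the clone plugin recorded for the last (manual or automatic) clone attempt
    SELECT STATE, SOURCE, ERROR_NO, ERROR_MESSAGE FROM performance_schema.clone_status;
    SELECT STAGE, STATE, ESTIMATE, DATA FROM performance_schema.clone_progress;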

Let me know if further info is needed here.

What is your value for group_replication_clone_threshold?
Did you configure the group_replication_recovery channel?

What do the error logs say when you START GROUP_REPLICATION?
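
For reference, both can be checked directly, e.g.:

    -- Clone threshold currently in effect
    SELECT @@GLOBAL.group_replication_clone_threshold;

    -- Settings stored for the recovery channel
    SELECT CHANNEL_NAME, HOST, PORT, USER
      FROM performance_schema.replication_connection_configuration
     WHERE CHANNEL_NAME = 'group_replication_recovery'\G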

Hi Matt,

I have tried the default value for group_replication_clone_threshold as well as values as low as 1 and 1000; the auto clone rebuild does not kick in in either case.

Yes, the group_replication_recovery channel is correctly set up; below is the command used:
CHANGE REPLICATION SOURCE TO SOURCE_USER='repl_usr', SOURCE_PASSWORD='xxxx' FOR CHANNEL 'group_replication_recovery';

Run on one of the secondaries:
mysql> show slave status for channel 'group_replication_recovery'\G

.....
....
...
..
.
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-1637
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: group_replication_recovery
           Master_TLS_Version: TLSv1.2,TLSv1.3
       Master_public_key_path:
        Get_master_public_key: 0
            Network_Namespace:
1 row in set, 1 warning (0.00 sec)

Regarding the error, there is nothing specific in the output after the START GROUP_REPLICATION command; we only see that the member is isolated and goes to the OFFLINE state.

Expected log line:

...
Distributed recovery will transfer data using: Cloning from a remote group donor.


Current error:
mysql> start group_replication;
ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.


POD logs:

2024-07-18T07:45:19.572415Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7500, f2b5bf33-44cb-11ef-b3ff-227f4b055a25:1-2 > Group transactions: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7597'
2024-07-18T07:45:19.572447Z 0 [ERROR] [MY-011522] [Repl] Plugin group_replication reported: 'The member contains transactions not present in the group. The member will now exit the group.'
2024-07-18T07:45:19.572478Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to mysql-gr-auto-test-0.xxxx.svc.cluster.local:3306, mysql-gr-auto-test-1.xxxx.svc.cluster.local:3306, mysql-gr-auto-test-2.xxxx.svc.cluster.local:3306 on view 17212828721342461:5.'
2024-07-18T07:45:22.720055Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2024-07-18T07:45:22.721055Z 619 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
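
For reference, the divergence the log complains about can be confirmed on the failing member with a GTID check along these lines (the group GTID set here is just the example value from the log above):

    -- Run on the member that fails to join; the second argument is the group's
    -- gtid_executed as reported by the current primary (example value from the log).
    SELECT GTID_SUBSET(@@GLOBAL.gtid_executed,
                       'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7597') AS joiner_is_subset_of_group;

It returns 1 only when the local GTID set is fully contained in the group's.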

Before:
mysql> select MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;

+--------------------------------------------------------------------------------------+--------------+-------------+
| MEMBER_HOST                                                                          | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------------------------------------------------------+--------------+-------------+
| mysql-gr-auto-test-0.xxxx.svc.cluster.local | ONLINE       | PRIMARY     |
| mysql-gr-auto-test-1.xxxx.svc.cluster.local | ONLINE       | SECONDARY   |
| mysql-gr-auto-test-2.xxxx.svc.cluster.local | ONLINE       | SECONDARY   |
+--------------------------------------------------------------------------------------+--------------+-------------+


After:
mysql> select MEMBER_HOST,MEMBER_STATE,MEMBER_ROLE from performance_schema.replication_group_members;
+--------------------------------------------------------------------------------------+--------------+-------------+
| MEMBER_HOST                                                                          | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------------------------------------------------------+--------------+-------------+
| mysql-gr-auto-test-2.xxxx.svc.cluster.local | OFFLINE      |             |
+--------------------------------------------------------------------------------------+--------------+-------------+


Let me know if further details are needed here.

@pravata_dash are you using the Operator? If you are, please provide the version and cr.yaml to reproduce the problem.

Hi Sergey,

Currently, there is no operator in the picture; it's a simple 3-node single-primary GR cluster.
Sample:

kcl get po -w | grep -i test
mysql-gr-auto-test-0                               2/2     Running            0                  7h31m
mysql-gr-auto-test-1                               2/2     Running            0                  7h30m
mysql-gr-auto-test-2                               2/2     Running            0                  7h29m

The thought process here is that, in most scenarios, GR should auto-heal itself: the clone plugin should kick in and handle most of the automatic rebuild flows.
Sample scenarios where we expect the GR clone to do an automatic rebuild:

Large State Difference Between Nodes
Data Corruption on a Node
New Node Joining the Cluster
Manual Removal of Data Directory
Etc

But in none of these cases does the auto clone happen, whereas manually it works fine.

Let me see if our MySQL experts have some suggestions.

From a k8s perspective, try the operator instead; we have auto-recovery figured out there.

@Ege_Gunes @Marco.Tusa if something comes to mind about why clone might not kick in on k8s (no operator), please share.

@pravata_dash from the error you have reported:

2024-07-18T07:45:19.572415Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7500, f2b5bf33-44cb-11ef-b3ff-227f4b055a25:1-2 > Group transactions: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-7597'
2024-07-18T07:45:19.572447Z 0 [ERROR] [MY-011522] [Repl] Plugin group_replication reported: 'The member contains transactions not present in the group. The member will now exit the group.'

It is clear that the node trying to join the group has more transactions than the group itself,
so the problem is not in the clone but in the transactions executed on the joining node.
Clean that up and try to rejoin.
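
For example, something along these lines (a sketch reusing the manual clone that already works for you; the password is a placeholder):

    -- Sketch: re-seed the diverged member from a healthy donor, which replaces its
    -- local data (and the extra GTIDs) with the donor's copy, then rejoin the group.
    STOP GROUP_REPLICATION;
    CLONE INSTANCE FROM 'repl_usr'@'mysql-gr-auto-test-0.xxxx.svc.cluster.local':3306 IDENTIFIED BY 'xxxx';
    -- the server restarts (or needs to be restarted) once the clone completes; then:
    START GROUP_REPLICATION;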

Thanks for the clarification, @Marco.Tusa.
So the auto clone will kick in for those cases (large data diff, data loss, etc.) only if the donor and receiver are in sync, or the receiver/joiner has fewer transactions than the source/primary.

Are there any additional details regarding when auto-clone will or will not take place that I should know?