Replication stopped working

Hi all,

I am working on a proof of concept of the Percona Operator (running Percona XtraDB Cluster on OpenShift). On cluster03 I created a PXC cluster called cluster1 and set isSource: true to make this database the source for replication. I also installed the Percona Operator on two other OpenShift clusters, cluster01 and cluster02, and used isSource: false with the proper replicationChannels and sourcesList. Everything worked as expected: as a small test I created a database called pacodb with a table called persons inside it, then verified on the target databases that pacodb and persons were there. I was also able to see the corresponding information in the logs.
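For reference, the channel definition on the replica clusters is roughly along these lines in the custom resource (a trimmed sketch; the weight value is just illustrative, and the host is the redacted ELB endpoint from the status output below):

spec:
  pxc:
    replicationChannels:
      - name: awspxc1_to_bmhpxc1
        isSource: false
        sourcesList:
          - host: XXXXXXX-XXXX.us-west-2.elb.amazonaws.com
            port: 3306
            weight: 100

On cluster03 (the source side) the same channel is declared with isSource: true.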
I asked another member of my team to restore a WordPress database, and then replication stopped working with the following error messages in the logs:

SHOW SLAVE STATUS:
+------------------------+-----------------------+-------------------+
| Queueing source event to the relay log | XXXXXXX-XXXX.us-west-2.elb.amazonaws.com | replication | 3306 | 60 | binlog.006634 | 197 | cluster1-pxc-0-relay-bin-awspxc1_to_bmhpxc1.005278 | 407 | binlog.005345 | Yes | No | | | | | | | 1008 | Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '8a83d4e4-9db9-11ed-874b-0feb932eef43:150' at master log binlog.005345, end_log_pos 426. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. | 0 | 197 | 81012563 | None | | 0 | No | | | | | | NULL | Yes | 0 | | 1008 | Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '8a83d4e4-9db9-11ed-874b-0feb932eef43:150' at master log binlog.005345, end_log_pos 426. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. | | 12115430 | 870c8fa7-9db9-11ed-acfc-0a580a220828 | mysql.slave_master_info | 0 | NULL | | 5 | | | 230203 18:19:35 | | | 8a83d4e4-9db9-11ed-874b-0feb932eef43:1-11:150:152:1080-1081:3356-3470 | 8a83d4e4-9db9-11ed-874b-0feb932eef43:1-11,
ba08841c-a3e1-11ed-b7b6-0a580a810275:1-3,
bc47e5d4-a3e1-11ed-b77a-529e83acd078:1-4 | 1 | | awspxc1_to_bmhpxc1 | | | 0 | |
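The error above points at performance_schema.replication_applier_status_by_worker; a query like the following (plain MySQL, nothing Operator-specific) should show the same worker failure in more detail for this channel:

SELECT CHANNEL_NAME, WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'awspxc1_to_bmhpxc1';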

From the PXC pods, in /var/lib/mysql:
bash-4.4$ grep -i wordpress mysqld-error.log.1
2023-02-03T16:44:46.512008Z 29 [Note] [MY-000000] [WSREP] Executing Query (drop database wordpress_stage) with write-set (-1) and exec_mode: local in TO Isolation mode
2023-02-03T16:44:46.513706Z 29 [Note] [MY-000000] [WSREP] Query (drop database wordpress_stage) with write-set (37) and exec_mode: toi replicated in TO Isolation mode
2023-02-03T16:44:46.513826Z 29 [Note] [MY-000000] [WSREP] ha_rollback_trans(29, FALSE) rolled back: drop database wordpress_stage: XXCan't drop database 'wordpress_stage'; database doesn't exist;
2023-02-03T16:44:46.513844Z 29 [Note] [MY-000000] [WSREP] TO END: 37: drop database wordpress_stage
2023-02-03T16:44:46.515387Z 29 [Note] [MY-000000] [WSREP] Error buffer for thd 29 seqno 37, 87 bytes: ' Can't drop database 'wordpress_stage'; database doesn't exist, Error_code: 1008;'
2023-02-03T16:44:46.515682Z 0 [Note] [MY-000000] [Galera] Member 1(cluster1-pxc-0) initiates vote on bc47e5d4-a3e1-11ed-b77a-529e83acd078:37,c03e4cab91ad9bf4: Can't drop database 'wordpress_stage'; database doesn't exist, Error_code: 1008;
2023-02-03T16:44:46.517055Z 0 [Note] [MY-000000] [Galera] Member 2(cluster1-pxc-1) initiates vote on bc47e5d4-a3e1-11ed-b77a-529e83acd078:37,c03e4cab91ad9bf4: Can't drop database 'wordpress_stage'; database doesn't exist, Error_code: 1008;
2023-02-03T16:44:46.518036Z 29 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'awspxc1_to_bmhpxc1': Worker 1 failed executing transaction '8a83d4e4-9db9-11ed-874b-0feb932eef43:150' at master log binlog.005345, end_log_pos 426; Error 'Can't drop database 'wordpress_stage'; database doesn't exist' on query. Default database: 'wordpress_stage'. Query: 'drop database wordpress_stage', Error_code: MY-001008
2023-02-03T16:44:46.518076Z 29 [Note] [MY-000000] [WSREP] ha_rollback_trans(29, FALSE) rolled back: (null): XXCan't drop database 'wordpress_stage'; database doesn't exist;
2023-02-03T16:44:46.518082Z 29 [Note] [MY-000000] [WSREP] ha_rollback_trans(29, TRUE) rolled back: (null): XXCan't drop database 'wordpress_stage'; database doesn't exist;
2023-02-03T16:44:46.518088Z 29 [Note] [MY-000000] [WSREP] ha_rollback_trans(29, FALSE) rolled back: (null): XXCan't drop database 'wordpress_stage'; database doesn't exist;
bash-4.4$

In brief, I understand that replication stopped working mainly because of this: "Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '8a83d4e4-9db9-11ed-874b-0feb932eef43:150' at master log binlog.005345, end_log_pos 426. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any." My guess is that on cluster01 and cluster02 the replica databases are trying to drop a database that is in fact not there: "Can't drop database 'wordpress_stage'; database doesn't exist, Error_code: 1008;"
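For completeness, one option I have considered but not tried (and I am not sure how it interacts with the Operator and Galera) is the standard MySQL way of skipping a single failing GTID transaction: committing an empty transaction with that GTID on the replica side, something like:

STOP REPLICA FOR CHANNEL 'awspxc1_to_bmhpxc1';
SET GTID_NEXT = '8a83d4e4-9db9-11ed-874b-0feb932eef43:150';  -- the failing transaction from the error above
BEGIN; COMMIT;  -- empty transaction occupies that GTID
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA FOR CHANNEL 'awspxc1_to_bmhpxc1';

But I would rather confirm the recommended path before touching GTIDs by hand.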

As I mentioned, we are planning to implement the Percona Operator for our production applications, and our management is now in talks with Percona about support pricing and the best setup. In the meantime, how can we solve this situation? What is the recommended action path?

Thanks in advance!
Paco

Hi @Francisco_Ruben_Jime,

We need more information about your deployment, e.g., the PXC version and the Operator version as well. In the meantime, one action path I can suggest is to check the replica_skip_errors option.
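For example, something along these lines should skip error 1008 ("Can't drop database; database doesn't exist") on the replica side. This is only a sketch, assuming you manage the mysqld settings through the cluster custom resource; note that replica_skip_errors is read-only, so it has to go into the configuration and the PXC pods need to be restarted to pick it up:

spec:
  pxc:
    configuration: |
      [mysqld]
      # 1008 = ER_DB_DROP_EXISTS ("Can't drop database; database doesn't exist")
      replica_skip_errors=1008

Keep in mind that skipping errors can mask real data drift between the source and the replicas, so it is usually safer to fix the underlying mismatch (e.g., make sure wordpress_stage exists, or is absent, consistently on all clusters) and treat error skipping as a temporary measure.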