Percona Cluster abnormality (version 5.7)

Hi Team,

We are currently running a 5-node cluster: nodes A, B, C, D, E
wsrep_cluster_address=gcomm://A,B,C,D,E (wsrep.cnf file for nodes in cluster)

We later decided to remove node E from the cluster, and nothing was noticeably affected.
To do this, we removed the IP address of node E from the wsrep.cnf file on all nodes in the cluster.
wsrep_cluster_address=gcomm://A,B,C,D (wsrep.cnf file for nodes in cluster)

After 2 days:

We then tried to add node E back into the cluster and remove node D, as node D had some performance issues.

We stopped the mysql service on node D, as we wanted to remove node D from the cluster.
Then we added the IP address of node E to its wsrep.cnf file:
wsrep_cluster_address=gcomm://A,B,C,E
We then made the same IP change in the wsrep.cnf file on nodes C, B, and A.
After making the wsrep IP changes on nodes A, B, C, and E:
We first restarted the mysql service on node A, and it started successfully.
Then we restarted the mysql service on node E; after some time the service failed to start, and the data in the data directory on node E was lost.
Then we restarted the mysql service on node C, and we lost the data in its data directory.
Similarly, after a mysql service restart we lost the data in the data directory on node B.

Did we follow the wrong process to add a node to the cluster, or is there a bug in Percona Cluster version 5.7?

Please provide complete and clear information so that we do not run into the kind of data loss we faced earlier.

Awaiting your response.

Thank You !

Hey there @rahul_ambekar ,

You should have completely halted all further operations at this point and you should have determined why this failed. Proceeding without understanding an error will send you down a very bad path, which you discovered.

Go look at E’s logs. Why did starting mysql fail? I’m fairly certain that the cause of E’s failure is the same as for the others.
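For example, something like this would surface the relevant lines (the error log path is an assumption; check log_error in your my.cnf):

# On node E: pull the last ERROR and SST-related lines from the MySQL error log
grep -iE 'ERROR|WSREP_SST' /var/log/mysql/error.log | tail -n 50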

It is not required that you modify the gcomm address in every wsrep.cnf to add/remove servers from your cluster. This URI does not define cluster membership; it only needs to list some members. Other members will be auto-discovered.
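For example (node names are the placeholders from your posts; this is only a sketch), each node’s wsrep.cnf only needs enough peers listed to find the cluster at startup:

# /etc/mysql/wsrep.cnf -- sketch, not your actual file
# Listing a subset of nodes is sufficient; the joining node learns the full
# membership from whichever peer it reaches first.
wsrep_cluster_address=gcomm://A,B,C
# An empty address (gcomm://) is only used when bootstrapping a brand-new cluster.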

5.7 has been dead for many years, and bugs for this version are not being fixed. You should consider upgrading to a version that is being maintained (8.0).

Dear Matthew,

I hope you’re doing well.

Thank you very much for your quick response to my last post.
We are currently running a 3-node Percona cluster in our non-production environment using Percona Cluster 5.7. We have noticed that when we restart the MySQL service on one of the secondary nodes (while the MySQL services on the other two nodes remain operational), we sometimes experience data loss on the node that was restarted.

This issue has been occurring frequently, and we are concerned it might be related to a bug in Percona Cluster 5.7. Can you please provide guidance on how to prevent this data loss, and let us know if there are any known bugs or issues with this version that could be contributing to the problem?

Your advice and recommendations would be greatly appreciated.

Thank you for your assistance.

Best regards,
Rahul Ambekar.

What evidence do you have of this? Be specific; show examples.

5.7 is old/dead. You need to upgrade for bug fixes.

You have only described your issue. You have not shown any log files or anything else to help diagnose it. I cannot help without more information.

Dear Matthew,

Thank you for your response. I apologize for not providing sufficient information earlier.
Please find below the error log that is occurring in our secondary cluster nodes:

State Transfer donor: Resource temporarily unavailable
2024-09-18T17:56:38.370132Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2024-09-18T17:56:38.372045Z WSREP_SST: [ERROR] Possible timeout in receiving first data from donor in gtid/keyring stage
2024-09-18T17:56:38.374024Z WSREP_SST: [ERROR] ******************************************************
2024-09-18T17:56:38.376320Z WSREP_SST: [ERROR] Cleanup after exit with status:32
2024-09-18T17:56:38.389497Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.0.1.44' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '3537509' --mysqld-version '5.7.43-47-57' '' : 32 (Broken pipe)
2024-09-18T17:56:38.389511Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2024-09-18T17:56:38.389513Z 0 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2024-09-18T17:56:38.389563Z 0 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2024-09-18T17:56:38.389569Z 0 [ERROR] Aborting

This issue occurs when we restart the mysql service on any one of the secondary nodes.
If you need any additional information, such as configuration files or a more detailed log, please let me know.
Thank you for your help in resolving this issue.

Thanks and Regards,
Rahul Ambekar

Check that SELinux is in permissive mode (or disabled). Check that ports 4444, 4567, 4568, and 3306 are open between all nodes.
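For example, you could verify both with something like this on each node (10.0.1.44 is the address from your logs; substitute each peer in turn):

# SELinux mode -- should print Permissive or Disabled
getenforce

# Check that the Galera/SST ports are reachable from this node to a peer
for port in 3306 4444 4567 4568; do
    nc -z -w 3 10.0.1.44 "$port" && echo "port $port open" || echo "port $port blocked"
done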

Always. Error logs from both donor and joiner.

Hi Matthew,

SELinux is disabled, and the necessary ports (4444, 4567, 4568, 3306) are open. However, we’re seeing the following error in the logs from the master primary node:

--------------- innobackup.backup.log (END) ----------------------
2024-10-23T02:55:24.646324Z WSREP_SST: [ERROR] ******************************************************
2024-10-23T02:55:24.647088Z WSREP_SST: [ERROR] Cleanup after exit with status:22
2024-10-23T02:55:24.654755Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.1.44:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.44-48-57' '' --gtid 'e5c36aa9-90e4-11ef-bf49-4b17eebabc53:3' : 22 (Invalid argument)
2024-10-23T02:55:24.654787Z 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.1.44:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.44-48-57' '' --gtid 'e5c36aa9-90e4-11ef-bf49-4b17eebabc53:3'
2024-10-23T02:55:24.655726Z 0 [Warning] WSREP: 0.0 (pxc-cluster-node-1): State transfer to 1.0 (pxc-cluster-node-2) failed: -22 (Invalid argument)

It seems the SST process is failing with an “Invalid argument” error (status: 22). Could you kindly assist us in identifying the cause and suggest a resolution?

Thanks in advance for your help!

Best regards,
Rahul Ambekar

Yes, upgrade to PXC 8.0. Many processes, including the SST process, have been greatly enhanced and bug-fixed since 5.7.

5.7 is dead. Even if we find a bug, it won’t get fixed. Your only option is to first upgrade to the current version and see if the issue remains. If it does, then we can investigate further.

Hi Matthew,

Thank you for the recommendation regarding upgrading to Percona XtraDB Cluster (PXC) 8.0. We understand that PXC 5.7 is now deprecated, with limited support and bug fixes, and I appreciate your advice to move to the latest version.

We will prioritize planning an upgrade to PXC 8.0, as this version includes bug fixes.
However, our application, OXID eShop, is currently on an older version and does not support MySQL 8.0+.

Thank you for your guidance.

Best regards,
Rahul Ambekar.

Dear Percona Team,

We are experiencing a connectivity issue within our Percona cluster setup for the OXID eShop application. The setup currently includes:

Master Node (A): Handles all writes (DDL, DML).
Secondary Nodes (B, C, D): Manage read operations (SELECT).
In our configuration, reads are distributed among nodes B, C, and D:

$this->dbHost = 'A';
$this->aSlaveHosts = array('B', 'C', 'D');

However, if node D is down, read operations are still routed to it and fail at the application end. This outage is directly impacting the application’s performance and stability.

We need your expertise and guidance to ensure application continuity and avoid broken read requests. Please assist with the PHP failover configuration:

→ Removing node D temporarily from the read configuration.
→ Implementing a failover solution for automated re-routing of read operations in the event of node failures.

Please prioritize this issue and let us know the expected resolution. Your prompt attention will help restore application performance and stability.

Thank you,
Rahul Ambekar

I think you might have misunderstood me. 5.7 is not deprecated, it is dead, End-of-Life. There is no support, and no bug fixes.

What are you using for connection proxy? Your proxy should not be sending connections to D if D is down. This is a failure in your proxy, not in PXC. PXC does not route queries.

This sounds like you do not have a proxy between your application and PXC. You need this. You need to set up either ProxySQL or HAProxy to manage connections from app → PXC. PXC does not manage connections.
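For illustration only, a minimal HAProxy TCP listener for the read traffic could look roughly like this (addresses, ports, and the clustercheck health-check port are assumptions; adapt to your environment):

# /etc/haproxy/haproxy.cfg -- sketch
listen pxc_reads
    bind 0.0.0.0:3307
    mode tcp
    balance leastconn
    option httpchk                    # health check via the clustercheck script, commonly served on port 9200
    server node-b 10.0.1.42:3306 check port 9200 inter 2000 rise 2 fall 3
    server node-c 10.0.1.43:3306 check port 9200 inter 2000 rise 2 fall 3
    server node-d 10.0.1.44:3306 check port 9200 inter 2000 rise 2 fall 3

With something like that in place, the application’s aSlaveHosts would point at the HAProxy address instead of the individual nodes, and a down node is pulled out of rotation automatically.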

These forums are free support given from volunteer efforts from Percona staff. If you need higher priority, please contact us for a support contract.

Dear Percona Support Team,

I’m seeking guidance for a Percona XtraDB Cluster setup in a test environment involving high availability and data redundancy testing. Here’s the current setup and the issue encountered:

Setup:
Cluster Nodes: 4 test database nodes, with a 3-node cluster (Nodes 1, 2, and 3) and a 4th node as a backup.
Goal: To ensure seamless replication sync between Nodes 1 and 4 and to add Node 4 to the cluster when required.
Procedure:
Installed Percona Cluster on all 4 nodes.
Manually stopped MySQL services on Node 3, keeping replication in sync with Node 4.
Attempted to add Node 4 to the cluster, but encountered the following errors:

2024-10-25T14:37:46.849910Z 2 [Note] WSREP: /usr/sbin/mysqld: Terminated.
2024-10-25T14:37:46.852562Z WSREP_SST: [ERROR] Removing /var/lib/mysql//xtrabackup_galera_info file due to signal
2024-10-25T14:37:46.857031Z WSREP_SST: [ERROR] Removing file due to signal
2024-10-25T14:37:46.861614Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2024-10-25T14:37:46.863387Z WSREP_SST: [ERROR] SST script interrupted
2024-10-25T14:37:46.864729Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2024-10-25T14:37:46.866161Z WSREP_SST: [ERROR] Cleanup after exit with status:32

Request:
Could you please advise on the following:

Possible causes of the SST script interruption and resolution steps.
Best practices for setting up a 4-node configuration where the 4th node acts as an on-demand backup within the Percona cluster.
Your expertise on resolving this SST error and optimizing the node setup would be invaluable.

Thank you for your assistance

Best regards,
Rahul Ambekar

Hello @rahul_ambekar

These forums are not our direct support team. If you need immediate assistance please contact us for a support contract.

The log files you provided above are incomplete. We need to see more complete logs from joiner, and donor, including the xtrabackup log file created during the SST process.

Hi Matthew,

Could you please review the error logs below for the main master node (in the cluster) and the backup slave node, and provide guidance on identifying the root cause of the failure when the backup slave node attempts to join the cluster.

Master (cluster node) error log:
Nothing was found in the master’s error log.

Backup(slave node) error log:
2024-10-28T09:12:24.220826Z 2 [Note] WSREP: Current view of cluster as seen by this node
view ((empty))
2024-10-28T09:12:24.221006Z 2 [Note] WSREP: gcomm: closed
2024-10-28T09:12:24.221080Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2024-10-28T09:12:24.221122Z 0 [Note] WSREP: Flow-control interval: [100, 100]
2024-10-28T09:12:24.221128Z 0 [Note] WSREP: Received NON-PRIMARY.
2024-10-28T09:12:24.221134Z 0 [Note] WSREP: Shifting PRIMARY → OPEN (TO: 6370)
2024-10-28T09:12:24.221146Z 0 [Note] WSREP: Received self-leave message.
2024-10-28T09:12:24.221152Z 0 [Note] WSREP: Flow-control interval: [0, 0]
2024-10-28T09:12:24.221156Z 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2024-10-28T09:12:24.221161Z 0 [Note] WSREP: Shifting OPEN → CLOSED (TO: 6370)
2024-10-28T09:12:24.221168Z 0 [Note] WSREP: RECV thread exiting 0: Success
2024-10-28T09:12:24.221271Z 2 [Note] WSREP: recv_thread() joined.
2024-10-28T09:12:24.221302Z 2 [Note] WSREP: Closing replication queue.
2024-10-28T09:12:24.221308Z 2 [Note] WSREP: Closing slave action queue.
2024-10-28T09:12:24.221319Z 2 [Note] WSREP: /usr/sbin/mysqld: Terminated.
2024-10-28T09:12:24.224090Z WSREP_SST: [ERROR] Removing /var/lib/mysql//xtrabackup_galera_info file due to signal
2024-10-28T09:12:24.228598Z WSREP_SST: [ERROR] Removing file due to signal
2024-10-28T09:12:24.230073Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2024-10-28T09:12:24.230556Z WSREP_SST: [ERROR] SST script interrupted
2024-10-28T09:12:24.231036Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2024-10-28T09:12:24.231699Z WSREP_SST: [ERROR] Cleanup after exit with status:32

Waiting for your response. Thank you for your support.

Best Regards,
Rahul Ambekar

Hello @rahul_ambekar,
As I said above, the log files you provided are incomplete. We need to see more complete logs from the joiner and the donor, including the xtrabackup log file created during the SST process. Please provide all 3 log files so that we can help.

Hi Matthew,

I found the following error message in the application log file:

[2024-11-07 05:29:26] OXID Logger.ERROR: WSREP has not yet prepared node for application use ["[object] (OxidEsales\Eshop\Core\Exception\DatabaseErrorException(code: 1047): WSREP has not yet prepared node for application use at /var/www/prod/estar/eshop/releases/20241105113839/vendor/oxid-esales/oxideshop-ce/source/Core/Database/Adapter/Doctrine/Database.php:955, Doctrine\DBAL\Exception\DriverException(code: 0): An exception occurred while executing 'INSERT INTO oxcache ( oxid, oxexpire, oxreseton, oxsize, oxhits, oxshopid ) VALUES( ?, ?, ?, ?, ?, ? ) ON DUPLICATE KEY UPDATE oxexpire = ?, oxreseton = ?, oxsize = ?, oxhits = ?, oxshopid = ? ’ with params ["1a1a62ed7a1301c1f4dfe782d89b87e2", 1730989761, "ox|cid=3cf37043bed6dbac3c718d10b9fd837b|cl=details|anid=8357af32dcb0b9fe3a0d5d27b550286d", 344535, 0, 11, 1730989761, "ox|cid=3cf37043bed6dbac3c718d10b9fd837b|cl=details|anid=8357af32dcb0b9fe3a0d5d27b550286d", 344535, 0, 11]:\n\nSQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use at

However, no errors were found in the database logs. The application was down for approximately 15 minutes.
Could you please assist in identifying a suitable solution to prevent such incidents in the future?

Thanks and Regards,
Rahul Ambekar

This error means that the node the app is connected to is not connected to the rest of the cluster, or has not finished any IST/SST processes to sync with the cluster.
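A quick way to see that state from the node itself, for example:

# Run against the node the application connects to
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_ready','wsrep_cluster_status','wsrep_local_state_comment')"
# wsrep_ready should be ON, wsrep_cluster_status should be Primary,
# and wsrep_local_state_comment should be Synced before the app uses the node.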

You need to find the SST logs inside the $datadir for the donor node and the joiner node in order to understand what happened.
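For example, with the /var/lib/mysql datadir from your earlier logs, the files to collect are usually these (names can vary slightly with the SST method and version):

# Donor: xtrabackup's backup log written during the SST
ls -l /var/lib/mysql/innobackup.backup.log

# Joiner: the prepare/move logs and the temporary .sst directory
ls -l /var/lib/mysql/innobackup.prepare.log /var/lib/mysql/innobackup.move.log
ls -ld /var/lib/mysql/.sst 2>/dev/null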