1 node out of sync and 1 node down out of 3 Node in MYSQL Percona Cluster

I have 3 Node MySQL Percona XDBCluster, Everything was working fine since last one year, on 12th July 2023 Xtrabackup failed on Node 2 and Node 3 with SQL Error (1047): WSREP has not yet prepared node for application use. When I checked Both Node 2 and Node 3 out of sync and disconnected from cluster. Node 1 running successfully. I am new in percona so do not know what exactly do in this scenario so I did search in google and restart the Node 2 Mysql Service but after when I try to Start mysql service it did not restarted. Following log I am getting in For Node 2 from Mysqld.log , this log of 12th July when i restart the Node 2 service.

2023-07-12T09:04:29.075812Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR **********************
2023-07-12T09:04:29.075929Z 0 [ERROR] [MY-000000] [WSREP-SST] Error while getting data from donor node: exit codes: 1 1
2023-07-12T09:04:29.076005Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1296
2023-07-12T09:04:29.076114Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************************************************
2023-07-12T09:04:29.076847Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role ‘joiner’ --address ‘10.1.1.2’ --datadir ‘/var/lib/mysql/’ --basedir ‘/usr/’ --plugindir ‘/usr/lib64/mysql/plugin/’ --defaults-file ‘/etc/my.cnf’ --defaults-group-suffix ‘’ --parent ‘1357854’ --mysqld-version ‘8.0.26-16.1’ ‘’ : 1 (Operation not permitted)
2023-07-12T09:04:29.076934Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2023-07-12T09:04:29.076976Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 1 (Operation not permitted)
2023-07-12T09:04:29.077191Z 3 [Note] [MY-000000] [Galera] Processing SST received
2023-07-12T09:04:29.077294Z 3 [Note] [MY-000000] [Galera] SST received: 00000000-0000-0000-0000-000000000000:-1
2023-07-12T09:04:29.077355Z 3 [System] [MY-000000] [WSREP] SST completed
2023-07-12T09:04:29.077470Z 2 [Note] [MY-000000] [Galera] str_proto_ver_: 3 sst_seqno_: -1 cc_seqno: 158078484 req->ist_len(): 72
2023-07-12T09:04:29.077558Z 2 [ERROR] [MY-000000] [Galera] Application received wrong state:
Received: 00000000-0000-0000-0000-000000000000
Required: 06fc3fb9-93d2-11ec-9278-bf6de0e1019a
2023-07-12T09:04:29.077617Z 2 [ERROR] [MY-000000] [Galera] Application state transfer failed. This is unrecoverable condition, restart required.
2023-07-12T09:04:29.077661Z 2 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2023-07-12T09:04:29.077714Z 2 [Note] [MY-000000] [Galera] Closing send monitor…
2023-07-12T09:04:29.077759Z 2 [Note] [MY-000000] [Galera] Closed send monitor.
2023-07-12T09:04:29.077800Z 2 [Note] [MY-000000] [Galera] gcomm: terminating thread
2023-07-12T09:04:29.077864Z 2 [Note] [MY-000000] [Galera] gcomm: joining thread
2023-07-12T09:04:29.077966Z 2 [Note] [MY-000000] [Galera] gcomm: closing backend
2023-07-12T09:04:30.085271Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node

Today is 25th July I am trying to restart mysql service again but I getting following error now.

Starting Percona XtraDB Cluster…
/usr/bin/mysql-systemd: line 60: /bin/rm: Argument list too long
mysql.service: Control process exited, code=exited status=126
mysql.service: Failed with result ‘exit-code’.
Failed to start Percona XtraDB Cluster.

On Node 1 I found only one member in /var/lib/mysql/gvwstate.dat file.

my_uuid: 00b52465-e318-11ec-8c3c-a21b3f463e22
#vwbeg
view_id: 3 00b52465-e318-11ec-8c3c-a21b3f463e22 140
bootstrap: 0
member: 00b52465-e318-11ec-8c3c-a21b3f463e22 0
#vwend

I want to up Node 2 and sync all 3 cluster, can someone please help what should I do now. I am thinking to remove node 2 and 3 from cluster and add it again, does it work ? Do i need to restart Node 1 ?

  1. can you share the logs of the donor ? which is the currently running node, share below from donor:
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';
  2. please add the below to the config file (my.cnf) and start the node2 again and share the logs
[sst]
wsrep_debug=1
  1. please confirm you have same xtrabackup version on both server

here is the output

--------------------------------------------------------------------
SHOW GLOBAL STATUS LIKE 'wsrep_cluster%';
Node 1 :
--------------
|Variable_name|Value|
|---|---|
|wsrep_cluster_weight|1|
|wsrep_cluster_capabilities||
|wsrep_cluster_conf_id|63|
|wsrep_cluster_size|1|
|wsrep_cluster_state_uuid|06fc3fb9-93d2-11ec-9278-bf6de0e1019a|
|wsrep_cluster_status|Primary|
----------------------------------------------------------------------------------------------
Node 3:
---------------
|Variable_name|Value|
|---|---|
|wsrep_cluster_capabilities||
|wsrep_cluster_conf_id|18446744073709551615|
|wsrep_cluster_size|0|
|wsrep_cluster_state_uuid|06fc3fb9-93d2-11ec-9278-bf6de0e1019a|
|wsrep_cluster_status|Disconnected|
--------------------------------------------------------------------------------------------
Node 1 My.cnf
---------------------
# Template my.cnf for PXC
# Edit to your requirements.
[client]
socket=/var/lib/mysql/mysql.sock

[mysqld]
server-id=1
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
pxc_encrypt_cluster_traffic=OFF
#general_log=ON
#general_log_file=/var/log/mysql/general.log
slow_query_log=ON
slow_query_log_file=/var/log/mysql/mysql-slowquery.log


# Binary log expiration period is 604800 seconds, which equals 7 days
binlog_expire_logs_seconds=604800

######## wsrep ###############
# Path to Galera library
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so

# Cluster connection URL contains IPs of nodes
#If no IP is found, this implies that a new cluster needs to be created,
#in order to do that you need to bootstrap this node
wsrep_cluster_address=gcomm://10.1.1.1,10.1.1.2,10.1.1.3

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW

# Slave thread to use
wsrep_slave_threads=8
wsrep_log_conflicts=ON

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=1
wsrep_auto_increment_control=OFF


# Node IP address
wsrep_node_address=10.1.1.1
# Cluster name
wsrep_cluster_name=xtradb_cluster03

#If wsrep_node_name is not specified,  then system hostname will be used
#wsrep_node_name=pxc-cluster-node-1

#pxc_strict_mode allowed values: DISABLED,PERMISSIVE,ENFORCING,MASTER
pxc_strict_mode=PERMISSIVE

# SST method
wsrep_sst_method=xtrabackup-v2

innodb_buffer_pool_size=10G
innodb_buffer_pool_instances=8
wait_timeout = 3600
connect_timeout = 600
lock_wait_timeout = 300
max_connections = 300
performance_schema_session_connect_attrs_size = 1024
default_storage_engine=InnoDB
innodb_flush_log_at_timeout = 3
innodb_read_io_threads = 32
innodb_write_io_threads = 16

# Node 2 My.cnf
# Template my.cnf for PXC
# Edit to your requirements.
[client]
socket=/var/lib/mysql/mysql.sock

[mysqld]
server-id=1
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
pxc_encrypt_cluster_traffic=OFF
#general_log=ON
#general_log_file=/var/log/mysql/general.log
slow_query_log=ON
slow_query_log_file=/var/log/mysql/mysql-slowquery.log


# Binary log expiration period is 604800 seconds, which equals 7 days
binlog_expire_logs_seconds=604800

######## wsrep ###############
# Path to Galera library
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so

# Cluster connection URL contains IPs of nodes
#If no IP is found, this implies that a new cluster needs to be created,
#in order to do that you need to bootstrap this node
wsrep_cluster_address=gcomm://10.1.1.1,10.1.1.2,1.1.1.3

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW

# Slave thread to use
wsrep_slave_threads=8
wsrep_log_conflicts

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=1
wsrep_auto_increment_control=OFF

# Node IP address
wsrep_node_address=10.1.1.2
# Cluster name
wsrep_cluster_name=xtradb_cluster03

#If wsrep_node_name is not specified,  then system hostname will be used
#wsrep_node_name=pxc-cluster-node-1

#pxc_strict_mode allowed values: DISABLED,PERMISSIVE,ENFORCING,MASTER
pxc_strict_mode=PERMISSIVE

# SST method
wsrep_sst_method=xtrabackup-v2

innodb_buffer_pool_size=10G
innodb_buffer_pool_instances=8
wait_timeout = 3600
connect_timeout = 600
lock_wait_timeout = 300
max_connections = 300
performance_schema_session_connect_attrs_size = 1024
default_storage_engine=InnoDB

I have added below line in My.cnf file of Node 2 and start service again but there no logs inserted into log file.

# SST method
[sst]
wsrep_debug=1

All 3 node having same xtrabackup version, percona-xtrabackup-80.x86_64 8.0.27-19.1.el8

Is it possible first I sync Node 3 with Node 2 and then I work on Node 2 ?

Could you check why the joiner address (10.1.25.53) does not match the node’s ip address(10.1.1.X)? Is the IP address been changed?

2023-07-12T09:04:29.076847Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role ‘joiner’ --address ‘10.1.25.53’ --datadir ‘/var/lib/mysql/’ --basedir ‘/usr/’ --plugindir ‘/usr/lib64/mysql/plugin/’ --defaults-file ‘/etc/my.cnf’ --defaults-group-suffix ‘’ --parent ‘1357854’ --mysqld-version ‘8.0.26-16.1’ ‘’ : 1 (Operation not permitted)

No ip was not change, i just use dummy ip my.cnf file

This is line 60, https://github.com/percona/percona-xtradb-cluster/blob/8.0/build-ps/rpm/mysql-systemd#L60

What do you have inside your $datadir? Too many files for even rm to handle. Do some manual cleanup first then attempt to start again.

What kind of manual clean i should do ? Should I delete entire data directory ? My data dir is /var/lib/mysql and I am not able see what inside it, but it has more that 2 millions files. I am able to list few file and I can see it has log file like GRA_12_31686628_v2.log, is it safe to delete these file because its from 2022.

That is incredibly wrong. Your datadir should never really have more than a hundred files in it, not counting directories for logical databases.

Yes, manually remove all of those GRA_*.log files.

Sure I will do this and update you. Thanks Matthew

Hi matthew, Thanks for your help, after deleting all junk logs and than I restart the service and it was up and Node 2 sync with Node 1, I did same thing on node 3 and restart service on node 3 but it was not up because of following error.

2023-07-27T09:51:05.154288Z 12 [Note] [MY-000000] [WSREP] Synchronized with group, ready for connections
2023-07-27T09:51:05.154321Z 12 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2023-07-27T09:51:18.505162Z 25 [ERROR] [MY-012592] [InnoDB] Operating system error number 2 in a file operation.
2023-07-27T09:51:18.505221Z 25 [ERROR] [MY-012593] [InnoDB] The error means the system cannot find the path specified.
2023-07-27T09:51:18.505230Z 25 [ERROR] [MY-012216] [InnoDB] Cannot open datafile for read-only: ‘./airflow/log.ibd’ OS error: 71
2023-07-27T09:51:18.505387Z 25 [Warning] [MY-012049] [InnoDB] Cannot calculate statistics for table airflow.log because the .ibd file is missing. Please refer to http://dev.mysql.com/doc/refman/8.0/en/innodb-troubleshooting.html for how to resolve the issue.
2023-07-27T09:51:19.170307Z 28 [Warning] [MY-012049] [InnoDB] Cannot calculate statistics for table airflow.log because the .ibd file is missing. Please refer to http://dev.mysql.com/doc/refman/8.0/en/innodb-troubleshooting.html for how to resolve the issue.
2023-07-27T09:51:19.450669Z 25 [ERROR] [MY-012592] [InnoDB] Operating system error number 2 in a file operation.
2023-07-27T09:51:19.450706Z 25 [ERROR] [MY-012593] [InnoDB] The error means the system cannot find the path specified.
2023-07-27T09:51:19.450716Z 25 [ERROR] [MY-012216] [InnoDB] Cannot open datafile for read-only: ‘./airflow/alembic_version.ibd’ OS error: 71
2023-07-27T09:51:19.450813Z 25 [Warning] [MY-012049] [InnoDB] Cannot calculate statistics for table airflow.alembic_version because the .ibd file is missing. Please refer to http://dev.mysql.com/doc/refman/8.0/en/innodb-troubleshooting.html for how to resolve the issue.
2023-07-27T09:51:19.722895Z 28 [Warning] [MY-012049] [InnoDB] Cannot calculate statistics for table airflow.alembic_version because the .ibd file is missing. Please refer to http://dev.mysql.com/doc/refman/8.0/en/innodb-troubleshooting.html for how to resolve the issue.

I checked in /var/lib/msql/airflow data directory and no .ibd file present in all node. But Airflow Db is present and able to access it
from Node 1 and Node 2.

Thanks Matthew, For all help, I am able to resolved all issue and now cluster is up with all 3 node and all node in sync now.

Thanks

Awesome! Keep an eye on the $datadir. As I said earlier, you should never have that many files in your $datadir.