Problem with SST, never works the first time

Hi all,
I have 3 PXC clusters:
the first PXC is master-master (called A),
the second PXC is master-master (called B) but a slave of cluster A,
the third PXC is master-master (called C) but a slave of cluster B.
If I take one node out of one of these clusters and IST cannot be used, the SST never works the first time: I need to restart the mysql daemon 2-3-4 times to get the SST to finish and the node back into the cluster.
This has happened since day 0 of the life of each of these clusters.
I have read about several ways to fix it and added the following to the my.cnf configuration on all the PXC clusters, but it keeps happening, and I would like to know whether this is normal or not.
/etc/my.cnf
wsrep_sst_method                = xtrabackup-v2
[xtrabackup]
open-files-limit = 1000000
I suppose this is not enough to fix the problem.
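One check I plan to do is to see what open files limit the running mysqld actually gets (just a rough sketch; pidof and the /proc path are assumptions about my CentOS 7 boxes):

    # effective limits of the running mysqld process
    cat /proc/$(pidof mysqld)/limits | grep -i "open files"
    # limit the mysql user would get in a shell
    su -s /bin/bash mysql -c 'ulimit -n'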
My DBs' total size is around 10 GB.
This is the log on the donor node, after the streaming of the last DB:
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01] Streaming ./wp_w3dev/wp_postmeta.ibd
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01]        …done
May  4 13:58:10 XXXXX -wsrep-sst-donor: Total time on donor: 0 seconds
May  4 13:58:10 XXXXX -wsrep-sst-donor: Cleaning up temporary directories
May  4 13:58:30 XXXXX journal: Suppressed 7644 messages from /system.slice/system-mysql.slice
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Node 8797245b state prim
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: view(view_id(PRIM,8797245b,5) memb {
May  4 13:58:30 XXXXX mysqld: #0118797245b,0
May  4 13:58:30 XXXXX mysqld: } joined {
May  4 13:58:30 XXXXX mysqld: } left {
May  4 13:58:30 XXXXX mysqld: } partitioned {
May  4 13:58:30 XXXXX mysqld: #011fc92078e,0
May  4 13:58:30 XXXXX mysqld: })
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: save pc into disk
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 92594d64-8dfe-11ea-abd2-8ee613652447 from 0 (perconaslv1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Quorum results:
May  4 13:58:30 XXXXX mysqld: #011version    = 6,
May  4 13:58:30 XXXXX mysqld: #011component  = PRIMARY,
May  4 13:58:30 XXXXX mysqld: #011conf_id    = 4,
May  4 13:58:30 XXXXX mysqld: #011members    = 1/1 (joined/total),
May  4 13:58:30 XXXXX mysqld: #011act_id     = 57039,
May  4 13:58:30 XXXXX mysqld: #011last_appl. = 56989,
May  4 13:58:30 XXXXX mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
May  4 13:58:30 XXXXX mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Flow-control interval: [16, 16]
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Trying to continue unpaused monitor
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:57039, view# 5: Primary, number of nodes: 1, my index: 0, protocol version 3
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Setting wsrep_ready to true
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Assign initial position for certification: 57039, protocol version: 4
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Service thread queue flushed.
May  4 13:58:35 XXXXX mysqld: 2020-05-04 13:58:35 4092 [Note] WSREP:  cleaning up fc92078e (tcp://172.17.26.22:4567)
May  4 14:01:01 XXXXX systemd: Started Session 149 of user root.
Maybe there is some other option to fix this problem?
Claudio
These are the Percona packages installed:
rpm -qa | grep -i percona
Percona-XtraDB-Cluster-shared-56-5.6.46-28.38.1.el7.x86_64
percona-release-1.0-17.noarch
Percona-XtraDB-Cluster-galera-3-3.38-1.el7.x86_64
percona-toolkit-3.2.0-1.el7.x86_64
percona-xtrabackup-2.3.10-1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-56-5.6.46-28.38.1.el7.x86_64

In the log of the donor I got this:
May  4 14:08:01 XXXXX -innobackupex-backup: xtrabackup: open files limit requested 1000000, set to 10005
Could this be the problem? How can I fix it?
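From what I read, the [xtrabackup] option is still capped by the OS-level limit of the process, so what I am thinking of trying (only a sketch; the unit name mysql and the values are assumptions for my CentOS 7 setup) is raising the limit at the service level:

    # systemd drop-in (systemctl edit creates /etc/systemd/system/mysql.service.d/override.conf)
    systemctl edit mysql
    #   [Service]
    #   LimitNOFILE=1000000
    systemctl daemon-reload
    systemctl restart mysql

    # and/or the per-user limit in /etc/security/limits.conf:
    #   mysql  soft  nofile  1000000
    #   mysql  hard  nofile  1000000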
Claudio

This is what I found on the joiner node on my 5th attempt: the SST finished, the directories were moved, and there is even a "completed OK!".
The problem starts right after the part I bolded:
May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 [01]        …done
May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 completed OK!
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:61728
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
May  4 14:24:04 XXXXX2 mysqld: Error in my_thread_global_end(): 1 threads didn’t exit
May  4 14:24:04 XXXXX2 mysqld_safe: 200504 14:24:04 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
May  4 14:24:04 XXXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
What does it mean? Is it a bug, or what?
It happens every time I add a node or want to recover one and IST can't be used, so SST is used.
Any hint?
Claudio

Anyway, on the second PXC (cluster B) I can't join any node; I always get
mysqld: Error in my_thread_global_end(): 1 threads didn’t exit
after the SST stream, after the move of the DB files, and after the "completed OK!".
So now the second cluster has only one node (the slave node replicating from cluster A), which is up in bootstrap mode but cannot get any other node to join.
I found a similar bug that was already fixed: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1190774
Mine reports 1 thread instead of two.
Any hint?
Claudio

May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
Do you know what was suppressed?
Also, I'd check the log files from the SST script on the Donor and the Joiner; they may contain some details.
This message is unlikely to be the main issue, as it simply says that some threads did not exit.

You may also show how the logs look different when it succeeds vs. when it fails; it may point to some ideas.
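If it helps, the SST script log files I mean are usually the innobackup files under the datadir (names may differ a bit between versions, so treat this as a sketch only):

    # on the Donor (assuming the default datadir /var/lib/mysql)
    grep -iE "error|fatal" /var/lib/mysql/innobackup.backup.log
    # on the Joiner
    grep -iE "error|fatal" /var/lib/mysql/innobackup.prepare.log /var/lib/mysql/innobackup.move.log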

Hi Peter, the problem now is that I have a slave PXC with only one bootstrapped node, which is a slave of the first PXC.
If I try to add a node (I have tried with both of the two I have), everything seems to go well up to the error above.
I don't understand why: the SST is over (so it was OK), the move is over (so no problem with the files); maybe the only thing missing is a restart of MySQL.
Claudio

Dear Peter, 
Now I tried again to join node 2 to the slave PXC, and I got this on the joiner:
May  8 18:01:06 XXXX2 -innobackupex-move: 200508 18:01:06 completed OK!
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:979821
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
May  8 18:01:18 XXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
May  8 18:01:18 XXXX2 mysqld_safe: 200508 18:01:18 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=1
May  8 18:01:19 XXXX2 mysql-systemd: /usr/bin/mysql-systemd: line 134: kill: (18574) - No such process
May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysqld_safe with PID 18574 has already exited: FAILURE
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysql already dead
May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=2
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql may be already dead
May  8 18:01:19 XXXX2 systemd: Failed to start Percona XtraDB Cluster.
May  8 18:01:19 XXXX2 systemd: Unit mysql.service entered failed state.
May  8 18:01:19 XXXX2 systemd: mysql.service failed.
This is the donor:
May  8 18:00:51 XXXX1 -wsrep-sst-donor: Total time on donor: 0 seconds
May  8 18:00:51 XXXX1 -wsrep-sst-donor: Cleaning up temporary directories
May  8 18:01:01 XXXX1 systemd: Started Session 252 of user root.
May  8 18:01:07 XXXX1 journal: Suppressed 7235 messages from /system.slice/system-mysql.slice
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Node 8797245b state prim
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: view(view_id(PRIM,8797245b,21) memb {
May  8 18:01:07 XXXX1 mysqld: #0118797245b,0
May  8 18:01:07 XXXX1 mysqld: } joined {
May  8 18:01:07 XXXX1 mysqld: } left {
May  8 18:01:07 XXXX1 mysqld: } partitioned {
May  8 18:01:07 XXXX1 mysqld: #011c73eb125,0
May  8 18:01:07 XXXX1 mysqld: })
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: save pc into disk
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 20e6042f-9145-11ea-a8da-d6aa898e6305
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305 from 0 (perconaslv1)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Quorum results:
May  8 18:01:07 XXXX1 mysqld: #011version    = 6,
May  8 18:01:07 XXXX1 mysqld: #011component  = PRIMARY,
May  8 18:01:07 XXXX1 mysqld: #011conf_id    = 20,
May  8 18:01:07 XXXX1 mysqld: #011members    = 1/1 (joined/total),
May  8 18:01:07 XXXX1 mysqld: #011act_id     = 979854,
May  8 18:01:07 XXXX1 mysqld: #011last_appl. = 979784,
May  8 18:01:07 XXXX1 mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
May  8 18:01:07 XXXX1 mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Flow-control interval: [16, 16]
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Trying to continue unpaused monitor
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:979854, view# 21: Primary, number of nodes: 1, my index: 0, protocol version 3
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Setting wsrep_ready to true
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Assign initial position for certification: 979854, protocol version: 4
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Service thread queue flushed.
May  8 18:01:13 XXXX1 mysqld: 2020-05-08 18:01:13 4092 [Note] WSREP:  cleaning up c73eb125 (tcp://172.17.26.22:4567)
I don’t know how to solve this.
Claudio

Well… you can see that on the Joiner the MySQL process exits before it is expected. There is, however, no information in this log about why.
I'd check why.
This is PXC 5.6, which is quite old, and I do not remember many of its behavior details any more. I'd encourage you to upgrade to at least 5.7.
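Since the journal rate-limited thousands of messages, I would read the MySQL error log file directly and, if needed, relax journald rate limiting while reproducing (the path and values below are just examples for a typical CentOS 7 install; the real error log is wherever log_error points):

    # look for the reason mysqld exited
    tail -200 /var/log/mysqld.log | grep -iE "error|abort"

    # optionally disable journald rate limiting temporarily in /etc/systemd/journald.conf:
    #   RateLimitInterval=0
    # then: systemctl restart systemd-journald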

OK, I found the problem, but I suspect this is a bug in PXC.
My DBs are located in
/var/lib/mysql
My binary logs are located in
/var/lib/mysql/binarylogs
When a node tries to join and IST is not possible, the first operation is a removal of /var/lib/mysql/*,
so the binarylogs directory is wiped.
In fact, in the log I found a "no such file" error for /var/lib/mysql/binarylogs.
After the SST and the move completed, it tries to start MySQL, but without the binarylogs directory it dies.
I suspect I will have to move the binary logs somewhere else, but I still suspect this is a bug in PXC.
Before, the binary logs were directly in /var/lib/mysql without any problems; now they are in a subdirectory of that same directory, and it gets wiped by the SST.
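As a quick workaround, I think I can simply recreate the directory on the joiner once the SST move is done, before starting mysqld again (only a sketch based on my paths and the usual mysql:mysql ownership):

    mkdir -p /var/lib/mysql/binarylogs
    chown -R mysql:mysql /var/lib/mysql/binarylogs
    systemctl start mysql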
Claudio

Great, you have found the problem.

You should not place binary logs in /var/lib/mysql/binlogs, as every directory in /var/lib/mysql is considered to be a database schema and must contain database schema objects and nothing else.
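Something along these lines would be a safer layout (only a sketch: the directory name is an example, and it must exist and be owned by the mysql user before the restart):

    mkdir -p /var/lib/mysql-binlog
    chown mysql:mysql /var/lib/mysql-binlog

    # /etc/my.cnf, [mysqld] section
    #   log_bin       = /var/lib/mysql-binlog/mysql-bin
    #   log_bin_index = /var/lib/mysql-binlog/mysql-bin.index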

I will move them outside of the MySQL data directory, but I didn't see any warnings about it.
Claudio

I guess it is assumed to be MySQL Common Sense :smile: