Problem with SST, never works the first time

Hi all,
I have 3 PXC clusters:
the first PXC is master-master (called A),
the second PXC is master-master (called B) but a slave of cluster A,
the third PXC is master-master (called C) but a slave of cluster B.
If I take one node out of one of these clusters and IST cannot be used, the SST never works the first time: I need to restart the mysql daemon 2-3-4 times to get the SST to finish and the node back into the cluster.
This has happened since day 0 of the life of each of these clusters.
I have read about several ways to fix it and added the following to the my.cnf configuration on all the PXC clusters, but it keeps happening, and I would like to know whether this is normal or not.
/etc/my.cnf
wsrep_sst_method                = xtrabackup-v2
[xtrabackup]
open-files-limit = 1000000
I suppose this is not enough to fix the problem.
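One check I plan to do is to see what open files limit the running mysqld actually gets (just a rough sketch; pidof and the /proc path are assumptions about my CentOS 7 boxes):

    # effective limits of the running mysqld process
    cat /proc/$(pidof mysqld)/limits | grep -i "open files"
    # limit the mysql user would get in a shell
    su -s /bin/bash mysql -c 'ulimit -n'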
My DBs' total size is around 10 GB.
This is the log on the donor node, after the streaming of the last DB:
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01] Streaming ./wp_w3dev/wp_postmeta.ibd
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01]        …done
May  4 13:58:10 XXXXX -wsrep-sst-donor: Total time on donor: 0 seconds
May  4 13:58:10 XXXXX -wsrep-sst-donor: Cleaning up temporary directories
May  4 13:58:30 XXXXX journal: Suppressed 7644 messages from /system.slice/system-mysql.slice
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Node 8797245b state prim
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: view(view_id(PRIM,8797245b,5) memb {
May  4 13:58:30 XXXXX mysqld: #0118797245b,0
May  4 13:58:30 XXXXX mysqld: } joined {
May  4 13:58:30 XXXXX mysqld: } left {
May  4 13:58:30 XXXXX mysqld: } partitioned {
May  4 13:58:30 XXXXX mysqld: #011fc92078e,0
May  4 13:58:30 XXXXX mysqld: })
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: save pc into disk
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 92594d64-8dfe-11ea-abd2-8ee613652447 from 0 (perconaslv1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Quorum results:
May  4 13:58:30 XXXXX mysqld: #011version    = 6,
May  4 13:58:30 XXXXX mysqld: #011component  = PRIMARY,
May  4 13:58:30 XXXXX mysqld: #011conf_id    = 4,
May  4 13:58:30 XXXXX mysqld: #011members    = 1/1 (joined/total),
May  4 13:58:30 XXXXX mysqld: #011act_id     = 57039,
May  4 13:58:30 XXXXX mysqld: #011last_appl. = 56989,
May  4 13:58:30 XXXXX mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
May  4 13:58:30 XXXXX mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Flow-control interval: [16, 16]
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Trying to continue unpaused monitor
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:57039, view# 5: Primary, number of nodes: 1, my index: 0, protocol version 3
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Setting wsrep_ready to true
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Assign initial position for certification: 57039, protocol version: 4
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Service thread queue flushed.
May  4 13:58:35 XXXXX mysqld: 2020-05-04 13:58:35 4092 [Note] WSREP:  cleaning up fc92078e (tcp://172.17.26.22:4567)
May  4 14:01:01 XXXXX systemd: Started Session 149 of user root.
Maybe there is some other option to fix this problem?
Claudio
These are the Percona packages installed:
rpm -qa | grep -i percona
Percona-XtraDB-Cluster-shared-56-5.6.46-28.38.1.el7.x86_64
percona-release-1.0-17.noarch
Percona-XtraDB-Cluster-galera-3-3.38-1.el7.x86_64
percona-toolkit-3.2.0-1.el7.x86_64
percona-xtrabackup-2.3.10-1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-56-5.6.46-28.38.1.el7.x86_64

In the log of the donor I got this:
May  4 14:08:01 XXXXX -innobackupex-backup: xtrabackup: open files limit requested 1000000, set to 10005
Could this be the problem? How can I fix it?
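From what I read, the [xtrabackup] option is still capped by the OS-level limit of the process, so what I am thinking of trying (only a sketch; the unit name mysql and the values are assumptions for my CentOS 7 setup) is raising the limit at the service level:

    # systemd drop-in (systemctl edit creates /etc/systemd/system/mysql.service.d/override.conf)
    systemctl edit mysql
    #   [Service]
    #   LimitNOFILE=1000000
    systemctl daemon-reload
    systemctl restart mysql

    # and/or the per-user limit in /etc/security/limits.conf:
    #   mysql  soft  nofile  1000000
    #   mysql  hard  nofile  1000000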
Claudio

This is what I found on the joiner node on my 5th attempt: the SST finished, the directories were moved, and there is even a "completed OK!".
The problem starts right after the part I bolded:
May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 [01]        …done
May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 completed OK!
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:61728
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
May  4 14:24:04 XXXXX2 mysqld: Error in my_thread_global_end(): 1 threads didn’t exit
May  4 14:24:04 XXXXX2 mysqld_safe: 200504 14:24:04 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
May  4 14:24:04 XXXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
What does it mean? Is it a bug, or what?
It happens every time I add a node or want to recover one and IST can't be used, so SST is used.
Any hint?
Claudio

Anyway, on the second PXC (cluster B) I can't join any node; I always get
mysqld: Error in my_thread_global_end(): 1 threads didn’t exit
after the SST stream, after the move of the DB files, and after the "completed OK!".
So now the second cluster has only one node (the slave node replicating from cluster A), which is up in bootstrap mode but cannot get any other node to join.
I found a similar bug that was already fixed: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1190774
Mine reports 1 thread instead of two.
Any hint?
Claudio

May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
Do you know what was suppressed?
Also, I'd check the log files from the SST script on the Donor and the Joiner; they may contain some details.
This message is unlikely to be the main issue, as it simply says that some threads did not exit.

You may also show how the logs look different when it succeeds vs. when it fails; it may point to some ideas.
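If it helps, the SST script log files I mean are usually the innobackup files under the datadir (names may differ a bit between versions, so treat this as a sketch only):

    # on the Donor (assuming the default datadir /var/lib/mysql)
    grep -iE "error|fatal" /var/lib/mysql/innobackup.backup.log
    # on the Joiner
    grep -iE "error|fatal" /var/lib/mysql/innobackup.prepare.log /var/lib/mysql/innobackup.move.log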

Hi Peter, the problem now is that I have a slave PXC with only one bootstrapped node, which is a slave of the first PXC.
If I try to add a node (I have tried with both of the two I have), everything seems to go well up to the error above.
I don't understand why: the SST is over (so it was OK), the move is over (so no problem with the files); maybe the only thing missing is a restart of MySQL.
Claudio

Dear Peter, 
Now I tried again to join node 2 to the slave PXC, and I got this on the joiner:
May  8 18:01:06 XXXX2 -innobackupex-move: 200508 18:01:06 completed OK!
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:979821
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
May  8 18:01:18 XXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
May  8 18:01:18 XXXX2 mysqld_safe: 200508 18:01:18 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=1
May  8 18:01:19 XXXX2 mysql-systemd: /usr/bin/mysql-systemd: line 134: kill: (18574) - No such process
May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysqld_safe with PID 18574 has already exited: FAILURE
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysql already dead
May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=2
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql may be already dead
May  8 18:01:19 XXXX2 systemd: Failed to start Percona XtraDB Cluster.
May  8 18:01:19 XXXX2 systemd: Unit mysql.service entered failed state.
May  8 18:01:19 XXXX2 systemd: mysql.service failed.
This is the donor:
May  8 18:00:51 XXXX1 -wsrep-sst-donor: Total time on donor: 0 seconds
May  8 18:00:51 XXXX1 -wsrep-sst-donor: Cleaning up temporary directories
May  8 18:01:01 XXXX1 systemd: Started Session 252 of user root.
May  8 18:01:07 XXXX1 journal: Suppressed 7235 messages from /system.slice/system-mysql.slice
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Node 8797245b state prim
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: view(view_id(PRIM,8797245b,21) memb {
May  8 18:01:07 XXXX1 mysqld: #0118797245b,0
May  8 18:01:07 XXXX1 mysqld: } joined {
May  8 18:01:07 XXXX1 mysqld: } left {
May  8 18:01:07 XXXX1 mysqld: } partitioned {
May  8 18:01:07 XXXX1 mysqld: #011c73eb125,0
May  8 18:01:07 XXXX1 mysqld: })
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: save pc into disk
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 20e6042f-9145-11ea-a8da-d6aa898e6305
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305 from 0 (perconaslv1)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Quorum results:
May  8 18:01:07 XXXX1 mysqld: #011version    = 6,
May  8 18:01:07 XXXX1 mysqld: #011component  = PRIMARY,
May  8 18:01:07 XXXX1 mysqld: #011conf_id    = 20,
May  8 18:01:07 XXXX1 mysqld: #011members    = 1/1 (joined/total),
May  8 18:01:07 XXXX1 mysqld: #011act_id     = 979854,
May  8 18:01:07 XXXX1 mysqld: #011last_appl. = 979784,
May  8 18:01:07 XXXX1 mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
May  8 18:01:07 XXXX1 mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Flow-control interval: [16, 16]
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Trying to continue unpaused monitor
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:979854, view# 21: Primary, number of nodes: 1, my index: 0, protocol version 3
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Setting wsrep_ready to true
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Assign initial position for certification: 979854, protocol version: 4
May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Service thread queue flushed.
May  8 18:01:13 XXXX1 mysqld: 2020-05-08 18:01:13 4092 [Note] WSREP:  cleaning up c73eb125 (tcp://172.17.26.22:4567)
I don’t know how to solve this.
Claudio

Well… you can see that on the Joiner the MySQL process exits before it is expected. There is, however, no information in this log about why.
I'd check why.
This is PXC 5.6, which is quite old, and I do not remember many of its behavior details any more. I'd encourage you to upgrade to at least 5.7.
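Since the journal rate-limited thousands of messages, I would read the MySQL error log file directly and, if needed, relax journald rate limiting while reproducing (the path and values below are just examples for a typical CentOS 7 install; the real error log is wherever log_error points):

    # look for the reason mysqld exited
    tail -200 /var/log/mysqld.log | grep -iE "error|abort"

    # optionally disable journald rate limiting temporarily in /etc/systemd/journald.conf:
    #   RateLimitInterval=0
    # then: systemctl restart systemd-journald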

OK, I found the problem, but I suspect this is a bug in PXC.
My DBs are located in
/var/lib/mysql
My binary logs are located in
/var/lib/mysql/binarylogs
When a node tries to join and IST is not possible, the first operation is a removal of /var/lib/mysql/*,
so the binarylogs directory is wiped.
In fact, in the log I found a "no such file" error for /var/lib/mysql/binarylogs.
After the SST and the move completed, it tries to start MySQL, but without the binarylogs directory it dies.
I suspect I will have to move the binary logs somewhere else, but I still suspect this is a bug in PXC.
Before, the binary logs were directly in /var/lib/mysql without any problems; now they are in a subdirectory of that same directory, and it gets wiped by the SST.
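As a quick workaround, I think I can simply recreate the directory on the joiner once the SST move is done, before starting mysqld again (only a sketch based on my paths and the usual mysql:mysql ownership):

    mkdir -p /var/lib/mysql/binarylogs
    chown -R mysql:mysql /var/lib/mysql/binarylogs
    systemctl start mysql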
Claudio

Great, you have found the problem.

You should not place binary logs in /var/lib/mysql/binlogs, as every directory in /var/lib/mysql is considered to be a database schema and must contain database schema objects and nothing else.
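Something along these lines would be a safer layout (only a sketch: the directory name is an example, and it must exist and be owned by the mysql user before the restart):

    mkdir -p /var/lib/mysql-binlog
    chown mysql:mysql /var/lib/mysql-binlog

    # /etc/my.cnf, [mysqld] section
    #   log_bin       = /var/lib/mysql-binlog/mysql-bin
    #   log_bin_index = /var/lib/mysql-binlog/mysql-bin.index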

I will move them outside of the MySQL data directory, but I didn't see any warnings about it.
Claudio

I guess it is assumed to be MySQL Common Sense :smile: