
Problem with SST: it never works the first time

soprano (Contributor)
Hi all,
I have 3 PXC clusters:
the first PXC is master-master (called A)
the second PXC is master-master (called B), but it is a slave of cluster A
the third PXC is master-master (called C), but it is a slave of cluster B
If I take one node out of any of these clusters and an IST cannot be done, the SST never works on the first attempt; I have to restart the mysql daemon 2, 3 or 4 times before the SST finishes and the node gets back into the cluster.
This has happened since day 0 of the life of each of these clusters.
I read about several ways to fix it and added the following to the my.cnf configuration of all the clusters, but it keeps happening, and I would like to know whether this is normal or not.
/etc/my.cnf
wsrep_sst_method                = xtrabackup-v2
[xtrabackup]
open-files-limit = 1000000
I suppose this is not enough to fix the problem.
My total DB size is around 10 GB.
This is the log on the donor node, after the last database finished streaming:
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01] Streaming ./wp_w3dev/wp_postmeta.ibd
May  4 13:57:59 XXXXX -innobackupex-backup: 200504 13:57:59 [01]        ...done
May  4 13:58:10 XXXXX -wsrep-sst-donor: Total time on donor: 0 seconds
May  4 13:58:10 XXXXX -wsrep-sst-donor: Cleaning up temporary directories
May  4 13:58:30 XXXXX journal: Suppressed 7644 messages from /system.slice/system-mysql.slice
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Node 8797245b state prim
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: view(view_id(PRIM,8797245b,5) memb {
May  4 13:58:30 XXXXX mysqld: #0118797245b,0
May  4 13:58:30 XXXXX mysqld: } joined {
May  4 13:58:30 XXXXX mysqld: } left {
May  4 13:58:30 XXXXX mysqld: } partitioned {
May  4 13:58:30 XXXXX mysqld: #011fc92078e,0
May  4 13:58:30 XXXXX mysqld: })
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: save pc into disk
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: forgetting fc92078e (tcp://172.17.26.22:4567)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 92594d64-8dfe-11ea-abd2-8ee613652447
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 92594d64-8dfe-11ea-abd2-8ee613652447 from 0 (perconaslv1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Quorum results:
May  4 13:58:30 XXXXX mysqld: #011version    = 6,
May  4 13:58:30 XXXXX mysqld: #011component  = PRIMARY,
May  4 13:58:30 XXXXX mysqld: #011conf_id    = 4,
May  4 13:58:30 XXXXX mysqld: #011members    = 1/1 (joined/total),
May  4 13:58:30 XXXXX mysqld: #011act_id     = 57039,
May  4 13:58:30 XXXXX mysqld: #011last_appl. = 56989,
May  4 13:58:30 XXXXX mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
May  4 13:58:30 XXXXX mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Flow-control interval: [16, 16]
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Trying to continue unpaused monitor
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:57039, view# 5: Primary, number of nodes: 1, my index: 0, protocol version 3
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Setting wsrep_ready to true
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Assign initial position for certification: 57039, protocol version: 4
May  4 13:58:30 XXXXX mysqld: 2020-05-04 13:58:30 4092 [Note] WSREP: Service thread queue flushed.
May  4 13:58:35 XXXXX mysqld: 2020-05-04 13:58:35 4092 [Note] WSREP:  cleaning up fc92078e (tcp://172.17.26.22:4567)
May  4 14:01:01 XXXXX systemd: Started Session 149 of user root.
Is there maybe some other option to fix this problem?
Claudio
These are the percona packages installed:
rpm -qa | grep -i percona
Percona-XtraDB-Cluster-shared-56-5.6.46-28.38.1.el7.x86_64
percona-release-1.0-17.noarch
Percona-XtraDB-Cluster-galera-3-3.38-1.el7.x86_64
percona-toolkit-3.2.0-1.el7.x86_64
percona-xtrabackup-2.3.10-1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.46-28.38.1.el7.x86_64
Percona-XtraDB-Cluster-56-5.6.46-28.38.1.el7.x86_64

Best Answer

  • soprano (Contributor)
    Accepted Answer
    OK, I found the problem, but I suspect it is a bug in PXC.
    My databases are located in
    /var/lib/mysql
    and my binary logs are located in
    /var/lib/mysql/binarylogs
    When a node tries to join and an IST cannot be done, the first thing the SST does is remove everything under /var/lib/mysql/, so the binarylogs directory is wiped; in fact, in the log I found a '/var/lib/mysql/binarylogs: no such file' error.
    After the SST and the move step completed, it tried to start mysql, but without the binarylogs directory it died.
    I suppose I will have to move the binary logs somewhere else, but I still suspect this is a PXC bug.
    Before, the binary logs were directly in the /var/lib/mysql directory without any problem; now they are in a subdirectory of that same directory, and it gets wiped by the SST.
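    A minimal sketch of what moving them out of the datadir could look like (the /var/log/mysql-binlog path and the ownership commands are just an example, not my actual setup):
    # /etc/my.cnf, [mysqld] section: point the binary logs outside /var/lib/mysql
    log_bin = /var/log/mysql-binlog/mysql-bin
    # create the new directory and hand it to the mysql user before restarting mysqld
    mkdir -p /var/log/mysql-binlog
    chown mysql:mysql /var/log/mysql-binlog
    That way the SST can wipe the datadir without touching the binary logs.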
    Claudio

Answers

  • soprano (Contributor)
    In the donor's log I got this:
    May  4 14:08:01 XXXXX -innobackupex-backup: xtrabackup: open files limit requested 1000000, set to 10005
    Could this be the problem? How can I fix it?
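    In case it is the OS limit that caps it, this is a sketch of how I would check and raise it, assuming the service is managed by systemd as mysql.service (as the logs suggest):
    # see which open-files limit the running mysqld actually got
    cat /proc/$(pidof mysqld)/limits | grep 'open files'
    # raise it for the service with a systemd drop-in, then restart
    mkdir -p /etc/systemd/system/mysql.service.d
    printf '[Service]\nLimitNOFILE=1000000\n' > /etc/systemd/system/mysql.service.d/limits.conf
    systemctl daemon-reload
    systemctl restart mysql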
    Claudio
  • soprano (Contributor)
    This is what I found on the joiner node on my 5th attempt, after the SST finished and the directory move as well; the 'completed OK!' is also there.
    The problem starts right after that:
    May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 [01]        ...done
    May  4 14:23:52 XXXXX2 -innobackupex-move: 200504 14:23:52 completed OK!
    May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
    May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:61728
    May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
    May  4 14:23:52 XXXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
    May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
    May  4 14:24:04 XXXXX2 mysqld: Error in my_thread_global_end(): 1 threads didn't exit
    May  4 14:24:04 XXXXX2 mysqld_safe: 200504 14:24:04 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
    May  4 14:24:04 XXXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
    What does it mean? Is it a bug or something else?
    It happens every time I add a node or want to recover one and IST cannot be used, so SST is used instead.
    Any hint ?
    Claudio
  • soprano (Contributor)
    Anyway, on the second PXC cluster (B) I cannot get any node to join; I always get
    mysqld: Error in my_thread_global_end(): 1 threads didn't exit
    after the SST stream, after the move of the DB files, and after the 'completed OK'.
    So now the second cluster has only one node (the slave node replicating from cluster A), which is up in bootstrap mode but cannot get any other node to join.
    I found that a similar bug was fixed, https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1190774
    but mine reports 1 thread instead of two.
    Any hint ?
    Claudio
  • Peter (Percona CEO, Moderator)
    May  4 14:24:04 XXXXX2 journal: Suppressed 13701 messages from /system.slice/mysql.service
    Do you know what was suppressed?
    Also, I'd check the log files written by the SST script on the donor and the joiner; they may contain some details.
    This message is unlikely to be the main issue, as it simply says that some threads did not exit.

    You may also show how the logs look different when it succeeds vs. when it fails - it may point to some ideas.
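    If it helps, with the xtrabackup-v2 method the SST script usually leaves its own log files in the data directory; assuming the default /var/lib/mysql datadir, something like this should show them:
    # donor side
    less /var/lib/mysql/innobackup.backup.log
    # joiner side
    less /var/lib/mysql/innobackup.prepare.log
    less /var/lib/mysql/innobackup.move.log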
  • soprano (Contributor)
    Hi Peter, the problem now is that I have a slave PXC cluster with only one bootstrapped node, which is a slave of the first cluster.
    If I try to add a node (I have tried with both of the other two), everything seems to go well up to the error above.
    I don't understand why: the SST is over (so it was OK), the move is over (so there is no problem with the files); the only thing that is missing, maybe, is a restart of mysql.
    Claudio
  • soprano (Contributor)
    Dear Peter, 
    Now I tried again to join node 2 to the slave PXC cluster, and this is what I got on the joiner:
    May  8 18:01:06 XXXX2 -innobackupex-move: 200508 18:01:06 completed OK!
    May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
    May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Galera co-ords from recovery: 8798a126-8dfc-11ea-8a03-ce798970727a:979821
    May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Total time on joiner: 0 seconds
    May  8 18:01:06 XXXX2 -wsrep-sst-joiner: Removing the sst_in_progress file
    May  8 18:01:18 XXXX2 mysqld_safe: mysqld from pid file /run/mysqld/mysql.pid ended
    May  8 18:01:18 XXXX2 mysqld_safe: 200508 18:01:18 mysqld_safe mysqld from pid file /run/mysqld/mysql.pid ended
    May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=1
    May  8 18:01:19 XXXX2 mysql-systemd: /usr/bin/mysql-systemd: line 134: kill: (18574) - No such process
    May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysqld_safe with PID 18574 has already exited: FAILURE
    May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
    May  8 18:01:19 XXXX2 mysql-systemd: ERROR! mysql already dead
    May  8 18:01:19 XXXX2 systemd: mysql.service: control process exited, code=exited status=2
    May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql pid file /run/mysqld/mysql.pid empty or not readable
    May  8 18:01:19 XXXX2 mysql-systemd: WARNING: mysql may be already dead
    May  8 18:01:19 XXXX2 systemd: Failed to start Percona XtraDB Cluster.
    May  8 18:01:19 XXXX2 systemd: Unit mysql.service entered failed state.
    May  8 18:01:19 XXXX2 systemd: mysql.service failed.
    And this is the donor:
    May  8 18:00:51 XXXX1 -wsrep-sst-donor: Total time on donor: 0 seconds
    May  8 18:00:51 XXXX1 -wsrep-sst-donor: Cleaning up temporary directories
    May  8 18:01:01 XXXX1 systemd: Started Session 252 of user root.
    May  8 18:01:07 XXXX1 journal: Suppressed 7235 messages from /system.slice/system-mysql.slice
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Node 8797245b state prim
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: view(view_id(PRIM,8797245b,21) memb {
    May  8 18:01:07 XXXX1 mysqld: #0118797245b,0
    May  8 18:01:07 XXXX1 mysqld: } joined {
    May  8 18:01:07 XXXX1 mysqld: } left {
    May  8 18:01:07 XXXX1 mysqld: } partitioned {
    May  8 18:01:07 XXXX1 mysqld: #011c73eb125,0
    May  8 18:01:07 XXXX1 mysqld: })
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: save pc into disk
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: forgetting c73eb125 (tcp://172.17.26.22:4567)
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 20e6042f-9145-11ea-a8da-d6aa898e6305
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: sent state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: STATE EXCHANGE: got state msg: 20e6042f-9145-11ea-a8da-d6aa898e6305 from 0 (perconaslv1)
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Quorum results:
    May  8 18:01:07 XXXX1 mysqld: #011version    = 6,
    May  8 18:01:07 XXXX1 mysqld: #011component  = PRIMARY,
    May  8 18:01:07 XXXX1 mysqld: #011conf_id    = 20,
    May  8 18:01:07 XXXX1 mysqld: #011members    = 1/1 (joined/total),
    May  8 18:01:07 XXXX1 mysqld: #011act_id     = 979854,
    May  8 18:01:07 XXXX1 mysqld: #011last_appl. = 979784,
    May  8 18:01:07 XXXX1 mysqld: #011protocols  = 0/9/3 (gcs/repl/appl),
    May  8 18:01:07 XXXX1 mysqld: #011group UUID = 8798a126-8dfc-11ea-8a03-ce798970727a
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Flow-control interval: [16, 16]
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Trying to continue unpaused monitor
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: New cluster view: global state: 8798a126-8dfc-11ea-8a03-ce798970727a:979854, view# 21: Primary, number of nodes: 1, my index: 0, protocol version 3
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Setting wsrep_ready to true
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 2 -> 1)
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: REPL Protocols: 9 (4, 2)
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Assign initial position for certification: 979854, protocol version: 4
    May  8 18:01:07 XXXX1 mysqld: 2020-05-08 18:01:07 4092 [Note] WSREP: Service thread queue flushed.
    May  8 18:01:13 XXXX1 mysqld: 2020-05-08 18:01:13 4092 [Note] WSREP:  cleaning up c73eb125 (tcp://172.17.26.22:4567)
    I don't know how to solve this.
    Claudio
  • Peter (Percona CEO, Moderator)
    Well... you can see that on the joiner the MySQL process exits before it is expected. There is, however, no information in this log about why.
    I'd check why.
    This is PXC 5.6, which is quite old, and I do not remember a lot of its behavioral details any more. I'd encourage you to upgrade to at least 5.7.
  • Peter (Percona CEO, Moderator)
    Great that you have found the problem!

    You should not place binary logs in /var/lib/mysql/binarylogs, as every directory in /var/lib/mysql is considered to be a database schema and must contain database schema objects and nothing else.
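    As a quick check after the move (a sketch using standard server variables), the binary log basename should no longer sit under the datadir:
    mysql -e "SHOW VARIABLES LIKE 'log_bin_basename'"
    mysql -e "SHOW VARIABLES LIKE 'datadir'"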
  • soprano (Contributor)
    I will move them outside of the mysql directory, but I never saw any warning about it.
    Claudio
  • Peter (Percona CEO, Moderator)
    I guess it is assumed to be MySQL Common Sense :smile: