
Joining second node without cluster lock

aesellars Contributor Inactive User Role Beginner
When recovering from a failed-node scenario, is there a way to get back from a single remaining node to a multi-node cluster without locking the cluster when the next node rejoins?

I have a 3-node cluster plus an arbitrator, and a node had to be taken out for maintenance. When it went to rejoin, it synced from one of the two remaining nodes, hung, crashed, and took the donor node down with it. This has left me in a single-node state, and I'd like to be able to get back to multi-node without taking an outage.

I'm using xtrabackup and xbstream as my SST method, but I've noticed that when two nodes sync this way, the donor node gets locked. Is there a way around this? Given enough nodes (even with just two already in sync), adding a third node lets one node remain active and serve requests. But what do you do when you're down to one node and need to recover without locking the whole cluster for an hour while the data syncs?
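
For reference, the SST setup in question looks roughly like this in my.cnf (a sketch only; hostnames and credentials are placeholders, not my actual values):

    [mysqld]
    wsrep_cluster_address=gcomm://node1,node2,node3
    wsrep_sst_method=xtrabackup-v2
    wsrep_sst_auth=sstuser:sstpassword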

Comments

  • jrivera Percona Support Engineer Percona Staff Role
    You can try manually SSTing a node with Xtrabackup. See if this method works for you -

    Say Node1 is the donor node and Node2 is the joiner
    node1> innobackupex --galera-info /path/to/backup
    

    Move/copy the backup to Node2
    node2> innobackupex --apply-log /path/to/backup
    node2> rm -rf /path/to/datadir
    node2> cp -av /path/to/backup /path/to/datadir
    node2> chown -R mysql:mysql /path/to/datadir
    

    Check Galera GTID:
    node2> cat /var/lib/mysql/xtrabackup_galera_info
    8797f811-7f73-11e2-0800-8b513b3819c1:22809
    

    Initialize grastate.dat using the uuid and seqno from xtrabackup_galera_info:
    node2> vim /var/lib/mysql/grastate.dat
    node2> chown -R mysql:mysql /var/lib/mysql/grastate.dat
    node2> cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid: 8797f811-7f73-11e2-0800-8b513b3819c1
    seqno: 22809
    cert_index:
    

    Then start node2
    node2> service mysql start
    

    If all goes well node2 should start with IST only.
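    To double-check that the join used IST rather than a full SST (a sketch; the log path assumes a default install), look for the IST line in the error log and confirm the node state once it is up:
     node2> grep -i 'IST received' /var/log/mysqld.log
     node2> mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
    wsrep_local_state_comment should report Synced once the joiner has caught up.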
  • aesellars Contributor Inactive User Role Beginner
    I'll give this a try! Thanks!
  • aesellars Contributor Inactive User Role Beginner
    This worked perfectly.

    Thanks so much!
  • aesellars Contributor Inactive User Role Beginner
    So the node comes in, lasts about 5 minutes, and then crashes. I'm getting the following in the mysql error log:
    2016-01-11 14:47:28 26516 [Note] WSREP: IST received: c21bef5c-a863-11e5-a95e-86feb94b37d0:42183257
    2016-01-11 14:47:28 26516 [Note] WSREP: 2.0 (moodledata03): State transfer from 1.0 (moodledata01) complete.
    2016-01-11 14:47:28 26516 [Note] WSREP: Shifting JOINER -> JOINED (TO: 42218510)
    2016-01-11 14:47:59 26516 [Note] WSREP: Member 2.0 (moodledata03) synced with group.
    2016-01-11 14:47:59 26516 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 42221260)
    2016-01-11 14:47:59 26516 [Note] WSREP: Synchronized with group, ready for connections
    2016-01-11 14:47:59 26516 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    19:50:28 UTC - mysqld got signal 11 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    We will try our best to scrape up some info that will hopefully help
    diagnose the problem, but since we have already crashed,
    something is definitely wrong and this may fail.
    Please help us make Percona XtraDB Cluster better by reporting any
    bugs at https://bugs.launchpad.net/percona-xtradb-cluster
    
    key_buffer_size=134217728
    read_buffer_size=4194304
    max_used_connections=28
    max_threads=1002
    thread_count=32
    connection_count=23
    It is possible that mysqld could use up to
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 12457966 K  bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.
    
    Thread pointer: 0x3f88ce40
    Attempting backtrace. You can use the following information to find out
    where mysqld died. If you see no messages after this, something went
    terribly wrong...
    Segmentation fault (core dumped)
    160111 15:04:40 mysqld_safe Number of processes running now: 0
    160111 15:04:40 mysqld_safe WSREP: not restarting wsrep node automatically
    160111 15:04:40 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
    

    Are there any known bugs with this method?
  • jrivera Percona Support Engineer Percona Staff Role
    The crash doesn't seem to be helpful. Are you able to start up this node again?
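    If it crashes again, a resolved stack trace would help narrow this down. One way to capture more detail (a sketch, assuming a mysqld_safe-based install with enough free disk for a core file) is to enable core dumps before restarting the node:
     node2> ulimit -c unlimited
    and add to my.cnf:
     [mysqld]
     core-file
     [mysqld_safe]
     core-file-size = unlimited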
  • EmergeBrandon Contributor Inactive User Role Beginner
    I am having this exact same issue after following the steps above.

    2016-06-14 20:58:13 25109 [Note] WSREP: IST received: f904a9a7-db79-11e5-ae9e-6ac3a9358431:936038530
    2016-06-14 20:58:13 25109 [Note] WSREP: 0.0 (10.0.3.20): State transfer from 1.0 (10.0.3.21) complete.
    2016-06-14 20:58:13 25109 [Note] WSREP: Shifting JOINER -> JOINED (TO: 936327934)
    2016-06-14 21:03:44 25109 [Note] WSREP: Member 0.0 (10.0.3.20) synced with group.
    2016-06-14 21:03:44 25109 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 936358072)
    2016-06-14 21:03:45 25109 [Note] WSREP: Synchronized with group, ready for connections
    2016-06-14 21:03:45 25109 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    02:04:31 UTC - mysqld got signal 11 ;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    We will try our best to scrape up some info that will hopefully help
    diagnose the problem, but since we have already crashed,
    something is definitely wrong and this may fail.
    Please help us make Percona XtraDB Cluster better by reporting any
    bugs at https://bugs.launchpad.net/percona-xtradb-cluster

    key_buffer_size=268435456
    read_buffer_size=131072
    max_used_connections=449
    max_threads=2002
    thread_count=311
    connection_count=286
    It is possible that mysqld could use up to
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1059535 K bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.

    Thread pointer: 0x14727f330
    Attempting backtrace. You can use the following information to find out
    where mysqld died. If you see no messages after this, something went
    terribly wrong...
    stack_bottom = 7f438c2f5d38 thread_stack 0x40000
    /usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8fd375]
    /usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x666264]
    /lib64/libpthread.so.0[0x3058e0f790]
    [0x7f3f04015ea0]

    Trying to get some variables.
    Some pointers may be invalid and cause the dump to abort.
    Query (7f3f04004bf0): is an invalid pointer
    Connection ID (thread ID): 4947
    Status: NOT_KILLED
  • EmergeBrandon Contributor Inactive User Role Beginner
    I have tried the same process on a different server and it worked without issue. Signal 11 typically indicates a segfault / memory issue, so I am going to reinstall mysql, wsrep, etc. on the node that wasn't working and then attempt the IST again.
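    For reference, a sketch of what that reinstall looks like on a CentOS/RHEL box with the Percona yum repository enabled and PXC 5.6 (the package name is an assumption; adjust for your platform and version):
     node> service mysql stop
     node> yum reinstall Percona-XtraDB-Cluster-56
     node> service mysql start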