PXC-8 - Failed to join after restarting the node

Hello,

I'm testing PXC for MySQL 8. I have been able to configure and init the cluster with 5 nodes. During the testing, I'm having some issues.

Testing procedure:

  1. Init the cluster with 5 nodes -> OK
  2. Restart the bootstrapped node (1st) in normal mode -> OK
  3. In the log I can see that the node is still in bootstrap mode, but I'm starting it normally (systemctl start mysql); see the command sketch after the log below
  4. Stop the first node and start it again -> FAIL (see below)
2020-10-23T09:14:30.882507Z 0 [Warning] [MY-000000] [WSREP] Node is running in bootstrap/initialize mode. Disabling pxc_strict_mode checks
2020-10-24T17:05:15.524264Z 2 [Note] [MY-000000] [WSREP] Server status change connected -> joiner
2020-10-24T17:05:15.524288Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-24T17:05:15.524373Z 0 [Note] [MY-000000] [WSREP] Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.5.70' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '4556' --mysqld-version '8.0.20-11.3'   '' )
2020-10-24T17:05:16.109763Z 2 [Note] [MY-000000] [WSREP] Prepared SST request: xtrabackup-v2|192.168.5.70:4444/xtrabackup_sst//1
2020-10-24T17:05:16.109810Z 2 [Note] [MY-000000] [Galera] Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2020-10-24T17:05:16.109898Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2020-10-24T17:05:16.109961Z 2 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-10-24T17:05:16.109994Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST
2020-10-24T17:05:16.110012Z 2 [Note] [MY-000000] [Galera] Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 450e8f4b-1510-11eb-91bd-a6ab83d20e9a
2020-10-24T17:05:16.110030Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:00000000-0000-0000-0000-000000000000 f: 0, l: 3627, STRv: 3
2020-10-24T17:05:16.110168Z 2 [Note] [MY-000000] [Galera] IST receiver addr using tcp://192.168.5.70:4568
2020-10-24T17:05:16.110266Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 0-3627, listening at: tcp://192.168.5.70:4568
2020-10-24T17:05:18.019992Z 0 [Note] [MY-000000] [Galera] (158e52b0, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-24T17:06:55.887925Z 0 [ERROR] [MY-000000] [WSREP-SST] ******************* FATAL ERROR ********************** 
2020-10-24T17:06:55.888135Z 0 [ERROR] [MY-000000] [WSREP-SST] Possible timeout in receving first data from donor in gtid/keyring stage
2020-10-24T17:06:55.888237Z 0 [ERROR] [MY-000000] [WSREP-SST] Line 1108
2020-10-24T17:06:55.888320Z 0 [ERROR] [MY-000000] [WSREP-SST] ****************************************************** 
2020-10-24T17:06:55.888387Z 0 [ERROR] [MY-000000] [WSREP-SST] Cleanup after exit with status:32
2020-10-24T17:06:55.905996Z 0 [ERROR] [MY-000000] [WSREP] Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.5.70' --datadir '/var/lib/mysql/' --basedir '/usr/' --plugindir '/usr/lib/mysql/plugin/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '4556' --mysqld-version '8.0.20-11.3'   '' : 32 (Broken pipe)
2020-10-24T17:06:55.906143Z 0 [ERROR] [MY-000000] [WSREP] Failed to read uuid:seqno from joiner script.
2020-10-24T17:06:55.90629Z 0 [ERROR] [MY-000000] [WSREP] SST script aborted with error 32 (Broken pipe)
2020-10-24T17:06:55.906571Z 3 [Note] [MY-000000] [Galera] Processing SST received
2020-10-24T17:06:55.906676Z 3 [Note] [MY-000000] [Galera] SST request was cancelled
2020-10-24T17:06:55.906820Z 3 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 32 (Broken pipe). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2020-10-24T17:06:55.906920Z 3 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2020-10-24T17:06:55.907042Z 3 [Note] [MY-000000] [Galera] Closing send monitor...
2020-10-24T17:06:55.907256Z 3 [Note] [MY-000000] [Galera] Closed send monitor.
2020-10-24T17:06:55.907353Z 2 [ERROR] [MY-000000] [Galera] Requesting state transfer failed: -77(File descriptor in bad state)
2020-10-24T17:06:55.907541Z 3 [Note] [MY-000000] [Galera] gcomm: terminating thread
2020-10-24T17:06:55.907648Z 2 [ERROR] [MY-000000] [Galera] State transfer request failed unrecoverably: 77 (File descriptor in bad state). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
2020-10-24T17:06:55.907881Z 2 [Note] [MY-000000] [Galera] ReplicatorSMM::abort()
2020-10-24T17:06:55.907978Z 3 [Note] [MY-000000] [Galera] gcomm: joining thread
2020-10-24T17:06:55.908120Z 3 [Note] [MY-000000] [Galera] gcomm: closing backend
2020-10-24T17:06:55.908197Z 2 [Note] [MY-000000] [Galera] /usr/sbin/mysqld: Terminated.
2020-10-24T17:06:55.908327Z 2 [Note] [MY-000000] [WSREP] Initiating SST cancellation
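
For completeness, these are roughly the commands I used to bootstrap and to start/stop nodes (assuming the standard PXC 8 systemd units; the bootstrap unit name may differ per distribution):

systemctl start mysql@bootstrap.service    # bootstrap the first node only
systemctl start mysql                      # start the other nodes / restart node 1 normally
systemctl stop mysql                       # graceful stop before the failing restart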

Please see the configuration file of the first node:

[mysqld]

server-id = 1

pid-file	= /var/run/mysqld/mysqld.pid
socket		= /var/run/mysqld/mysqld.sock
datadir		= /var/lib/mysql
log-error	= /var/log/mysql/error.log
sql_mode	= STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION


# LOGGING #
# skip-log-bin
binlog_expire_logs_seconds=604800
log_error = /var/log/mysql/error.log
slow_query_log = 1
slow_query_log_file	= /var/log/mysql/slow.log
log_slow_rate_type = query
long_query_time = 3
log_slow_rate_limit	= 250
log_slow_verbosity	= full
slow_query_log_use_global_control = all
log-queries-not-using-indexes = 1

join-buffer-size = 1M
innodb_doublewrite=0

# Metrics 
performance_schema=ON

# INNODB SETTINGS

innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 8G
innodb-flush-log-at-trx-commit = 1
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 4G # max 80% of available memory
innodb-buffer-pool-instances   = 4 # Should be 1 per 1G of memory
innodb-flush-log-at-trx-commit = 2
innodb-thread-concurrency = 64

# CACHES AND LIMITS
tmp-table-size                 = 32M
max-heap-table-size            = 32M
thread-cache-size              = 50
open-files-limit               = 65535
table-definition-cache         = 1024
table-open-cache               = 2048

# TUNE UPS
# Skip reverse DNS lookup of clients
skip-name-resolve
userstat = 1
thread_pool_size = 16
thread_handling = 2 # Number of CPU cores

# SECURITY
# connection-control=FORCE_PLUS_PERMANENT
# connection-control-failed-login-attempts=FORCE_PLUS_PERMANENT
# connection_control_failed_connections_threshold=5 # Number of failed login attempts
# connection_control_min_connection_delay=5000 # Added delay after failed login attempts

# CLUSTER SETTINGS
wsrep_provider=/usr/lib/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.x.xx,192.168.x.63,192.168.x.xx,192.168.x.xx,192.168.x.xx
binlog_format=ROW
wsrep_slave_threads=16
wsrep_log_conflicts
innodb_autoinc_lock_mode=2
wsrep_cluster_name=motv_test
wsrep_node_name=sql1
wsrep_node_address=192.168.x.xx
pxc_strict_mode=DISABLED
wsrep_sst_method=xtrabackup-v2
pxc-encrypt-cluster-traffic=OFF
wsrep_provider_options="gcache.size=3G;gcache.page_size=1G;gcache.recover=yes"

What I did:

  1. Changed timeouts in the systemd service -> did not help
  2. Tried editing the service_startup_timeout=900 variable in /usr/bin/mysql-systemd and changed it to 9000 -> did not help

Both of these points came from Google, so I don't have any idea how to fix this issue.
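
For reference, the systemd timeout change from point 1 was roughly a drop-in override like the one below (the file path and values are only illustrative of what I tried, assuming the unit is called mysql.service), followed by systemctl daemon-reload:

# /etc/systemd/system/mysql.service.d/override.conf
[Service]
# 0 disables the start timeout so a long SST is not killed by systemd
TimeoutStartSec=0
TimeoutStopSec=900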

Has anyone run into the same or at least similar issues, or is this a bug?

Thanks a lot for answers and tips.

Martin

Hi @Nethe. First off, there's no need to restart your bootstrapped node back into "normal" mode. It looks like your first node has issues with the grastate.dat file. When you cleanly shut down a node, it should update this file with the correct UUID of the cluster and the sequence number. If this file has all zeros, then it must SST from another node.

Check your process again. Stop all nodes. Bootstrap the first node. Bring the other nodes online. Make sure you gracefully stop node1. Check the contents of grastate.dat and make sure there is a real UUID in there. Then start the node normally.
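
For reference, a healthy grastate.dat after a graceful shutdown looks something like the snippet below (the uuid and seqno are taken from your log purely as an illustration; your real values will differ):

# GALERA saved state
version: 2.1
uuid:    450e8f4b-1510-11eb-91bd-a6ab83d20e9a
seqno:   3627
safe_to_bootstrap: 0

If instead it shows uuid: 00000000-0000-0000-0000-000000000000 and seqno: -1, the node has lost its position and will request a full SST on the next start.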