Can't restart node after brutal shutdown in 3 nodes pxc 57 - systemd related ?

Hi,

So let’s start with my setup:

  • 3 nodes pxc cluster (up to date CentOS 7 and Percona-XtraDB-Cluster-57 5.7.19-29.22.1.el7)
    mysql-test-01 10.0.0.51
    mysql-test-02 10.0.0.52
    mysql-test-03 10.0.0.53
  • I have followed all the guidelines for CentOS (setenforce 0, …)
  • I bootstraped the cluster from the node “mysql-test-01” 10.0.0.51 using “systemctl start mysql@bootstrap.service
  • Other nodes joined the cluster without problem
  • The cluster was working

I did some tests, I gracefully shutdown node “10.0.0.53”, the cluster kept working, I restarted it with “systemctl start mysql”. It rejoined the cluster and recovered.
I did the same thing (graceful shutdown) with node “10.0.0.51”. Everything OK

The problem occured when I “brutally” shutdown the node “10.0.0.51”. I then started again the virtual machine, and then I use “systemctl start mysql.service” to start the node. It should start and automatically recover but it’s not working. I am pretty sure this is related to “systemd” but I am confused because I didn’t touch the default configuration that come with the package. I was hoping a working systemd configuration.

More informations:

configuration of node “10.0.0.51”

​​​​​​​
---------- /etc/my.cnf ----------

#
# The Percona XtraDB Cluster 5.7 configuration file.
#
#
# * IMPORTANT: Additional settings that can override those from this file!
# The files must end with '.cnf', otherwise they'll be ignored.
# Please make any edits and changes to the appropriate sectional files
# included below.
#
!includedir /etc/my.cnf.d/
!includedir /etc/percona-xtradb-cluster.conf.d/
----------------------------------------

---------- /etc/percona-xtradb-cluster.conf.d/wsrep.cnf ----------

[mysqld]
# Path to Galera library
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so

# Cluster connection URL contains IPs of nodes
#If no IP is found, this implies that a new cluster needs to be created,
#in order to do that you need to bootstrap this node
wsrep_cluster_address=gcomm://10.0.0.51,10.0.0.52,10.0.0.53

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW

# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB

# Slave thread to use
wsrep_slave_threads= 8

wsrep_log_conflicts

# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2

# Node IP address
wsrep_node_address=10.0.0.51
# Cluster name
wsrep_cluster_name=pxc-cluster-test

#If wsrep_node_name is not specified, then system hostname will be used
wsrep_node_name=mysql-test-01

#pxc_strict_mode allowed values: DISABLED,PERMISSIVE,ENFORCING,MASTER
pxc_strict_mode=ENFORCING

# SST method
wsrep_sst_method=xtrabackup-v2

#Authentication for SST method
wsrep_sst_auth="sstuser:password"
----------------------------------------

---------- /etc/percona-xtradb-cluster.conf.d/mysqld.cnf ----------

# MYSQL-TEST-01
# Template my.cnf for PXC
# Edit to your requirements.
[client]
socket=/var/lib/mysql/mysql.sock

[mysqld]
server-id=51
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
log-bin
log_slave_updates
expire_logs_days=7

# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

# Custom config
max-allowed-packet=64M

#
# InnoDB settings
#
innodb_file_per_table
innodb_buffer_pool_size=1024M
innodb_log_file_size=256M
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
--------------------------------------------------

And finally an extract from systemd journalctl

----------------------------------------
[root@mysql-test-01 ~]# systemctl start mysql.service
Job for mysql.service failed because the control process exited with error code. See "systemctl status mysql.service" and "journalctl -xe" for details.
[root@mysql-test-01 ~]# journalctl -u mysql.service
....
Oct 02 17:21:39 mysql-test-01 systemd[1]: Starting Percona XtraDB Cluster...
Oct 02 17:21:40 mysql-test-01 mysqld_safe[2761]: 2017-10-02T15:21:40.081340Z mysqld_safe Logging to '/var/log/mysqld.log'.
Oct 02 17:21:40 mysql-test-01 mysqld_safe[2761]: 2017-10-02T15:21:40.085011Z mysqld_safe Logging to '/var/log/mysqld.log'.
Oct 02 17:21:40 mysql-test-01 mysqld_safe[2761]: 2017-10-02T15:21:40.119964Z mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Oct 02 17:21:40 mysql-test-01 mysqld_safe[2761]: 2017-10-02T15:21:40.140406Z mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.IG3oHj' --pid-file='/var/lib/mysql/mysql-test-01-recover.pid'
Oct 02 17:21:43 mysql-test-01 mysql-systemd[2762]: /usr/bin/mysql-systemd: line 140: kill: (2761) - No such process
Oct 02 17:21:43 mysql-test-01 mysql-systemd[2762]: ERROR! mysqld_safe with PID 2761 has already exited: FAILURE
Oct 02 17:21:43 mysql-test-01 systemd[1]: mysql.service: control process exited, code=exited status=1
Oct 02 17:21:43 mysql-test-01 mysql-systemd[3332]: WARNING: mysql pid file /var/lib/mysql/mysqld.pid empty or not readable
Oct 02 17:21:43 mysql-test-01 mysql-systemd[3332]: ERROR! mysql already dead
Oct 02 17:21:43 mysql-test-01 systemd[1]: mysql.service: control process exited, code=exited status=2
----------------------------------------

I have read a lot about systemd problems with mysql and pxc, but I can’t find a clue about mine. I am stuck because I have simulated a failure and I can’t restart my cluster, it’s a dev environnement but I won’t continue further until I find a solution.

Any ideas ?

​​​​​​​Regards,

1 Like

Just a complement with this article: [url]https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/[/url]

I have reproduced scenario 4 but the node won’t restart.

1 Like

So I am trying to manually debug this problem and I have noticed two things.

  1. Always on CentOS 7 Percona-XtraDB-Cluster-57 5.7.19-29.22.1.el7, in mysqld_safe, line 296
eval_log_error "$mysqld_cmd --wsrep_recover $wr_options"

Is it correct ? I am a bit confused about the options syntax so would it not be better like this

eval_log_error "$mysqld_cmd _wsrep_recover $wr_options"

or this

eval_log_error "$mysqld_cmd --wsrep-recover $wr_options"

I put a lot of debug print in mysqld_safe and mysql-systemd
I can manually restart the cluster and it seems to recover well with this command line:

$mysqld_cmd line 296
nohup /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera3/libgalera_smm.so --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock

But when the “wsrep” options are added to the command line in the mysqld_safe script, it does not work after

$cmd line 218
nohup /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera3/libgalera_smm.so --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --wsrep-recover --log_error='/var/lib/mysql/wsrep_recovery.PUR9Wb' --pid-file='/var/lib/mysql/mysql-test-01-recover.pid' > /dev/null 2>&1

And we can also notice here that the mysqld_safe script generate a command line with two options “–pid-file” and two options “–log-error”.

Regards,

1 Like

After several hours, I finally found out the problem. Spoiler alert: it’s stupid

There is a bad integer comparisons in the “mysqld_safe” script.


[root@mysql-test-01 ~]# diff /root/mysqld_safe-BAD /usr/bin/mysqld_safe
220c220
< if [ $ret > 0 ]; then
---
> if [ "$ret" -gt "0" ]; then

Now it’s working fine after a brutal shutdown.

Should I report a bug somewhere in order to correct the issue in futur releases ?

Regards,

1 Like

Thanks vleclere.

You can report the issue here [url]https://bugs.launchpad.net/percona-xtradb-cluster/[/url]

1 Like

Well found ‘vleclere:slight_smile:
Bug not reported from what I can tell, but at least you have found a solution and documented it here, many thanks for doing that.

To reproduce:
[LIST]
[]I Installed Virtualbox - (I was making a Dev server)
[
]I Installed a minimal - CentOS Linux release 7.4.1708 (Core) - virtual machine - or just a regular server.
[]Install latest Percona repo following docs
[LIST]
[
]
[/LIST]
[]Install Percona XtraDB Cluster 5.7
[LIST]
[
]
[/LIST]
[]Do what the install requests (I missed this the first time)
[LIST]
[
]
[]
[
]To see all the new my config files and folders optionally do the below
[]
[/LIST]
[
]Start mysql now with systemctl / systemd (bootstrap or don’t I guess)
[LIST]
[]
[/LIST]
[
]Get root password
[LIST]
[]
[/LIST]
[
]
[LIST]
[]enter password
[/LIST]
[
]Change mysql password
[LIST]
[]
[
]
[]restart mysql (or stop start)
[LIST]
[
]
[/LIST]
[/LIST]
[]Maybe or maybe don’t add a database and or insert some data. I did, and also when this bug hit did alot of my.cnf tweaks and /usr/lib/systemd/system/mysql.service / systemctl daemon-reload tweaks to no avail.
[
]As vleclere put it so well CREATE a brutal shutdown. Kill the vm or server without a graceful shutdown.
[LIST]
[]Power off the virtual machine or server as if the power failed without a backup UPS. So on virtual box i think that is the power off option without CentOS cleanly shutting down all servers like Percona / mysql. Rather common on a Windows machine as the virtual box is killed if you get a bluescreen of death or a Windows 10 surprise update.
[/LIST]
[
]Try and start up Percona/mysql after boot and you should get the failure as further below, if not I have possibly left something out - break it… :slight_smile:
[/LIST]
Just to clarify the solution for others that have this problem. Big thanks to vleclere.

then you edit +/- line 220 or search for “”

change from

change to
; then

Error message was:

journalctl -xe OR systemctl statusmysql.service
Oct 15 14:22:39 reg mysql-systemd[8330]: /usr/bin/mysql-systemd: line 140: kill: (8329) - No such process
Oct 15 14:22:39 reg mysql-systemd[8330]: ERROR! mysqld_safe with PID 8329 has already exited: FAILURE
Oct 15 14:22:39 reg systemd[1]: mysql.service: control process exited, code=exited status=1
Oct 15 14:22:39 reg mysql-systemd[8505]: WARNING: mysql pid file /var/lib/mysql/reg.resafrica-reg.com.pid empty or not readable
Oct 15 14:22:39 reg mysql-systemd[8505]: ERROR! mysql already dead
Oct 15 14:22:39 reg systemd[1]: mysql.service: control process exited, code=exited status=2
Oct 15 14:22:39 reg mysql-systemd[8547]: WARNING: mysql pid file /var/lib/mysql/reg.resafrica-reg.com.pid empty or not readable
Oct 15 14:22:39 reg mysql-systemd[8547]: WARNING: mysql may be already dead
Oct 15 14:22:39 reg systemd[1]: Failed to start Percona XtraDB Cluster.
– Subject: Unit mysql.service has failed

Regards
Rory

1 Like

I still have this issue and I have changed the /usr/bin/mysqld_safe at line 220

I found a way to start the node:

  1. Run

/usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/galera3/libgalera_smm.so --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock

  1. ps aux | grep mysql and kill the proccess, wait to shutdown

  2. service mysql start

Also, service mysql start as root and make the mysql.pid as root, to fix this you need to modify /usr/lib/systemd/system/mysql.service and put

User=mysql
Group=mysql

like this:

[Service]

Needed to create system tables etc.

ExecStartPre=/usr/bin/mysql-systemd start-pre

User=mysql
Group=mysql

When do you have in mind to fix this?

Thanks.

1 Like

Hello,

I’ve reported this issue on launchpad.

Full credit to vleclere

Thanks
Chris

1 Like

hello.
can i get something for fix this proplem. tks

1 Like