Non-primary nodes cannot receive state

Hello all,

I’ve been running a 4-node MariaDB Galera cluster for about a month now (Ubuntu 12.04). Over the weekend I was changing the structure of quite a few tables and re-importing a bunch of data. During this, MySQL on one of the nodes stopped working. I reached a stopping point, shut everything down and restarted. Now none of the non-primary nodes can sync back with the primary node. For SST I had been using

wsrep_sst_method=xtrabackup

and in innobackup.backup.log I see the message

InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
and Percona Ireland Ltd 2009-2012. All Rights Reserved.

This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.

131021 09:56:04 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/mysql/my.cnf;mysql_read_default_group=xtrabackup;mysql_socket=/var/run/mysqld/mysqld.sock' as 'root' (using password: YES).
131021 09:56:04 innobackupex: Connected to MySQL server
IMPORTANT: Please check that the backup run completes successfully.
At the end of a successful backup run innobackupex
prints "completed OK!".

innobackupex: Using mysql server version 5.5.32-MariaDB-1~precise-log

innobackupex: Created backup directory /tmp

131021 09:56:04 innobackupex: Starting ibbackup with command: xtrabackup_55 --defaults-file="/etc/mysql/my.cnf" --defaults-group="mysqld" --backup --suspend-at-end --target-dir=/tmp --tmpdir=/tmp --stream=tar
innobackupex: Waiting for ibbackup (pid=7498) to suspend
innobackupex: Suspend file '/tmp/xtrabackup_suspended_2'

xtrabackup_55 version 2.1.4 for Percona Server 5.5.31 Linux (x86_64) (revision id: 657)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: using the following InnoDB configuration:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 104857600
xtrabackup: using O_DIRECT

log scanned up to (124686760868)
131021 9:56:04 InnoDB: Warning: allocated tablespace 2125, old maximum was 9
131021 9:56:04 InnoDB: Operating system error number 24 in a file operation.
InnoDB: Error number 24 means 'Too many open files'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
InnoDB: Error: could not open single-table tablespace file
InnoDB: ./xxx/yyy.ibd!
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
innobackupex: Error: ibbackup child process has died at /usr//bin/innobackupex line 416.

I tried updating /etc/security/limits.conf to raise the OS open-file limit, to no avail. I also tried removing the .ibd file as suggested in 2) above, but SST then just reports a problem with another table.

Question: is there a way for me to get a list of all tables that would cause SST to fail, without having to try to join the cluster from a non-primary node over and over again to go through one table at a time? There are a lot of tables in some of these databases, so I’d like to know what I’m getting myself into. What are the odds it will work in the end anyway? Anything else I should be trying? Or any other logs I can check to provide more info?

Any help is greatly appreciated, thank you!
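On the "any other logs" question: when SST is retried many times, it can help to pull just the OS error number and the tablespace file out of each innobackup.backup.log rather than reading the whole thing. Below is a minimal Python sketch of that, assuming the log uses exactly the message wording quoted above. The function name summarize_backup_log is made up for illustration and is not part of any tool.

```python
import re

# Patterns taken from the InnoDB messages quoted in the log above.
OS_ERROR_RE = re.compile(r"Operating system error number (\d+)")
TABLESPACE_HEADER = "could not open single-table tablespace file"
INNODB_LINE_RE = re.compile(r"^InnoDB:\s*(\S+?)!?\s*$")


def summarize_backup_log(text):
    """Return (os_error_numbers, tablespace_paths) found in the log text.

    Hypothetical helper: it only recognises the two messages shown above.
    """
    errors = []
    paths = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = OS_ERROR_RE.search(line)
        if match:
            errors.append(int(match.group(1)))
        if TABLESPACE_HEADER in line and i + 1 < len(lines):
            # The file name is printed on the following "InnoDB:" line,
            # with a trailing "!" that is stripped here.
            path_match = INNODB_LINE_RE.match(lines[i + 1].strip())
            if path_match:
                paths.append(path_match.group(1))
    return errors, paths
```

Calling summarize_backup_log(open(path).read()) on each attempt's log shows whether the error number stays at 24 while only the file name changes. If so, that would point at a limit being hit rather than at particular damaged files.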

Can you please send us output of:

xtrabackup --version
cat /etc/security/limits.conf
ulimit -a

Hi mixa,

$ xtrabackup --version
xtrabackup version 2.1.4 for Percona Server 5.1.70 unknown-linux-gnu (x86_64) (revision id: 657)

$ cat /etc/security/limits.conf

# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - an user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
#   for maxlogin limit
# - NOTE: group and wildcard limits are not applied to root.
#   To apply a limit to the root user, <domain> must be
#   the literal username root.
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
# - chroot - change root to directory (Debian-specific)
#
#<domain> <type> <item> <value>
#

#* soft core 0
#root hard core 100000
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#ftp - chroot /ftp
#@student - maxlogins 4

* soft nofile 24000
* hard nofile 32000

# End of file

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 128009
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 24000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 128009
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Thanks!
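A caveat on the output above: ulimit -a reports the limits of the shell it is run in, and limits.conf is applied to login sessions, so a mysqld started by the init system may be running with different values. On Linux the limits of a running process can be read from /proc/<pid>/limits. A small Python sketch for pulling out the "Max open files" row follows; the function names are made up for illustration.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because a limit may read 'unlimited'.
    Returns None if the row is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ["Max", "open", "files", <soft>, <hard>, "files"]
            return fields[3], fields[4]
    return None


def read_process_open_file_limits(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_max_open_files(handle.read())
```

Running read_process_open_file_limits() with the pid of mysqld shows what the server process actually has, which is the number that matters when comparing against the 24000 shown by the shell.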

So it looks like the file limit has been successfully raised (and we have fewer than 24000 tables). How about the version of xtrabackup?

It seems that the “InnoDB: Error: could not open single-table tablespace file” problem is not uncommon, but the reason behind it varies pretty widely. Could anyone help me understand what is happening in this case? The files that innobackup.backup.log cites, at least the ones I have checked, exist and are the same size on both the primary node and the node trying to receive state.

Also, the error does seem to be consistent, in that if a secondary node attempts to join the cluster, it will continue to complain about the same table. If on the primary node I

  • dump the table
  • ALTER TABLE {table} DISCARD TABLESPACE
  • DROP TABLE {table}
  • import table from dump

and then try to join cluster again from secondary node, it will complain about a different table… implying it is now okay with the table it was previously having problems with? Also would be interested to know what is happening here.

Is there some way I can figure out all the .ibd files that are in a bad state without having to attempt to start MySQL on the secondary node? If not, I reckon I could write a script that would just attempt to start MySQL over and over again and recover the tables on the primary node as it went. Should I try this? Should I try something else?

Thanks again,
Kevin

Having fewer than 24000 tables does not mean the 24k open files limit won’t be reached.
What are your innodb_open_files and open_files_limit set to? What if you increase the latter? (http://dev.mysql.com/doc/refman/5.5/en/server-options.html#option_mysqld_open-files-limit)
The message in the error log is clear about that:
“131021 9:56:04 InnoDB: Operating system error number 24 in a file operation.
InnoDB: Error number 24 means ‘Too many open files’.”
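To put a number on the point above, one option is to count the .ibd files under the data directory (the log shows xtrabackup working in /var/lib/mysql) and compare that with open_files_limit. With file-per-table tablespaces each table, and each partition of a partitioned table, has its own .ibd file, and the process also needs descriptors for log files and the like. The Python sketch below makes that comparison; the function names and the headroom value of 100 are assumptions for illustration, not recommended settings.

```python
import os


def count_ibd_files(datadir):
    """Count *.ibd tablespace files anywhere below datadir."""
    total = 0
    for _root, _dirs, files in os.walk(datadir):
        total += sum(1 for name in files if name.endswith(".ibd"))
    return total


def exceeds_limit(datadir, open_files_limit, headroom=100):
    """True if the .ibd count plus some headroom is above the limit.

    headroom is an arbitrary allowance for log files, sockets, etc.
    """
    return count_ibd_files(datadir) + headroom > open_files_limit
```

For example, exceeds_limit("/var/lib/mysql", limit) with limit set to the server's current open_files_limit gives a rough yes/no. If it comes back True, raising open_files_limit in my.cnf as suggested above is the thing to try.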

Ha! That did it! Thanks a million, przemek!