Freshly installed PXC 5.5.41-37.0 box aborts with no diagnostics after successful SST

Hi there,

I’ve inherited a Galera cluster and am bringing it up to the current 5.5.41-37.0 point release from 5.5.34-25.9. The two Debian based nodes upgraded fine, but the 3rd, a RHEL6 based box (OS due to circumstances outside of our control) is having no end of problems.

First of all we hit a problem where the previously configured SST method (rsync) would time out during the transfer, aborting the SST.

Switching to xtrabackup-v2 with xbstream solved that problem, but now after the successful SST I just see:

[...]
Jun 8 11:43:54 hostname -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst
Jun 8 11:43:54 hostname -wsrep-sst-joiner: Galera co-ords from recovery: 874d7052-1e62-11e3-aa8d-97e571a97d4a:126724943
Jun 8 11:43:54 hostname -wsrep-sst-joiner: Total time on joiner: 0 seconds
Jun 8 11:43:54 hostname -wsrep-sst-joiner: Removing the sst_in_progress file
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 [Note] WSREP: SST complete, seqno: 126724943
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 [Note] Plugin 'FEDERATED' is disabled.
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 InnoDB: The InnoDB memory heap is disabled
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 InnoDB: Mutexes and rw_locks use GCC atomic builtins
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 InnoDB: Compressed tables use zlib 1.2.3
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 InnoDB: Using Linux native AIO
Jun 8 11:43:54 hostname mysqld: 160608 11:43:54 InnoDB: Initializing buffer pool, size = 22.0G
Jun 8 11:43:55 hostname mysqld: 160608 11:43:55 InnoDB: Completed initialization of buffer pool
Jun 8 11:43:55 hostname mysqld: 160608 11:43:55 InnoDB: highest supported file format is Barracuda.
Jun 8 11:43:57 hostname mysqld: 160608 11:43:57 InnoDB: Waiting for the background threads to start
Jun 8 11:43:58 hostname mysqld: 160608 11:43:58 Percona XtraDB (http://www.percona.com) 5.5.41-37.0 started; log sequence number 876619301144
Jun 8 11:43:58 hostname mysqld: 160608 11:43:58 [ERROR] Aborting
Jun 8 11:43:58 hostname mysqld:
Jun 8 11:44:00 hostname mysqld: 160608 11:44:00 [Note] WSREP: Closing send monitor...
Jun 8 11:44:00 hostname mysqld: 160608 11:44:00 [Note] WSREP: Closed send monitor.
[...]

That’s all that’s logged in the syslog info (mysqld_safe is set to use syslog in the config files).

I’ve run innochecksum on all the InnoDB files pulled over by the SST and they look fine.

As this node is provisioned via xCAT I’ve reinstalled it to see if there was something odd locally and it breaks in the same way.

Basically as far as I can tell all I have so far to go on is:

Jun 8 11:43:58 hostname mysqld: 160608 11:43:58 [ERROR] Aborting

which isn’t a great deal to go on, to say the least. :-/

Any ideas on how to debug this please?

All the best,
Chris

Chris,

Can you post more of messages, before
11:43:54 hostname -wsrep-sst-joiner: Move successful, removing /var/lib/mysql//.sst If you can post from the moment mysqld started, that would be helpful.
Is there any messages in dmesg?

Any reason you are still using 5.5? I’d try with 5.6 too

Hiya,

Interesting, I’d been concentrating on the lines:

Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/messages.ibd to /var/lib/mysql/gitorious/messages.ibd
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] ...done
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/taggings.ibd to /var/lib/mysql/gitorious/taggings.ibd
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] ...done
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/merge_request_versions.frm to /var/lib/mysql/gitorious/merge_request_versions.frm
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] ...done
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/merge_requests.frm to /var/lib/mysql/gitorious/merge_requests.frm
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] ...done
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/groups.frm to /var/lib/mysql/gitorious/groups.frm
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] ...done
Jun 8 11:43:53 barcoo-test -innobackupex-move: 160608 11:43:53 [01] Moving ./gitorious/archived_events.frm to /var/lib/mysql/gitorious/archived_events.frm

but interestingly the line immediately after that and before the “Move successful” one is:

Jun 8 11:43:53 barcoo-test rsyslogd-2177: imuxsock begins to drop messages from pid 17357 due to rate-limiting

So I’ll redo this SST sending the logs elsewhere so that’s not an issue. I wonder what that’s hiding? Should have spotted it before… :frowning:

Why not 5.6? Because I’ve inherited this and don’t want to make any big changes whilst it’s in production. Plus having 1 node down out of a 3 node cluster means bad things will happen if I take another node out to try and do a rolling upgrade. :frowning:

All the best,
Chris

OK rerun finished, but it appears it doesn’t capture the xtrabackup output as before, so all I see is it clean up the /var/lib/mysql directory first and then it logs:

removed directory: `/var/lib/mysql/rtdb'
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20160609 13:40:00.588)
160609 13:54:51 [Note] WSREP: 0 (db3): State transfer to 1 (barcoo-test) complete.
160609 13:54:51 [Note] WSREP: Member 0 (db3) synced with group.
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql//.sst (20160609 13:55:05.014)
WSREP_SST: [INFO] Evaluating innobackupex --no-version-check --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20160609 13:55:05.018)
rm: cannot remove `/var/lib/mysql//innobackup.prepare.log': No such file or directory
rm: cannot remove `/var/lib/mysql//innobackup.move.log': No such file or directory
WSREP_SST: [INFO] Moving the backup to /var/lib/mysql/ (20160609 13:55:18.729)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --no-version-check --move-back --force-non-empty-directories ${DATA} &>${DATA}/innobackup.move.log (20160609 13:55:18.733)
WSREP_SST: [INFO] Move successful, removing /var/lib/mysql//.sst (20160609 13:55:18.823)
WSREP_SST: [INFO] Galera co-ords from recovery: 874d7052-1e62-11e3-aa8d-97e571a97d4a:126804399 (20160609 13:55:18.832)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20160609 13:55:18.836)
WSREP_SST: [INFO] Removing the sst_in_progress file (20160609 13:55:18.839)
160609 13:55:18 [Note] WSREP: SST complete, seqno: 126804399
160609 13:55:18 [Note] Plugin 'FEDERATED' is disabled.
160609 13:55:18 InnoDB: The InnoDB memory heap is disabled
160609 13:55:18 InnoDB: Mutexes and rw_locks use GCC atomic builtins
160609 13:55:18 InnoDB: Compressed tables use zlib 1.2.3
160609 13:55:18 InnoDB: Using Linux native AIO
160609 13:55:18 InnoDB: Initializing buffer pool, size = 22.0G
160609 13:55:20 InnoDB: Completed initialization of buffer pool
160609 13:55:20 InnoDB: highest supported file format is Barracuda.
160609 13:55:22 InnoDB: Waiting for the background threads to start
160609 13:55:23 Percona XtraDB (http://www.percona.com) 5.5.41-37.0 started; log sequence number 877804777070
160609 13:55:23 [ERROR] Aborting

160609 13:55:25 [Note] WSREP: Closing send monitor...
160609 13:55:25 [Note] WSREP: Closed send monitor.
160609 13:55:25 [Note] WSREP: gcomm: terminating thread
160609 13:55:25 [Note] WSREP: gcomm: joining thread
160609 13:55:25 [Note] WSREP: gcomm: closing backend
160609 13:55:25 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
[...]

I’ve got 2 hours of meetings now so I might not get a chance to try and redo this again until tomorrow now. :frowning:

Thanks!
Chris

So I decided to downgrade the packages to the previous versions (we’re installing from a local repo) and I can confirm that it SST’s correctly and synchronises with the rest of the cluster successfully. We’re going to try and spin up a Debian 7 VM with the same packages as our existing Debian physical nodes to see if that has the same issue as the RHEL box.

Here’s the log from the older version as it successfully SST’s. Do note that to downgrade back to this version I had to do:

rpm -e --nodeps mysql-libs-5.1.73-5.el6_6.x86_64

…and then install the compatibility libs as there’s a conflict between the PXC RPMs and the RHEL6 ones in this version - it’s not necessary for the latest 5.5 and in fact can’t be done as you’ll end up with missing mysql libraries with version 16 which things like Postfix rely on.

Jun 9 15:33:43 hostname mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Jun 9 15:33:43 hostname mysqld_safe: Skipping wsrep-recover for empty datadir: /var/lib/mysql
Jun 9 15:33:43 hostname mysqld_safe: Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Read nil XID from storage engines, skipping position init
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: wsrep_load(): Galera 2.8(r165) by Codership Oy <info&#64;codership.com> loaded successfully.
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Warning] WSREP: Could not open saved state file for reading: /var/lib/mysql//grastate.dat
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Preallocating 134219048/134219048 bytes in '/var/lib/mysql//galera.cache'...
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Passing config to GCS: base_host = 10.3.0.60; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: wsrep_sst_grab()
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Start replication
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: protonet asio version 0
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: backend: asio
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: GMCast version 0
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: (bb0789d6-2e03-11e6-85f8-9e0977225df2, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: (bb0789d6-2e03-11e6-85f8-9e0977225df2, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: EVS version 0
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: PC version 0
Jun 9 15:33:43 hostname mysqld: 160609 15:33:43 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.11.0.40:,10.11.0.41:'
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: declaring 45d51314-2b98-11e6-b605-d2ac634302f6 stable
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: declaring d41e23c0-2b9b-11e6-80ca-7bb6c4e547d6 stable
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Node 45d51314-2b98-11e6-b605-d2ac634302f6 state prim
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: view(view_id(PRIM,45d51314-2b98-11e6-b605-d2ac634302f6,122) memb {
Jun 9 15:33:44 hostname mysqld: #01145d51314-2b98-11e6-b605-d2ac634302f6,
Jun 9 15:33:44 hostname mysqld: #011bb0789d6-2e03-11e6-85f8-9e0977225df2,
Jun 9 15:33:44 hostname mysqld: #011d41e23c0-2b9b-11e6-80ca-7bb6c4e547d6,
Jun 9 15:33:44 hostname mysqld: } joined {
Jun 9 15:33:44 hostname mysqld: } left {
Jun 9 15:33:44 hostname mysqld: } partitioned {
Jun 9 15:33:44 hostname mysqld: })
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: gcomm: connected
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Opened channel 'my_wsrep_cluster'
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Waiting for SST to complete.
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: STATE EXCHANGE: sent state msg: bb547472-2e03-11e6-a890-52d118c26d50
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: STATE EXCHANGE: got state msg: bb547472-2e03-11e6-a890-52d118c26d50 from 0 (db3)
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: STATE EXCHANGE: got state msg: bb547472-2e03-11e6-a890-52d118c26d50 from 2 (db4)
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: STATE EXCHANGE: got state msg: bb547472-2e03-11e6-a890-52d118c26d50 from 1 (hostname)
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Quorum results:
Jun 9 15:33:44 hostname mysqld: #011version = 2,
Jun 9 15:33:44 hostname mysqld: #011component = PRIMARY,
Jun 9 15:33:44 hostname mysqld: #011conf_id = 120,
Jun 9 15:33:44 hostname mysqld: #011members = 2/3 (joined/total),
Jun 9 15:33:44 hostname mysqld: #011act_id = 126813177,
Jun 9 15:33:44 hostname mysqld: #011last_appl. = -1,
Jun 9 15:33:44 hostname mysqld: #011protocols = 0/4/2 (gcs/repl/appl),
Jun 9 15:33:44 hostname mysqld: #011group UUID = 874d7052-1e62-11e3-aa8d-97e571a97d4a
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Flow-control interval: [28, 28]
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 126813177)
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: State transfer required:
Jun 9 15:33:44 hostname mysqld: #011Group state: 874d7052-1e62-11e3-aa8d-97e571a97d4a:126813177
Jun 9 15:33:44 hostname mysqld: #011Local state: 00000000-0000-0000-0000-000000000000:-1
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Note] WSREP: New cluster view: global state: 874d7052-1e62-11e3-aa8d-97e571a97d4a:126813177, view# 121: Primary, number of nodes: 3, my index: 1, protocol version 2
Jun 9 15:33:44 hostname mysqld: 160609 15:33:44 [Warning] WSREP: Gap in state sequence. Need state transfer.
Jun 9 15:33:46 hostname mysqld: 160609 15:33:46 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.3.0.60' --auth 'vlscisst:TooF3agh' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4736''
Jun 9 15:33:47 hostname mysqld: WSREP_SST: [INFO] Streaming with xbstream (20160609 15:33:47.042)
Jun 9 15:33:47 hostname mysqld: WSREP_SST: [INFO] Using socat as streamer (20160609 15:33:47.045)
Jun 9 15:33:47 hostname mysqld: WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[&#64;]} ) (20160609 15:33:47.065)
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.3.0.60:4444/xtrabackup_sst
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: Assign initial position for certification: 126813177, protocol version: 2
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (874d7052-1e62-11e3-aa8d-97e571a97d4a): 1 (Operation not permitted)
Jun 9 15:33:47 hostname mysqld: #011 at galera/src/replicator_str.cpp:prepare_for_IST():445. IST will be unavailable.
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: Node 1 (hostname) requested state transfer from '*any*'. Selected 0 (db3)(SYNCED) as donor.
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 126813177)
Jun 9 15:33:47 hostname mysqld: 160609 15:33:47 [Note] WSREP: Requesting state transfer: success, donor: 0
Jun 9 15:33:47 hostname mysqld: WSREP_SST: [INFO] Proceeding with SST (20160609 15:33:47.996)
Jun 9 15:33:48 hostname mysqld: WSREP_SST: [INFO] Cleaning the existing datadir (20160609 15:33:47.999)
Jun 9 15:33:48 hostname mysqld: WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[&#64;]} ) (20160609 15:33:48.006)

This section was too long to fit into the previous comment, so tagging it on here… :wink:

Please note that all that has changed are the RPMs, the config files are pushed out by xCAT and so are identical between the two attempts.

The RPM’s that are now installed on the box are:

Percona-XtraDB-Cluster-galera-2-2.8-1.157.rhel6.x86_64
percona-toolkit-2.2.11-1.noarch
Percona-XtraDB-Cluster-shared-55-5.5.34-25.9.607.rhel6.x86_64
Percona-Server-shared-compat-5.1.68-rel14.6.551.rhel6.x86_64
percona-xtrabackup-2.3.4-1.el6.x86_64
Percona-XtraDB-Cluster-client-55-5.5.34-25.9.607.rhel6.x86_64
Percona-XtraDB-Cluster-server-55-5.5.34-25.9.607.rhel6.x86_64

All the best,
Chris

It turns out that a Debian VM built with the PXC Debian packages for the same release, with the same configuration (except innodb_buffer_pool_size shrunk to fit in the VM’s memory space), works OK. :-/