Percona cluster - one server crashed - memory bug?

I have three servers in a cluster, all running Percona XtraDB Cluster 5.6.22-72.0-56 on Ubuntu 12.04. They are Xeon systems with ECC RAM and RAID 5 arrays, 32GB RAM each, so a hardware issue seems unlikely. This is what the error log said:

18:19:44 UTC - mysqld got signal 11 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. Please help us make Percona XtraDB Cluster better by reporting any bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=37
max_threads=153
thread_count=19
connection_count=2
It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 69252 K bytes of memory Hope that’s ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f2d18000990 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong…

stack_bottom = 7f30005e0a70 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8e811e]
/usr/sbin/mysqld(handle_fatal_signal+0x392)[0x65ffa2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f303d3e5cb0]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification16purge_for_trx_v3EPNS_9TrxHandleE+0xa0)[0x7f302225a0f0]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification16purge_trxs_upto_Elb+0x158)[0x7f302225b8c8]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM18process_commit_cutEll+0x85)[0x7f3022288215]
/usr/lib/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x405)[0x7f3022269d75]
/usr/lib/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x5e)[0x7f302226a8ee]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x78)[0x7f302228f958]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7f30222a4c8e]
/usr/sbin/mysqld[0x5a491c]
/usr/sbin/mysqld(start_wsrep_THD+0x287)[0x58d247]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f303d3dde9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f303c8f88bd]

Trying to get some variables. Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 12
Status: NOT_KILLED

When I try to restart MySQL on this server, the SST fails.
I accidentally started it in bootstrap mode and it ran for about a minute by itself, causing some writes to the databases. Then I edited my.cnf and restarted the mysql service, expecting a full SST, but got this error:

2015-04-28 12:53:54 45962 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role ‘joiner’ --address ‘10.X.X.X’ --auth ‘sstuser:sdfgdfghdry56’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/mysql/my.cnf’ --parent ‘45962’ ‘’ ’
WSREP_SST: [INFO] Streaming with tar (20150428 12:53:55.080)
WSREP_SST: [INFO] Using socat as streamer (20150428 12:53:55.081)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | tar xfi - --recursive-unlink -h; RC=( ${PIPESTATUS[@]} ) (20150428 12:53:55.110)
2015-04-28 12:53:55 45962 [Note] WSREP: Prepared SST request: xtrabackup|10.x.X.X:4444/xtrabackup_sst
2015-04-28 12:53:55 45962 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-04-28 12:53:55 45962 [Note] WSREP: REPL Protocols: 5 (3, 1)
2015-04-28 12:53:55 45962 [Note] WSREP: Assign initial position for certification: 14184918, protocol version: 3
2015-04-28 12:53:55 45962 [Note] WSREP: Service thread queue flushed.
2015-04-28 12:53:55 45962 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.x.x.x:4568
2015-04-28 12:53:55 45962 [Note] WSREP: Node 1.0 (perc1) requested state transfer from ‘any’. Selected 0.0 (dxss3)(SYNCED) as donor.
2015-04-28 12:53:55 45962 [Note] WSREP: Shifting PRIMARY → JOINER (TO: 14184918)
2015-04-28 12:53:55 45962 [Note] WSREP: Requesting state transfer: success, donor: 0
2015-04-28 12:53:55 45962 [Note] WSREP: 0.0 (dxss3): State transfer to 1.0 (dxss1) complete.
2015-04-28 12:53:55 45962 [Note] WSREP: Member 0 (dxss3) synced with group.
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150428 12:53:55.425)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20150428 12:53:55.427)
WSREP_SST: [INFO] Removing the sst_in_progress file (20150428 12:53:55.428)
2015-04-28 12:53:55 45962 [Note] WSREP: SST complete, seqno: 14119148
2015-04-28 12:53:55 45962 [Note] Plugin ‘FEDERATED’ is disabled.
2015-04-28 12:53:55 45962 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-04-28 12:53:55 45962 [Note] InnoDB: The InnoDB memory heap is disabled
2015-04-28 12:53:55 45962 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-04-28 12:53:55 45962 [Note] InnoDB: Memory barrier is not used
2015-04-28 12:53:55 45962 [Note] InnoDB: Compressed tables use zlib 1.2.3.4
2015-04-28 12:53:55 45962 [Note] InnoDB: Using Linux native AIO
2015-04-28 12:53:55 45962 [Note] InnoDB: Using CPU crc32 instructions
2015-04-28 12:53:55 45962 [Note] InnoDB: Initializing buffer pool, size = 10.0G
2015-04-28 12:53:55 45962 [Note] InnoDB: Completed initialization of buffer pool
2015-04-28 12:53:55 45962 [Note] InnoDB: Highest supported file format is Barracuda.
2015-04-28 12:53:56 45962 [Note] InnoDB: 128 rollback segment(s) are active.
2015-04-28 12:53:56 45962 [Note] InnoDB: Waiting for purge to start
2015-04-28 12:53:56 45962 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.22-72.0 started; log sequence number 30463102736
2015-04-28 12:53:56 45962 [Note] RSA private key file not found: /var/lib/mysql//private_key.pem. Some authentication plugins will not work.
2015-04-28 12:53:56 45962 [Note] RSA public key file not found: /var/lib/mysql//public_key.pem. Some authentication plugins will not work.
2015-04-28 12:53:56 45962 [Note] Server hostname (bind-address): ‘*’; port: 3306
2015-04-28 12:53:56 45962 [Note] IPv6 is available.
2015-04-28 12:53:56 45962 [Note] - ‘::’ resolves to ‘::’;
2015-04-28 12:53:56 45962 [Note] Server socket created on IP: ‘::’.
2015-04-28 12:53:56 45962 [Note] Event Scheduler: Loaded 0 events
2015-04-28 12:53:56 45962 [Note] WSREP: Signalling provider to continue.
2015-04-28 12:53:56 45962 [Note] WSREP: inited wsrep sidno 1
2015-04-28 12:53:56 45962 [Note] WSREP: SST received: 5b18cbf7-sdfgsdfgsdfg8379-11e3-92395df271:14119148
2015-04-28 12:53:56 45962 [Note] WSREP: Receiving IST: 65770 writesets, seqnos 14119148-14184918
2015-04-28 12:53:56 45962 [Note] /usr/sbin/mysqld: ready for connections.
Version: ‘5.6.22-72.0-56’ socket: ‘/var/run/mysqld/mysqld.sock’ port: 3306 Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150
2015-04-28 12:53:56 45962 [ERROR] Slave SQL: Could not execute Delete_rows event on table dxss.codesecrets; Can’t find record in ‘codesecrets’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log FIRST, end_log_pos 213, Error_code: 1032
2015-04-28 12:53:56 45962 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 14119181
2015-04-28 12:53:56 45962 [Warning] WSREP: Failed to apply app buffer: seqno: 14119181, status: 1
at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2015-04-28 12:53:56 45962 [ERROR] Slave SQL: Could not execute Delete_rows event on table dxss.codesecrets; Can’t find record in ‘codesecrets’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log FIRST, end_log_pos 213, Error_code: 1032
2015-04-28 12:53:56 45962 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 14119181
2015-04-28 12:53:56 45962 [Warning] WSREP: Failed to apply app buffer: seqno: 14119181, status: 1
at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2015-04-28 12:53:56 45962 [ERROR] Slave SQL: Could not execute Delete_rows event on table dxss.codesecrets; Can’t find record in ‘codesecrets’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log FIRST, end_log_pos 213, Error_code: 1032
2015-04-28 12:53:56 45962 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 14119181
2015-04-28 12:53:56 45962 [Warning] WSREP: Failed to apply app buffer: seqno: 14119181, status: 1
at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2015-04-28 12:53:56 45962 [ERROR] Slave SQL: Could not execute Delete_rows event on table dxss.codesecrets; Can’t find record in ‘codesecrets’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log FIRST, end_log_pos 213, Error_code: 1032
2015-04-28 12:53:56 45962 [Warning] WSREP: RBR event 3 Delete_rows apply warning: 120, 14119181
2015-04-28 12:53:56 45962 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply trx 14119181 4 times
2015-04-28 12:53:56 45962 [Note] WSREP: Closing send monitor…
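
To make the numbers in the log above easier to read, here is a minimal Python sketch that pulls out the IST range and the writeset that could not be applied. The three sample lines are copied from the log; the function name `summarize_ist` and the field names are made up for illustration, and the regular expressions only assume the message wording shown here.

```python
import re

# Three lines copied verbatim from the error log above.
LOG = """\
2015-04-28 12:53:56 45962 [Note] WSREP: Receiving IST: 65770 writesets, seqnos 14119148-14184918
2015-04-28 12:53:56 45962 [Warning] WSREP: Failed to apply app buffer: seqno: 14119181, status: 1
2015-04-28 12:53:56 45962 [ERROR] WSREP: receiving IST failed, node restart required: Failed to apply trx 14119181 4 times
"""


def summarize_ist(log_text):
    """Extract the IST seqno range and the writeset that failed to apply.

    Returns None if either message is missing from the text.
    """
    ist = re.search(r"Receiving IST: (\d+) writesets, seqnos (\d+)-(\d+)", log_text)
    fail = re.search(r"Failed to apply trx (\d+) (\d+) times", log_text)
    if ist is None or fail is None:
        return None
    count, first, last = map(int, ist.groups())
    failed_seqno, attempts = map(int, fail.groups())
    return {
        "writesets": count,
        "first_seqno": first,
        "last_seqno": last,
        # Sanity check: the advertised count should equal the range width.
        "count_matches_range": last - first == count,
        "failed_seqno": failed_seqno,
        "attempts": attempts,
        # How far into the IST range the failing writeset sits.
        "offset_into_ist": failed_seqno - first,
    }


if __name__ == "__main__":
    for key, value in summarize_ist(LOG).items():
        print(f"{key}: {value}")
```

For this log the failing writeset sits 33 seqnos into a range of 65770, so the apply broke almost immediately. One possible reading, not confirmed by anything in the log, is that the local data had already diverged from the cluster, for example through the writes made while the node ran bootstrapped.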

It seems I must delete the local InnoDB data files and somehow force a complete SST. The other two systems in the cluster are running fine.

Thanks
Francois

Still trying to join the cluster. I deleted all GRA and grastate files from /var/lib/mysql, which forces an SST instead of an IST. I deleted the related database directories as well and tried to join again. The joiner gives me this error:

WSREP_SST: [ERROR] xtrabackup process ended without creating ‘/var/lib/mysql//xtrabackup_galera_info’

It created three other files in /var/lib/mysql:
xtrabackup_checkpoints
xtrabackup_info
xtrabackup_logfile
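
A quick way to repeat that check after each failed attempt is a small Python sketch like the one below. It only uses the four file names mentioned above; the function name `sst_artifacts` is made up for illustration, and the script makes no claim about what else a successful SST leaves behind.

```python
import sys
from pathlib import Path

# File names taken from this post: the first is the one the SST script
# complained about, the other three are the ones that did show up.
EXPECTED = [
    "xtrabackup_galera_info",
    "xtrabackup_checkpoints",
    "xtrabackup_info",
    "xtrabackup_logfile",
]


def sst_artifacts(datadir):
    """Map each expected file name to True if it exists in datadir."""
    base = Path(datadir)
    return {name: (base / name).is_file() for name in EXPECTED}


if __name__ == "__main__":
    # Usage: python check_sst.py /var/lib/mysql
    for datadir in sys.argv[1:]:
        for name, present in sst_artifacts(datadir).items():
            status = "present" if present else "MISSING"
            print(f"{status:8} {datadir}/{name}")
```

Run against the joiner's data directory, this would report `xtrabackup_galera_info` as missing and the other three as present in the situation described here.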

aaaaaaand problem found: The failed system had this line in my.cnf:
wsrep_sst_method=xtrabackup

In the other two running systems, this line was commented out. So I commented out the line, deleted the grastate file again and restarted the mysql service. It forced a full SST, and from what I can see in the logfile, the default SST method is now set to xtrabackup-v2.
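
A mismatch like this is easy to miss by eye, so here is a rough Python sketch of how the active value of one option could be compared across the nodes' my.cnf files. The node labels and the sample config text are stand-ins that only mirror what is described above, and `active_option` and `compare` are names made up for this example. The parser is deliberately simple: it ignores section headers and include directives, and treats dashes and underscores in option names as equivalent.

```python
def active_option(cnf_text, option):
    """Return the last uncommented value of `option`, or None if it is unset."""
    wanted = option.replace("-", "_")
    value = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, full-line comments, and lines without a value.
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().replace("-", "_") == wanted:
            # Drop a trailing "# comment" if there is one.
            value = val.split("#", 1)[0].strip()
    return value


def compare(configs, option):
    """Map each node label to its active value for `option`."""
    return {node: active_option(text, option) for node, text in configs.items()}


# Stand-in config snippets (assumption): only the one line from this post.
CONFIGS = {
    "failed node": "[mysqld]\nwsrep_sst_method=xtrabackup\n",
    "running node A": "[mysqld]\n#wsrep_sst_method=xtrabackup\n",
    "running node B": "[mysqld]\n#wsrep_sst_method=xtrabackup\n",
}

if __name__ == "__main__":
    values = compare(CONFIGS, "wsrep_sst_method")
    for node, value in values.items():
        shown = value if value is not None else "(unset, server default applies)"
        print(f"{node}: {shown}")
    if len(set(values.values())) > 1:
        print("Mismatch: nodes do not agree on wsrep_sst_method")
```

In practice the text for each node would be read from that node's own /etc/mysql/my.cnf rather than from inline strings.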

Problem solved. For now. The original crash still baffles.