Hi,
I have a three-node XtraDB Cluster that was running fine until November 2nd. Sometime after that, one node went down: it stopped answering requests and the DB was inaccessible because the process had stopped.
I rebooted the machine, but when I tried to start the DB with the command service mysql start, the start failed. The relevant part of the log file is below. It’s the only part where I can see something going wrong.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Prepared SST request: xtrabackup-v2|172.16.1.14:4444/xtrabackup_sst//1
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: REPL Protocols: 7 (3, 2)
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Service thread queue flushed.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Assign initial position for certification: 17272975, protocol version: 3
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Service thread queue flushed.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Prepared IST receiver, listening at: tcp://172.16.1.14:4568
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Member 1.0 (host1) requested state transfer from 'any'. Selected 0.0 (host3)(SYNCED) as donor.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 17272985)
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Requesting state transfer: success, donor: 0
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Warning] WSREP: 0.0 (host3): State transfer to 1.0 (host1) failed: -12 (Cannot allocate memory)
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: gcomm: terminating thread
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: gcomm: joining thread
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: gcomm: closing backend
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: view(view_id(NON_PRIM,542eb23e,56) memb {
Nov 4 20:32:47 host1.domain.net mysqld:     747ab10a,0
Nov 4 20:32:47 host1.domain.net mysqld: } joined {
Nov 4 20:32:47 host1.domain.net mysqld: } left {
Nov 4 20:32:47 host1.domain.net mysqld: } partitioned {
Nov 4 20:32:47 host1.domain.net mysqld:     542eb23e,0
Nov 4 20:32:47 host1.domain.net mysqld:     89558737,0
Nov 4 20:32:47 host1.domain.net mysqld: })
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: view((empty))
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: gcomm: closed
Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: /usr/sbin/mysqld: Terminated.
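To separate the lines that matter from the [Note] noise, here is a minimal Python sketch. It assumes the syslog format shown in the excerpt above. The function name wsrep_problems and the sample list are only illustrative and not part of any Percona or Galera tooling.

```python
import re

# Matches the mysqld severity tag followed by the WSREP prefix,
# e.g. "[Warning] WSREP: ..." or "[ERROR] WSREP: ...".
SEVERITY = re.compile(r"\[(Note|Warning|ERROR)\]\s+WSREP:\s*(.*)")


def wsrep_problems(lines):
    """Return (severity, message) for every WSREP line that is not a Note."""
    problems = []
    for line in lines:
        match = SEVERITY.search(line)
        if match and match.group(1) != "Note":
            problems.append((match.group(1), match.group(2).strip()))
    return problems


# Three lines copied from the log excerpt (hypothetical sample for illustration).
sample = [
    "Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Note] WSREP: Requesting state transfer: success, donor: 0",
    "Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [Warning] WSREP: 0.0 (host3): State transfer to 1.0 (host1) failed: -12 (Cannot allocate memory)",
    "Nov 4 20:32:47 host1.domain.net mysqld: 2016-11-04 20:32:47 6050 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.",
]

for severity, message in wsrep_problems(sample):
    print(severity, message)
```

Run over the excerpt, this leaves only the [Warning] reported for host3 and the [ERROR] that follows it.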
Has anyone seen this before and knows what can be done about it? I have no clue at this point, especially as the server was running fine until two days ago.
By the way, nothing has changed on any of the cluster members in the past two months.
Thanks.