garbd not alway able to start/join

I’ve successfully created a 3-node cluster, with the 3rd node consisting of a Galera Abritrator (garbd).

The problem I have encountered, is that garbd sometimes terminates and I am having a bit difficulty finding the exact cause.

When it goes bad, I get the following:

2012-11-27 20:09:23.744 INFO: Read config: daemon: 0 address: gcomm://192.168.15.1 group: my_cluster sst: trivial donor: options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes cfg: log: /var/log/garbd.log2012-11-27 20:09:24.022 INFO: protonet asio version 02012-11-27 20:09:24.022 INFO: backend: asio2012-11-27 20:09:24.023 INFO: GMCast version 02012-11-27 20:09:24.023 INFO: (3f56b880-38f8-11e2-0800-e0f1166e3e03, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:45672012-11-27 20:09:24.023 INFO: (3f56b880-38f8-11e2-0800-e0f1166e3e03, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 12012-11-27 20:09:24.024 INFO: EVS version 02012-11-27 20:09:24.024 INFO: PC version 02012-11-27 20:09:24.024 INFO: gcomm: connecting to group ‘my_cluster’, peer '192.168.15.1:'2012-11-27 20:09:24.305 INFO: (3f56b880-38f8-11e2-0800-e0f1166e3e03, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.6:4567 2012-11-27 20:09:26.432 INFO: (3f56b880-38f8-11e2-0800-e0f1166e3e03, ‘tcp://0.0.0.0:4567’) turning message relay requesting off2012-11-27 20:09:26.992 INFO: (3f56b880-38f8-11e2-0800-e0f1166e3e03, ‘tcp://0.0.0.0:4567’) cleaning up duplicate 0x21f63b0 after established 0x2207e902012-11-27 20:09:27.599 INFO: declaring 6b2cf467-3683-11e2-0800-8ab67f697dd9 stable2012-11-27 20:09:27.886 INFO: declaring f382ea77-3684-11e2-0800-7e392de96714 stable2012-11-27 20:09:28.386 WARN: last inactive check more than PT1.5S ago, skipping check2012-11-27 20:09:29.185 INFO: view(view_id(PRIM,3f56b880-38f8-11e2-0800-e0f1166e3e03,149) memb { 3f56b880-38f8-11e2-0800-e0f1166e3e03, 6b2cf467-3683-11e2-0800-8ab67f697dd9, f382ea77-3684-11e2-0800-7e392de96714,} joined {} left {} partitioned {})2012-11-27 20:09:30.142 WARN: last inactive check more than PT1.5S ago, skipping check2012-11-27 20:09:30.142 INFO: gcomm: connected2012-11-27 20:09:30.142 INFO: Changing maximum packet size to 64500, resulting msg size: 326362012-11-27 20:09:30.913 INFO: Shifting CLOSED → OPEN (TO: 0)2012-11-27 20:09:31.239 INFO: Opened channel 'my_cluster’2012-11-27 20:09:31.239 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 32012-11-27 20:09:31.240 INFO: STATE_EXCHANGE: sent state UUID: 43a3dfb0-38f8-11e2-0800-47536525193f2012-11-27 20:09:31.373 INFO: STATE EXCHANGE: sent state msg: 43a3dfb0-38f8-11e2-0800-47536525193f2012-11-27 20:09:33.999 INFO: STATE EXCHANGE: got state msg: 43a3dfb0-38f8-11e2-0800-47536525193f from 0 (garb)2012-11-27 20:09:33.999 INFO: STATE EXCHANGE: got state msg: 43a3dfb0-38f8-11e2-0800-47536525193f from 1 (server-1)2012-11-27 20:09:33.999 INFO: STATE EXCHANGE: got state msg: 43a3dfb0-38f8-11e2-0800-47536525193f from 2 (server-2)2012-11-27 20:09:33.999 INFO: Quorum results: version = 2, component = PRIMARY, conf_id = 92, members = 2/3 (joined/total), act_id = 1250, last_appl. = -1, protocols = 0/4/2 (gcs/repl/appl), group UUID = ab735913-3678-11e2-0800-e16a07f68d0f2012-11-27 20:09:33.999 INFO: Flow-control interval: [9999999, 9999999]2012-11-27 20:09:33.999 INFO: Shifting OPEN → PRIMARY (TO: 1250)2012-11-27 20:09:33.999 INFO: Sending state transfer request: ‘trivial’, size: 72012-11-27 20:09:34.132 INFO: Node 0 (garb) requested state transfer from ‘any’. Selected 1 (server-1)(SYNCED) as donor.2012-11-27 20:09:40.078 INFO: Shifting PRIMARY → JOINER (TO: 1250)2012-11-27 20:09:40.078 WARN: 1 (server-1): State transfer to 0 (garb) failed: -1 (Operation not permitted)2012-11-27 20:09:40.078 FATAL: gcs/src/gcs_group.c:gcs_group_handle_join_msg():712: Will never receive state. Need to abort.2012-11-27 20:09:40.078 INFO: gcomm: terminating thread2012-11-27 20:09:40.078 INFO: gcomm: joining thread2012-11-27 20:09:40.078 INFO: gcomm: closing backend2012-11-27 20:09:40.078 WARN: user message in state LEAVING2012-11-27 20:09:40.078 WARN: Proto::handle_down: Transport endpoint is not connected2012-11-27 20:09:40.078 WARN: Sending JOIN failed: -107 (Transport endpoint is not connected). Will retry in new primary component.2012-11-27 20:09:40.220 INFO: view(view_id(NON_PRIM,3f56b880-38f8-11e2-0800-e0f1166e3e03,149) memb { 3f56b880-38f8-11e2-0800-e0f1166e3e03,} joined {} left {} partitioned { 6b2cf467-3683-11e2-0800-8ab67f697dd9, f382ea77-3684-11e2-0800-7e392de96714,})2012-11-27 20:09:40.220 INFO: view((empty))2012-11-27 20:09:40.221 INFO: gcomm: closed2012-11-27 20:09:40.221 INFO: /usr/bin/garbd: Terminated.

When it works ok, the log shows:

2012-11-27 20:30:30.492 INFO: Read config: daemon: 0 address: gcomm://192.168.15.6 group: my_cluster sst: trivial donor: options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes cfg: log: /var/log/garbd.log2012-11-27 20:30:30.493 INFO: protonet asio version 02012-11-27 20:30:30.493 INFO: backend: asio2012-11-27 20:30:30.494 INFO: GMCast version 02012-11-27 20:30:30.494 INFO: (3236d82c-38fb-11e2-0800-9398a6bedbd7, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:45672012-11-27 20:30:30.494 INFO: (3236d82c-38fb-11e2-0800-9398a6bedbd7, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 12012-11-27 20:30:30.494 INFO: EVS version 02012-11-27 20:30:30.494 INFO: PC version 02012-11-27 20:30:30.494 INFO: gcomm: connecting to group ‘my_cluster’, peer '192.168.15.6:'2012-11-27 20:30:30.667 INFO: (3236d82c-38fb-11e2-0800-9398a6bedbd7, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.1:4567 2012-11-27 20:30:31.688 INFO: (3236d82c-38fb-11e2-0800-9398a6bedbd7, ‘tcp://0.0.0.0:4567’) turning message relay requesting off2012-11-27 20:30:31.781 INFO: (3236d82c-38fb-11e2-0800-9398a6bedbd7, ‘tcp://0.0.0.0:4567’) cleaning up duplicate 0x26663d0 after established 0x2677eb02012-11-27 20:30:31.806 INFO: declaring 6b2cf467-3683-11e2-0800-8ab67f697dd9 stable2012-11-27 20:30:31.806 INFO: declaring f382ea77-3684-11e2-0800-7e392de96714 stable2012-11-27 20:30:32.306 WARN: last inactive check more than PT1.5S ago, skipping check2012-11-27 20:30:33.130 INFO: view(view_id(PRIM,3236d82c-38fb-11e2-0800-9398a6bedbd7,155) memb { 3236d82c-38fb-11e2-0800-9398a6bedbd7, 6b2cf467-3683-11e2-0800-8ab67f697dd9, f382ea77-3684-11e2-0800-7e392de96714,} joined {} left {} partitioned {})2012-11-27 20:30:33.140 INFO: gcomm: connected2012-11-27 20:30:33.140 INFO: Changing maximum packet size to 64500, resulting msg size: 326362012-11-27 20:30:33.140 INFO: Shifting CLOSED → OPEN (TO: 0)2012-11-27 20:30:33.141 INFO: Opened channel 'my_cluster’2012-11-27 20:30:33.141 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 32012-11-27 20:30:33.141 INFO: STATE_EXCHANGE: sent state UUID: 33cac840-38fb-11e2-0800-e4c4ef8959b42012-11-27 20:30:33.274 INFO: STATE EXCHANGE: sent state msg: 33cac840-38fb-11e2-0800-e4c4ef8959b42012-11-27 20:30:33.394 INFO: STATE EXCHANGE: got state msg: 33cac840-38fb-11e2-0800-e4c4ef8959b4 from 0 (garb)2012-11-27 20:30:33.394 INFO: STATE EXCHANGE: got state msg: 33cac840-38fb-11e2-0800-e4c4ef8959b4 from 1 (server-1)2012-11-27 20:30:33.394 INFO: STATE EXCHANGE: got state msg: 33cac840-38fb-11e2-0800-e4c4ef8959b4 from 2 (server-2)2012-11-27 20:30:33.394 INFO: Quorum results: version = 2, component = PRIMARY, conf_id = 98, members = 2/3 (joined/total), act_id = 1255, last_appl. = -1, protocols = 0/4/2 (gcs/repl/appl), group UUID = ab735913-3678-11e2-0800-e16a07f68d0f2012-11-27 20:30:33.394 INFO: Flow-control interval: [9999999, 9999999]2012-11-27 20:30:33.394 INFO: Shifting OPEN → PRIMARY (TO: 1255)2012-11-27 20:30:33.394 INFO: Sending state transfer request: ‘trivial’, size: 72012-11-27 20:30:33.527 INFO: Node 0 (garb) requested state transfer from ‘any’. Selected 1 (server-1)(SYNCED) as donor.2012-11-27 20:30:33.527 INFO: Shifting PRIMARY → JOINER (TO: 1255)2012-11-27 20:30:33.658 INFO: 0 (garb): State transfer from 1 (server-1) complete.2012-11-27 20:30:33.658 INFO: Shifting JOINER → JOINED (TO: 1255)2012-11-27 20:30:33.658 WARN: 1 (server-1): State transfer to 0 (garb) failed: -1 (Operation not permitted)2012-11-27 20:30:33.792 INFO: Member 0 (garb) synced with group.2012-11-27 20:30:33.792 INFO: Shifting JOINED → SYNCED (TO: 1255)2012-11-27 20:30:33.792 INFO: Member 1 (server-1) synced with group.

The corresponding logs on server-1, when garbd terminates:

121128 4:08:38 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.5:4567 121128 4:08:39 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting off121128 4:08:39 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) cleaning up duplicate 0x14697c0 after established 0x7fc2a4849b60121128 4:08:39 [Note] WSREP: declaring 543d7351-3908-11e2-0800-c2f92b164aeb stable121128 4:08:39 [Note] WSREP: declaring e7e5072e-3908-11e2-0800-e0db842b57ec stable121128 4:08:40 [Note] WSREP: view(view_id(PRIM,386bebb4-3908-11e2-0800-09c99c448bc8,205) memb { 386bebb4-3908-11e2-0800-09c99c448bc8, 543d7351-3908-11e2-0800-c2f92b164aeb, e7e5072e-3908-11e2-0800-e0db842b57ec,} joined {} left {} partitioned {})121128 4:08:40 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3121128 4:08:40 [Note] WSREP: STATE_EXCHANGE: sent state UUID: e8a69ba0-3908-11e2-0800-82414bc10955121128 4:08:40 [Note] WSREP: STATE EXCHANGE: sent state msg: e8a69ba0-3908-11e2-0800-82414bc10955121128 4:08:40 [Note] WSREP: STATE EXCHANGE: got state msg: e8a69ba0-3908-11e2-0800-82414bc10955 from 0 (server-1)121128 4:08:40 [Note] WSREP: STATE EXCHANGE: got state msg: e8a69ba0-3908-11e2-0800-82414bc10955 from 1 (server-2)121128 4:08:40 [Note] WSREP: STATE EXCHANGE: got state msg: e8a69ba0-3908-11e2-0800-82414bc10955 from 2 (garb)121128 4:08:40 [Note] WSREP: Quorum results: version = 2, component = PRIMARY, conf_id = 136, members = 2/3 (joined/total), act_id = 1274, last_appl. = 0, protocols = 0/4/2 (gcs/repl/appl), group UUID = ab735913-3678-11e2-0800-e16a07f68d0f121128 4:08:40 [Note] WSREP: Flow-control interval: [14, 28]121128 4:08:40 [Note] WSREP: New cluster view: global state: ab735913-3678-11e2-0800-e16a07f68d0f:1274, view# 137: Primary, number of nodes: 3, my index: 0, protocol version 2121128 4:08:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.121128 4:08:40 [Note] WSREP: Assign initial position for certification: 1274, protocol version: 2121128 4:08:40 [Note] WSREP: Node 2 (garb) requested state transfer from ‘any’. Selected 0 (server-1)(SYNCED) as donor.121128 4:08:40 [Note] WSREP: Shifting SYNCED → DONOR/DESYNCED (TO: 1274)121128 4:08:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.121128 4:08:40 [Note] WSREP: Running: 'wsrep_sst_trivial --role ‘donor’ --address ‘’ --auth ‘(null)’ --socket ‘/var/run/mysqld/mysqld.sock’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/mysql/my.cnf’ --gtid ‘ab735913-3678-11e2-0800-e16a07f68d0f:1274’'121128 4:08:40 [Note] WSREP: sst_donor_thread signaled with 0sh: wsrep_sst_trivial: command not found121128 4:08:40 [ERROR] WSREP: Failed to read from: wsrep_sst_trivial --role ‘donor’ --address ‘’ --auth ‘(null)’ --socket ‘/var/run/mysqld/mysqld.sock’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/mysql/my.cnf’ --gtid 'ab735913-3678-11e2-0800-e16a07f68d0f:1274’121128 4:08:40 [ERROR] WSREP: Process completed with error: wsrep_sst_trivial --role ‘donor’ --address ‘’ --auth ‘(null)’ --socket ‘/var/run/mysqld/mysqld.sock’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/mysql/my.cnf’ --gtid ‘ab735913-3678-11e2-0800-e16a07f68d0f:1274’: 2 (No such file or directory)121128 4:08:40 [Warning] WSREP: 0 (server-1): State transfer to 2 (garb) failed: -1 (Operation not permitted)121128 4:08:40 [Note] WSREP: Shifting DONOR/DESYNCED → JOINED (TO: 1274)121128 4:08:40 [Note] WSREP: 2 (garb): State transfer from 0 (server-1) complete.121128 4:08:40 [Note] WSREP: Member 0 (server-1) synced with group.121128 4:08:40 [Note] WSREP: declaring 543d7351-3908-11e2-0800-c2f92b164aeb stable121128 4:08:40 [Note] WSREP: Shifting JOINED → SYNCED (TO: 1274)121128 4:08:40 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.5:4567 121128 4:08:40 [Note] WSREP: Synchronized with group, ready for connections121128 4:08:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.121128 4:08:41 [Note] WSREP: view(view_id(PRIM,386bebb4-3908-11e2-0800-09c99c448bc8,206) memb { 386bebb4-3908-11e2-0800-09c99c448bc8, 543d7351-3908-11e2-0800-c2f92b164aeb,} joined {} left {} partitioned { e7e5072e-3908-11e2-0800-e0db842b57ec,})121128 4:08:41 [Note] WSREP: forgetting e7e5072e-3908-11e2-0800-e0db842b57ec (tcp://192.168.15.5:4567)121128 4:08:41 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting off121128 4:08:41 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2121128 4:08:41 [Note] WSREP: STATE_EXCHANGE: sent state UUID: e954cbf0-3908-11e2-0800-d76b8ae654a4121128 4:08:41 [Note] WSREP: STATE EXCHANGE: sent state msg: e954cbf0-3908-11e2-0800-d76b8ae654a4121128 4:08:41 [Note] WSREP: STATE EXCHANGE: got state msg: e954cbf0-3908-11e2-0800-d76b8ae654a4 from 0 (server-1)121128 4:08:41 [Note] WSREP: STATE EXCHANGE: got state msg: e954cbf0-3908-11e2-0800-d76b8ae654a4 from 1 (server-2)121128 4:08:41 [Note] WSREP: Quorum results: version = 2, component = PRIMARY, conf_id = 137, members = 2/2 (joined/total), act_id = 1274, last_appl. = 0, protocols = 0/4/2 (gcs/repl/appl), group UUID = ab735913-3678-11e2-0800-e16a07f68d0f121128 4:08:41 [Note] WSREP: Flow-control interval: [12, 23]121128 4:08:41 [Note] WSREP: New cluster view: global state: ab735913-3678-11e2-0800-e16a07f68d0f:1274, view# 138: Primary, number of nodes: 2, my index: 0, protocol version 2121128 4:08:41 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.121128 4:08:41 [Note] WSREP: Assign initial position for certification: 1274, protocol version: 2

There’s a couple of things that I’m wondering about. In the failed log on the server running garbd, I see it sets “sst trivial” by default. This seems to be reflected on server-1, with:

WSREP: Running: ‘wsrep_sst_trivial --role ‘donor’ --address ‘’ --auth ‘(null)’ --socket ‘/var/run/mysqld/mysqld.sock’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/mysql/my.cnf’ --gtid ‘ab735913-3678-11e2-0800-e16a07f68d0f:1274’’

However, this error also occurs when garbd is able to connect/join.

Another thing, is that when garbd terminates, there’s a 2nd part to the log of server-1, which shows a “partitioned” entry:

Initially:…121128 4:08:38 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.5:4567 121128 4:08:39 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting off121128 4:08:39 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) cleaning up duplicate 0x14697c0 after established 0x7fc2a4849b60121128 4:08:39 [Note] WSREP: declaring 543d7351-3908-11e2-0800-c2f92b164aeb stable121128 4:08:39 [Note] WSREP: declaring e7e5072e-3908-11e2-0800-e0db842b57ec stable121128 4:08:40 [Note] WSREP: view(view_id(PRIM,386bebb4-3908-11e2-0800-09c99c448bc8,205) memb { 386bebb4-3908-11e2-0800-09c99c448bc8, 543d7351-3908-11e2-0800-c2f92b164aeb, e7e5072e-3908-11e2-0800-e0db842b57ec,} joined {} left {} partitioned {})Then again:…121128 4:08:40 [Note] WSREP: (386bebb4-3908-11e2-0800-09c99c448bc8, ‘tcp://0.0.0.0:4567’) turning message relay requesting on, nonlive peers: tcp://192.168.15.5:4567 121128 4:08:40 [Note] WSREP: Synchronized with group, ready for connections121128 4:08:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.121128 4:08:41 [Note] WSREP: view(view_id(PRIM,386bebb4-3908-11e2-0800-09c99c448bc8,206) memb { 386bebb4-3908-11e2-0800-09c99c448bc8, 543d7351-3908-11e2-0800-c2f92b164aeb,} joined {} left {} partitioned { e7e5072e-3908-11e2-0800-e0db842b57ec,})

Additional details

wresp portion of my.cnf server-1 (192.168.15.1):

wsrep_cluster_address = gcomm://192.168.15.6server-id = 1log_bin = mysql-binbinlog_format = ROWwsrep_provider = /usr/lib64/libgalera_smm.sowsrep_slave_threads = 2wsrep_cluster_name = my_clusterwsrep_sst_method = xtrabackupwsrep_node_name = server-1

garbd startup:

/usr/bin/garbd --log /var/log/garbd.log --address gcomm://192.168.15.6 --group my_cluster

I wish there was a consistency to the problem, but at times garbd starts fine many times in a row, then it fails either once or many times in a row…

So, to answer myself here: Based on the input of Alexey Yurchenko of Codership, the issue is due to the use of Galera 2.2 r115 - which is currently the one provided on Percona’s Debian main repository. Upgrading to r116 alleviated the problem, which can be found in Percona’s Debian experimental repository (note: Alexey specified r117, actually).