Only able to start first node in cluster

Hello,

I’m having a bit of trouble with my Percona XtraDB cluster. I’m not quite sure what started this as it was working just fine. I had to reboot one of the servers (there are 3 nodes) and it wouldn’t start again. Then I restarted a second and it too wouldn’t restart. Currently the only one that runs is the one I bootscrap by using the command /etc/init.d/mysql bootstrap-pxc. Below are entries from the log file and other details that I hope will help shed some light on this issue.

Some details that might help:

  • rsync is the same version on all servers, v3.0.6
  • Percona versin install from rpm: Percona-XtraDB-Cluster-shared-55-5.5.34-25.9.607

Can anyone assite me in troubleshooting this?

P.S.
Yep, I will certainly change the password after I get this working again :slight_smile:

Details:

mysql.cnf contents on all servers:
[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.10.31,192.168.10.32,192.168.10.39
wsrep_slave_threads=8
wsrep_sst_method=rsync
binlog_format=ROW
default_storage_engine=InnoDB
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
wsrep_sst_auth=sstuser:s3cret
wsrep_replicate_myisam=on
log-error=/var/log/mysqld.log

From the client log that won’t start:
131206 04:20:54 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
131206 04:20:54 mysqld_safe WSREP: Running position recovery with --log_error=’/var/lib/mysql/wsrep_recovery.7Y8asL’ --pid-file=’/var/lib/mysql/mail3.devdigital.com-recover.pid’
131206 04:20:57 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
131206 4:20:57 [Note] WSREP: wsrep_start_position var submitted: ‘00000000-0000-0000-0000-000000000000:-1’
131206 4:20:57 [Note] WSREP: Read nil XID from storage engines, skipping position init
131206 4:20:57 [Note] WSREP: wsrep_load(): loading provider library ‘/usr/lib/libgalera_smm.so’
131206 4:20:57 [Note] WSREP: wsrep_load(): Galera 2.8(r165) by Codership Oy <info@codership.com> loaded successfully.
131206 4:20:57 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
131206 4:20:57 [Note] WSREP: Reusing existing ‘/var/lib/mysql//galera.cache’.
131206 4:20:57 [Note] WSREP: Passing config to GCS: base_host = 192.168.10.39; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 2147483647; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
131206 4:20:57 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
131206 4:20:57 [Note] WSREP: wsrep_sst_grab()
131206 4:20:57 [Note] WSREP: Start replication
131206 4:20:57 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
131206 4:20:57 [Note] WSREP: protonet asio version 0
131206 4:20:57 [Note] WSREP: backend: asio
131206 4:20:57 [Note] WSREP: GMCast version 0
131206 4:20:57 [Note] WSREP: (1880d3ce-5e60-11e3-bf16-efb486db5fe4, ‘tcp://0.0.0.0:4567’) listening at tcp://0.0.0.0:4567
131206 4:20:57 [Note] WSREP: (1880d3ce-5e60-11e3-bf16-efb486db5fe4, ‘tcp://0.0.0.0:4567’) multicast: , ttl: 1
131206 4:20:57 [Note] WSREP: EVS version 0
131206 4:20:57 [Note] WSREP: PC version 0
131206 4:20:57 [Note] WSREP: gcomm: connecting to group ‘my_wsrep_cluster’, peer ‘192.168.10.31:,192.168.10.32:,192.168.10.39:’
131206 4:20:57 [Warning] WSREP: (1880d3ce-5e60-11e3-bf16-efb486db5fe4, ‘tcp://0.0.0.0:4567’) address ‘tcp://192.168.10.39:4567’ points to own listening address, blacklisting
131206 4:20:57 [Note] WSREP: (1880d3ce-5e60-11e3-bf16-efb486db5fe4, ‘tcp://0.0.0.0:4567’) address ‘tcp://192.168.10.39:4567’ pointing to uuid 1880d3ce-5e60-11e3-bf16-efb486db5fe4 is blacklisted, skipping
131206 4:20:57 [Note] WSREP: declaring 20d04133-5e60-11e3-8215-6772cc3565ab stable
131206 4:20:57 [Note] WSREP: Node 20d04133-5e60-11e3-8215-6772cc3565ab state prim
131206 4:20:57 [Note] WSREP: view(view_id(PRIM,1880d3ce-5e60-11e3-bf16-efb486db5fe4,2) memb {
1880d3ce-5e60-11e3-bf16-efb486db5fe4,
20d04133-5e60-11e3-8215-6772cc3565ab,
} joined {
} left {
} partitioned {
})
131206 4:20:57 [Note] WSREP: discarding pending addr without UUID: tcp://192.168.10.32:4567
131206 4:20:58 [Note] WSREP: gcomm: connected
131206 4:20:58 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
131206 4:20:58 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
131206 4:20:58 [Note] WSREP: Opened channel ‘my_wsrep_cluster’
131206 4:20:58 [Note] WSREP: Waiting for SST to complete.
131206 4:20:58 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
131206 4:20:58 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 191adee8-5e60-11e3-bc41-7ee12e5406fe
131206 4:20:58 [Note] WSREP: STATE EXCHANGE: sent state msg: 191adee8-5e60-11e3-bc41-7ee12e5406fe
131206 4:20:58 [Note] WSREP: STATE EXCHANGE: got state msg: 191adee8-5e60-11e3-bc41-7ee12e5406fe from 0 (mail3.devdigital.com)
131206 4:20:58 [Note] WSREP: STATE EXCHANGE: got state msg: 191adee8-5e60-11e3-bc41-7ee12e5406fe from 1 (mail1.devdigital.com)
131206 4:20:58 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 1,
members = 1/2 (joined/total),
act_id = 47556,
last_appl. = -1,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 6ebd7a54-51dd-11e3-bb4c-12d4c65d64a6
131206 4:20:58 [Note] WSREP: Flow-control interval: [23, 23]
131206 4:20:58 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 47556)
131206 4:20:58 [Note] WSREP: State transfer required:
Group state: 6ebd7a54-51dd-11e3-bb4c-12d4c65d64a6:47556
Local state: 00000000-0000-0000-0000-000000000000:-1
131206 4:20:58 [Note] WSREP: New cluster view: global state: 6ebd7a54-51dd-11e3-bb4c-12d4c65d64a6:47556, view# 2: Primary, number of nodes: 2, my index: 0, protocol version 2
131206 4:20:58 [Warning] WSREP: Gap in state sequence. Need state transfer.
131206 4:21:00 [Note] WSREP: Running: ‘wsrep_sst_rsync --role ‘joiner’ --address ‘192.168.10.39’ --auth ‘sstuser:s3cret’ --datadir ‘/var/lib/mysql/’ --defaults-file ‘/etc/my.cnf’ --parent ‘11723’’
131206 4:21:00 [Note] WSREP: Prepared SST request: rsync|192.168.10.39:4444/rsync_sst
131206 4:21:00 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
131206 4:21:00 [Note] WSREP: Assign initial position for certification: 47556, protocol version: 2
131206 4:21:00 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (6ebd7a54-51dd-11e3-bb4c-12d4c65d64a6): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():445. IST will be unavailable.
131206 4:21:00 [Note] WSREP: Node 0 (mail3.devdigital.com) requested state transfer from ‘any’. Selected 1 (mail1.devdigital.com)(SYNCED) as donor.
131206 4:21:00 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 47556)
131206 4:21:00 [Note] WSREP: Requesting state transfer: success, donor: 1
131206 4:22:03 [Warning] WSREP: 1 (mail1.devdigital.com): State transfer to 0 (mail3.devdigital.com) failed: -1 (Operation not permitted)
131206 4:22:03 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():719: Will never receive state. Need to abort.
131206 4:22:03 [Note] WSREP: gcomm: terminating thread
131206 4:22:03 [Note] WSREP: gcomm: joining thread
131206 4:22:03 [Note] WSREP: gcomm: closing backend
131206 4:22:04 [Note] WSREP: view(view_id(NON_PRIM,1880d3ce-5e60-11e3-bf16-efb486db5fe4,2) memb {
1880d3ce-5e60-11e3-bf16-efb486db5fe4,
} joined {
} left {
} partitioned {
20d04133-5e60-11e3-8215-6772cc3565ab,
})
131206 4:22:04 [Note] WSREP: view((empty))
131206 4:22:04 [Note] WSREP: gcomm: closed
131206 4:22:04 [Note] WSREP: /usr/sbin/mysqld: Terminated.
131206 04:22:04 mysqld_safe mysqld from pid file /var/lib/mysql/mail3.devdigital.com.pid ended
WSREP_SST: [ERROR] Parent mysqld process (PID:11723) terminated unexpectedly. (20131206 04:22:04.988)
WSREP_SST: [INFO] Joiner cleanup. (20131206 04:22:04.992)
WSREP_SST: [INFO] Joiner cleanup done. (20131206 04:22:05.506)

While bootstrapping keep gcomm address empty(later you can change it). From the above log the state UUID is [COLOR=#252C2F] 00000000-0000-0000-0000-000000000000, that means its trying SST not IST.
Try to up other nodes, and check what logs is coming.

[COLOR=#252C2F]