Now that I have Percona XtraDB up and working on CentOS 5.6, I’ve moved on to the stage of automating the cluster configuration within our RightScale environment on top of EC2. I need to be able to restore a node from a snapshot and have it rejoin the cluster.
PROBLEM: If I restore from an EBS snapshot, having set wsrep_cluster_address to the address of the other node (in a 2-node cluster) before starting mysql, mysql always fails to start with the following error:
Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)
I can telnet to the address from sentry1 and a connection is established, so the error message is no doubt misleading and masking a different issue.
I apologize in advance for the length of the email. I’ve tried to include everything relevant to the configuration.
I’m following the advice presented in this forum topic (a rough sketch of how I’ve scripted these steps follows the list below):
https://groups.google.com/forum/?fromgroups#!topic/codership-team/H1XqY5T8Cgo
- Lock all databases/tables & flush to disk
- Record grastate.dat (wsrep_local_state_uuid, wsrep_last_committed)
- xfs_freeze the filesystem
- execute EBS snapshot
- unfreeze filesystem
- unlink grastate.dat file
- free all database locks
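For reference, here is a rough sketch of how these steps are scripted (credentials, the volume ID, and the ec2-create-snapshot call are simplified placeholders for what the RightScale scripts actually do):

#!/bin/bash
set -e

# Hold FLUSH TABLES WITH READ LOCK open in a background session;
# the lock is only released when the write end of the fifo (fd 3) closes.
mkfifo /tmp/galera-lock.fifo
mysql -uroot < /tmp/galera-lock.fifo &
exec 3>/tmp/galera-lock.fifo
echo "FLUSH TABLES WITH READ LOCK;" >&3

# Record the values that go into grastate.dat (used further below).
mysql -uroot -N -B -e "SHOW STATUS WHERE Variable_name IN ('wsrep_local_state_uuid','wsrep_last_committed');" > /tmp/wsrep_state.txt

xfs_freeze -f /var/lib/mysql          # mount point of the EBS filesystem
ec2-create-snapshot vol-xxxxxxxx      # placeholder for the real snapshot call
xfs_freeze -u /var/lib/mysql          # thaw

rm -f /var/lib/mysql/grastate.dat     # unlink grastate.dat
exec 3>&-                             # close the session, freeing all locks
rm -f /tmp/galera-lock.fifo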
The galera.cache file has been moved to ephemeral storage so that it is not on the EBS volume:
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
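After startup I can confirm the cache file really lands on the ephemeral volume:

ls -lh /mnt/mysql-binlogs/galera.cache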
It seems that this all works. I’ve confirmed that the contents of grastate.dat match the SHOW STATUS output taken before shutdown on an inactive test cluster.
Example of generated grastate.dat:
# GALERA saved state
version: 2.1
uuid:    d2d0ee82-b5ac-11e1-0800-b938963402d3
seqno:   1
cert_index:
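The file is generated between the record and freeze steps with a small helper along these lines (assuming the status values were saved to /tmp/wsrep_state.txt as in the sketch above; the format mirrors what Galera writes on a clean shutdown):

UUID=$(awk '/wsrep_local_state_uuid/ { print $2 }' /tmp/wsrep_state.txt)
SEQNO=$(awk '/wsrep_last_committed/ { print $2 }' /tmp/wsrep_state.txt)

cat > /var/lib/mysql/grastate.dat <<EOF
# GALERA saved state
version: 2.1
uuid:    ${UUID}
seqno:   ${SEQNO}
cert_index:
EOF
chown mysql:mysql /var/lib/mysql/grastate.dat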
Here is the configuration I’m using to bootstrap a new galera cluster reference node:
[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
wsrep_cluster_address=gcomm://
wsrep_slave_threads=2
wsrep_cluster_name=sentry
wsrep_node_address=sentry1.ourdomain.com
wsrep_sst_method=rsync
wsrep_node_name=sentry1
binlog_format=ROW
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
Here is the configuration I’m using on the second node:
[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
wsrep_cluster_address=gcomm://sentry1.ourdomain.com
wsrep_slave_threads=2
wsrep_cluster_name=sentry
wsrep_node_address=sentry2.ourdomain.com
wsrep_sst_method=rsync
wsrep_node_name=sentry2
binlog_format=ROW
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
At this point, the cluster is working fine.
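I check this on either node with:

mysql -uroot -e "SHOW GLOBAL STATUS LIKE 'wsrep%';"

Trimmed to the relevant rows: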
| wsrep_cluster_size   | 2       |
| wsrep_cluster_status | Primary |
| wsrep_connected      | ON      |
| wsrep_ready          | ON      |
To restore the "sentry1" node after I launch a new instance to replace it, I replace wsrep_cluster_address with "gcomm://sentry2.ourdomain.com" in the MySQL configuration before starting the server. This is where I run into problems, probably due to my limited understanding of Galera clustering.
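The relaunch script swaps the address in place before starting the server, roughly like this (the my.cnf path is illustrative; ours is templated by RightScale):

sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://sentry2.ourdomain.com|' /etc/my.cnf
service mysql start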
Error log below from sentry1 after a relaunch using EBS snapshot:
120613 16:27:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
120613 16:27:41 [Note] Flashcache bypass: disabled
120613 16:27:41 [Note] Flashcache setup error is : ioctl failed
120613 16:27:41 [Note] WSREP: Read nil XID from storage engines, skipping position init
120613 16:27:41 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
120613 16:27:41 [Note] WSREP: wsrep_load(): Galera 2.1dev(r112) by Codership Oy <info@codership.com> loaded succesfully.
120613 16:27:41 [Note] WSREP: Found saved state: d2d0ee82-b5ac-11e1-0800-b938963402d3:1
120613 16:27:41 [Note] WSREP: Preallocating 2097153312/2097153312 bytes in '/mnt/mysql-binlogs/galera.cache'...
120613 16:28:47 [Note] WSREP: Passing config to GCS: base_host = sentry1.ourdomain.com; gcache.dir = /mnt/mysql-binlogs; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/mysql-binlogs/galera.cache; gcache.page_size = 128M; gcache.size = 2097152000; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 2147483647; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120613 16:28:48 [Note] WSREP: Assign initial position for certification: 1, protocol version: -1
120613 16:28:48 [Note] WSREP: wsrep_sst_grab()
120613 16:28:48 [Note] WSREP: Start replication
120613 16:28:48 [Note] WSREP: Setting initial position to d2d0ee82-b5ac-11e1-0800-b938963402d3:1
120613 16:28:48 [Note] WSREP: protonet asio version 0
120613 16:28:48 [Note] WSREP: backend: asio
120613 16:28:48 [Note] WSREP: GMCast version 0
120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
120613 16:28:48 [Note] WSREP: EVS version 0
120613 16:28:48 [Note] WSREP: PC version 0
120613 16:28:48 [Note] WSREP: gcomm: connecting to group 'sentry', peer 'sentry2.ourdomain.com:'
120613 16:28:50 [Note] WSREP: declaring bfd074ba-b5ac-11e1-0800-bb10122d629d stable
120613 16:28:50 [Note] WSREP: view(view_id(NON_PRIM,86f87d10-b5af-11e1-0800-f871301c8bc4,5) memb { 86f87d10-b5af-11e1-0800-f871301c8bc4, bfd074ba-b5ac-11e1-0800-bb10122d629d,} joined {} left {} partitioned { d2c537e0-b5ac-11e1-0800-ddbf4b2d9ba6,})
120613 16:28:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
120613 16:29:20 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():148
120613 16:29:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
120613 16:29:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)
120613 16:29:20 [ERROR] WSREP: gcs connect failed: Connection timed out
120613 16:29:20 [ERROR] WSREP: wsrep::connect() failed: 6
120613 16:29:20 [ERROR] Aborting
120613 16:29:20 [Note] WSREP: Service disconnected.
120613 16:29:21 [Note] WSREP: Some threads may fail to exit.
120613 16:29:21 [Note] /usr/sbin/mysqld: Shutdown complete
120613 16:29:21 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
Here I show that sentry1 has no problem connecting to sentry2:
telnet sentry2.ourdomain.com 4567
Trying 10.201.3.218...
Connected to sentry2.ourdomain.com (10.201.3.218).
Escape character is '^]'.
$???t???-b?jFI???]
Meanwhile, on sentry2, I see the following in the logs:
120613 16:34:39 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 240
120613 16:35:18 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 270
120613 16:35:57 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 300
This makes sense, however, since sentry1 will not start.
Software packages:
Percona-XtraDB-Cluster-shared-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-server-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-devel-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-client-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-galera-2.0-1.112.rhel5
libstdc++44-devel-4.4.4-13.el5
gcc-c++-4.1.2-50.el5
gcc44-c++-4.4.4-13.el5
libstdc++-4.1.2-50.el5
libstdc++-devel-4.1.2-50.el5
gcc-objc++-4.1.2-50.el5
compat-libstdc++-296-2.96-138
Any help is much appreciated!
Thanks,
Erik Osterman