Trouble Automating Snapshots & Restoration (EC2+RightScale)

Now that I have Percona XtraDB up and working with CentOS 5.6, I’ve moved on to the stage of automating the cluster configuration within our RightScale environment on top of EC2. I need to be able to restore a node from a snapshot and have it rejoin the cluster.

PROBLEM: If I restore from an EBS snapshot, having set wsrep_cluster_address to the address of the other node (in a 2 node cluster) before starting mysql, mysql always fails to start and gives the following error:

Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)

I can telnet to that address from sentry1 and a connection is established, so the error message is no doubt misleading and masking a different issue.

I apologize in advance for the length of the email. I’ve tried to include everything relevant to the configuration.

I’m following the advice presented in this forum topic:
https://groups.google.com/forum/?fromgroups#!topic/codership-team/H1XqY5T8Cgo

  1. Lock all databases/tables & flush to disk
  2. Record grastate.dat (wsrep_local_state_uuid, wsrep_last_committed)
  3. xfs_freeze the filesystem
  4. execute EBS snapshot
  5. unfreeze filesystem
  6. unlink grastate.dat file
  7. free all database locks
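For completeness, here is a rough, untested sketch of how I intend to script those seven steps in a RightScript. The datadir, mount point, volume ID, and the ec2-create-snapshot call are placeholders for what RightScale actually does on our end; the grastate.dat contents are written in the same format as the example further down.

#!/bin/bash
# Rough sketch of the snapshot procedure above (paths and volume ID are placeholders).
set -e

DATADIR=/var/lib/mysql        # MySQL datadir (lives on the EBS volume)
MOUNTPOINT=/var/lib/mysql     # XFS mount point to freeze
EBS_VOLUME_ID=vol-xxxxxxxx    # placeholder for the real volume ID
PIPE=/tmp/mysql-lock.pipe

# Keep a single mysql session open so FLUSH TABLES WITH READ LOCK stays in
# effect until we explicitly UNLOCK TABLES at the end.
mkfifo "$PIPE"
mysql --unbuffered < "$PIPE" &
exec 3>"$PIPE"

# 1. Lock all tables and flush them to disk.
echo "FLUSH TABLES WITH READ LOCK;" >&3
sleep 5   # crude simplification: give the lock time to be acquired

# 2. Record the Galera position into grastate.dat so the snapshot carries it.
UUID=$(mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state_uuid'" | awk '{print $2}')
SEQNO=$(mysql -N -e "SHOW STATUS LIKE 'wsrep_last_committed'" | awk '{print $2}')
cat > "$DATADIR/grastate.dat" <<EOF
GALERA saved state
version: 2.1
uuid:    $UUID
seqno:   $SEQNO
cert_index:
EOF

# 3-5. Freeze the filesystem, take the EBS snapshot, then unfreeze.
xfs_freeze -f "$MOUNTPOINT"
ec2-create-snapshot "$EBS_VOLUME_ID" -d "galera snapshot"   # placeholder call
xfs_freeze -u "$MOUNTPOINT"

# 6. Unlink grastate.dat on the running node (it goes stale immediately).
rm -f "$DATADIR/grastate.dat"

# 7. Free all database locks and close the session.
echo "UNLOCK TABLES;" >&3
exec 3>&-
rm -f "$PIPE"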

The galera.cache file has been moved to ephemeral storage, off the EBS volume:
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"

It seems that this all works. I’ve confirmed that the contents of grastate.dat match the SHOW STATUS output taken before shutdown on an inactive test cluster.

Example of generated grastate.dat:

GALERA saved state
version: 2.1
uuid:    d2d0ee82-b5ac-11e1-0800-b938963402d3
seqno:   1
cert_index:

Here is the configuration I’m using to bootstrap a new galera cluster reference node:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
wsrep_cluster_address=gcomm://
wsrep_slave_threads=2
wsrep_cluster_name=sentry
wsrep_node_address=sentry1.ourdomain.com
wsrep_sst_method=rsync
wsrep_node_name=sentry1
binlog_format=ROW
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

Here is the configuration I’m using on the second node:

[mysqld]
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.dir=/mnt/mysql-binlogs; gcache.size=2097152000"
#wsrep_cluster_address=gcomm://sentry1.ourdomain.com
wsrep_slave_threads=2
wsrep_cluster_name=sentry
wsrep_node_address=sentry1.ourdomain.com
wsrep_sst_method=rsync
wsrep_node_name=sentry2
binlog_format=ROW
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

At this point, the cluster is working fine.

| wsrep_cluster_size   | 2       |
| wsrep_cluster_status | Primary |
| wsrep_connected      | ON      |
| wsrep_ready          | ON      |

To restore the "sentry1" node after I launch a new instance to replace it, I replace wsrep_cluster_address with "gcomm://sentry2.ourdomain.com" in the MySQL configuration before starting the server. This is where I run into problems, probably because of my limited understanding of Galera clustering.
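Concretely, the relaunch script does little more than this before starting the server (the config path and sed expression are illustrative; RightScale actually templates the file for us):

# Point the restored node at the surviving node, then start mysqld.
sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://sentry2.ourdomain.com|' /etc/my.cnf
service mysqld start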

Error log below from sentry1 after a relaunch using EBS snapshot:

120613 16:27:41 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
120613 16:27:41 [Note] Flashcache bypass: disabled
120613 16:27:41 [Note] Flashcache setup error is : ioctl failed
120613 16:27:41 [Note] WSREP: Read nil XID from storage engines, skipping position init
120613 16:27:41 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
120613 16:27:41 [Note] WSREP: wsrep_load(): Galera 2.1dev(r112) by Codership Oy <info@codership.com> loaded succesfully.
120613 16:27:41 [Note] WSREP: Found saved state: d2d0ee82-b5ac-11e1-0800-b938963402d3:1
120613 16:27:41 [Note] WSREP: Preallocating 2097153312/2097153312 bytes in '/mnt/mysql-binlogs/galera.cache'...
120613 16:28:47 [Note] WSREP: Passing config to GCS: base_host = sentry1.ourdomain.com; gcache.dir = /mnt/mysql-binlogs; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/mysql-binlogs/galera.cache; gcache.page_size = 128M; gcache.size = 2097152000; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 2147483647; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120613 16:28:48 [Note] WSREP: Assign initial position for certification: 1, protocol version: -1
120613 16:28:48 [Note] WSREP: wsrep_sst_grab()
120613 16:28:48 [Note] WSREP: Start replication
120613 16:28:48 [Note] WSREP: Setting initial position to d2d0ee82-b5ac-11e1-0800-b938963402d3:1
120613 16:28:48 [Note] WSREP: protonet asio version 0
120613 16:28:48 [Note] WSREP: backend: asio
120613 16:28:48 [Note] WSREP: GMCast version 0
120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
120613 16:28:48 [Note] WSREP: (86f87d10-b5af-11e1-0800-f871301c8bc4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
120613 16:28:48 [Note] WSREP: EVS version 0
120613 16:28:48 [Note] WSREP: PC version 0
120613 16:28:48 [Note] WSREP: gcomm: connecting to group 'sentry', peer 'sentry2.ourdomain.com:'
120613 16:28:50 [Note] WSREP: declaring bfd074ba-b5ac-11e1-0800-bb10122d629d stable
120613 16:28:50 [Note] WSREP: view(view_id(NON_PRIM,86f87d10-b5af-11e1-0800-f871301c8bc4,5) memb { 86f87d10-b5af-11e1-0800-f871301c8bc4, bfd074ba-b5ac-11e1-0800-bb10122d629d,} joined {} left {} partitioned { d2c537e0-b5ac-11e1-0800-ddbf4b2d9ba6,})
120613 16:28:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
120613 16:29:20 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():148
120613 16:29:20 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():195: Failed to open backend connection: -110 (Connection timed out)
120613 16:29:20 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1290: Failed to open channel 'sentry' at 'gcomm://sentry2.ourdomain.com': -110 (Connection timed out)
120613 16:29:20 [ERROR] WSREP: gcs connect failed: Connection timed out
120613 16:29:20 [ERROR] WSREP: wsrep::connect() failed: 6
120613 16:29:20 [ERROR] Aborting
120613 16:29:20 [Note] WSREP: Service disconnected.
120613 16:29:21 [Note] WSREP: Some threads may fail to exit.
120613 16:29:21 [Note] /usr/sbin/mysqld: Shutdown complete
120613 16:29:21 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Here I show that sentry1 has no problem connecting to sentry2:

telnet sentry2.ourdomain.com 4567
Trying 10.201.3.218...
Connected to sentry2.ourdomain.com (10.201.3.218).
Escape character is '^]'.
$???t???-b?jFI???]

Meanwhile, on sentry2, I see the following in the logs:

120613 16:34:39 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 240
120613 16:35:18 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 270
120613 16:35:57 [Note] WSREP: (bfd074ba-b5ac-11e1-0800-bb10122d629d, 'tcp://0.0.0.0:4567') reconnecting to 86f87d10-b5af-11e1-0800-f871301c8bc4 (tcp://10.206.246.66:4567), attempt 300

This makes sense, however, since sentry1 will not start.

Software packages:

Percona-XtraDB-Cluster-shared-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-server-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-devel-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-client-5.5.23-23.5.333.rhel5
Percona-XtraDB-Cluster-galera-2.0-1.112.rhel5
libstdc++44-devel-4.4.4-13.el5
gcc-c++-4.1.2-50.el5
gcc44-c++-4.4.4-13.el5
libstdc++-4.1.2-50.el5
libstdc++-devel-4.1.2-50.el5
gcc-objc++-4.1.2-50.el5
compat-libstdc++-296-2.96-138

Any help is much appreciated!

Thanks,

Erik Osterman

I should also note that, through the process of relaunching, "sentry1" will obtain a new IP address. When shutting down "sentry1" to relaunch, nothing more than a "service mysqld stop" is executed; nothing is done to leave the cluster in an orderly fashion.

-Erik

Hi Erik,

I think you have configured everything right (except wsrep_node_address, which is the same on both nodes), and the error you’re getting is clearly unrelated to configuration. Do you have a firewall configured on either of the nodes? Does sentry2 correctly resolve sentry1’s IP after sentry1 is restarted?
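For example, something along these lines on both nodes would rule those out (just a suggestion, adjust hostnames to your setup; on EC2 also check that the security group allows inbound 4567 from the other node):

iptables -L -n                        # any rules that could block tcp/4567?
getent hosts sentry1.ourdomain.com    # does the name already resolve to the NEW IP?
getent hosts sentry2.ourdomain.com
telnet sentry1.ourdomain.com 4567     # group communication must work in BOTH directions
telnet sentry2.ourdomain.com 4567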

BTW, "service mysqld stop" IS leaving the cluster in an orderly fashion.