Hello,
I’m trying to set up Percona XtraDB Cluster, but the xtrabackup state transfer doesn’t seem to be working.
The version I use is from the generic binary, Percona-XtraDB-Cluster-5.5.24-23.6.342.Linux.x86_64.
The system is Linux 3.3.1-gentoo x86_64.
xtrabackup is from Gentoo’s ebuild repository:
xtrabackup version 2.0.0 for Percona Server 5.5.16 Linux (x86_64) (revision id: undefined)
innobackupex is from the generic binary since it isn’t installed with xtrabackup in Gentoo:
InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy and Percona Inc 2009-2012. All Rights Reserved.
I’ve had to modify the wsrep_sst_xtrabackup script so it calls Gentoo’s nc in the nc -l -p form.
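The edit itself is tiny; here is a sketch of it, assuming the script builds its netcat listener roughly the way I remember (the real variable holding the port has a different name, ${PORT} is just a placeholder):

# receiving side of wsrep_sst_xtrabackup
# before: BSD-style netcat listener, something like
#   nc -dl ${PORT}
# after: Gentoo's (GNU) netcat wants the port passed with -p
nc -l -p ${PORT}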
The Cluster is installed in /opt/mysql and the startup script has the following in it:
export PATH=/opt/mysql/bin:$PATH
so that the wsrep_sst_xtrabackup script is found by mysqld.
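For completeness, the whole wrapper boils down to this (a simplified sketch; the joiner's log below shows /etc/mysql/galera2.cnf, so I'm assuming the analogous galera1.cnf path for the first node):

#!/bin/sh
# make sure mysqld finds the wsrep_sst_xtrabackup script shipped in /opt/mysql/bin
export PATH=/opt/mysql/bin:$PATH
# first node (web28); the second node (web90) points at /etc/mysql/galera2.cnf instead
/opt/mysql/bin/mysqld_safe --defaults-file=/etc/mysql/galera1.cnf &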
So, I have two servers:
web28, 192.168.0.29
and web90, 192.168.0.93.
The config goes like this:
web28:
[mysqld]
port = 10001
bind = 192.168.0.29
wsrep_provider = /opt/mysql/lib/libgalera_smm.so
wsrep_provider_options = "gmcast.listen_addr=tcp://192.168.0.29:10101"
wsrep_sst_method = xtrabackup
wsrep_sst_receive_address = 192.168.0.29:10151
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
max_binlog_size = 4M
log-bin = /mysql-bin/galera1/mysql-bin.000000
log-bin-index = /mysql-bin/galera1/mysql-bin.index
server-id = 1
binlog_format = ROW
web90:
[mysqld]
port = 10002
bind = 192.168.0.93
wsrep_provider = /opt/mysql/lib/libgalera_smm.so
wsrep_cluster_address = "gcomm://192.168.0.29:10101"
wsrep_provider_options = “gmcast.listen_addr=tcp://192.168.0.93:10102”;
wsrep_sst_method = xtrabackup
wsrep_sst_receive_address = 192.168.0.93:10152
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
max_binlog_size = 4M
log-bin = /mysql-bin/galera2/mysql-bin.000000
log-bin-index = /mysql-bin/galera2/mysql-bin.index
server-id = 1
binlog_format = ROW
On web28, I start one instance and make it a cluster with:
SET GLOBAL wsrep_cluster_address='gcomm://';
On web90, I start the second instance, which has
wsrep_cluster_address = "gcomm://192.168.0.29:10101"
in its config file (because I had problems joining the instance to the cluster at runtime with SET GLOBAL).
The second node seems to join the cluster fine. It then fires up nc, which receives nothing from the first node. And I can’t find any trace of transfer activity on the first node: the wsrep_sst_xtrabackup script is not run, sst.log is not written, and even the mysql.err log doesn’t mention the script.
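This is roughly how I looked for donor-side SST activity on web28 while the joiner's nc was waiting (the datadir path is taken from the donor's log; the error-log filename is my assumption):

# any SST-related processes running on the donor?
ps aux | grep -E 'wsrep_sst|innobackupex' | grep -v grep
# anything about SST in the donor's error log?
grep -i sst /var/db/mysql/galera1/mysql.err
# has an sst.log appeared in the donor's datadir?
ls -l /var/db/mysql/galera1/sst.log

All of these come back empty, which is what I mean by "no traces of transfer activity".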
The first node’s error log goes like this:
120710 17:35:37 [Note] WSREP: Read nil XID from storage engines, skipping position init
120710 17:35:37 [Note] WSREP: wsrep_load(): loading provider library '/opt/mysql/lib/libgalera_smm.so'
120710 17:35:37 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
120710 17:35:37 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
120710 17:35:37 [Note] WSREP: Reusing existing '/var/db/mysql/galera1//galera.cache'.
120710 17:35:37 [Note] WSREP: Passing config to GCS: base_host = xx.xxx.xxx.xx; gcache.dir = /var/db/mysql/galera1/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/db/mysql/galera1//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://192.168.0.29:10101; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120710 17:35:37 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
120710 17:35:37 [Note] Plugin 'FEDERATED' is disabled.
120710 17:35:37 InnoDB: The InnoDB memory heap is disabled
120710 17:35:37 InnoDB: Mutexes and rw_locks use GCC atomic builtins
120710 17:35:37 InnoDB: Compressed tables use zlib 1.2.3
120710 17:35:37 InnoDB: Using Linux native AIO
120710 17:35:37 InnoDB: Initializing buffer pool, size = 2.2G
120710 17:35:37 InnoDB: Completed initialization of buffer pool
120710 17:35:37 InnoDB: highest supported file format is Barracuda.
120710 17:35:37 InnoDB: Waiting for the background threads to start
120710 17:35:38 Percona XtraDB (http://www.percona.com) 1.1.8-rel25.3 started; log sequence number 1597938
120710 17:35:38 [Note] Recovering after a crash using /mysql-bin/galera1/mysql-bin
120710 17:35:38 [Note] Starting crash recovery...
120710 17:35:38 [Note] Crash recovery finished.
120710 17:35:38 [Note] Server hostname (bind-address): '192.168.0.29'; port: 10001
120710 17:35:38 [Note] - '192.168.0.29' resolves to '192.168.0.29';
120710 17:35:38 [Note] Server socket created on IP: '192.168.0.29'.
120710 17:35:38 [Note] Event Scheduler: Loaded 0 events
120710 17:35:38 [Note] /opt/mysql/bin/mysqld: ready for connections.
Version: '5.5.24-23.6-log' socket: '/var/db/mysql/galera1/mysql.sock' port: 10001 Percona XtraDB Cluster (GPL) 5.5.24-23.6, Revision 342, wsrep_23.6.r342
120710 17:35:45 [Note] WSREP: Stop replication
120710 17:35:47 [Note] WSREP: Start replication
120710 17:35:47 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
120710 17:35:47 [Note] WSREP: protonet asio version 0
120710 17:35:47 [Note] WSREP: backend: asio
120710 17:35:47 [Note] WSREP: GMCast version 0
120710 17:35:47 [Note] WSREP: (286ac061-ca94-11e1-0800-4872d6f9a2d8, 'tcp://192.168.0.29:10101') listening at tcp://192.168.0.29:10101
120710 17:35:47 [Note] WSREP: (286ac061-ca94-11e1-0800-4872d6f9a2d8, 'tcp://192.168.0.29:10101') multicast: , ttl: 1
120710 17:35:47 [Note] WSREP: EVS version 0
120710 17:35:47 [Note] WSREP: PC version 0
120710 17:35:47 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer ''
120710 17:35:47 [Note] WSREP: view(view_id(PRIM,286ac061-ca94-11e1-0800-4872d6f9a2d8,1) memb { 286ac061-ca94-11e1-0800-4872d6f9a2d8,} joined {} left {} partitioned {})
120710 17:35:47 [Note] WSREP: gcomm: connected
120710 17:35:47 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
120710 17:35:47 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
120710 17:35:47 [Note] WSREP: Opened channel 'my_wsrep_cluster'
120710 17:35:47 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
120710 17:35:47 [Note] WSREP: Starting new group from scratch: 286b1abc-ca94-11e1-0800-af44ddd69c3b
120710 17:35:47 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 286b28df-ca94-11e1-0800-df69e22a5b35
120710 17:35:47 [Note] WSREP: STATE EXCHANGE: sent state msg: 286b28df-ca94-11e1-0800-df69e22a5b35
120710 17:35:47 [Note] WSREP: STATE EXCHANGE: got state msg: 286b28df-ca94-11e1-0800-df69e22a5b35 from 0 (web28)
120710 17:35:47 [Note] WSREP: Quorum results: version = 2, component = PRIMARY, conf_id = 0, members = 1/1 (joined/total), act_id = 0, last_appl. = -1, protocols = 0/4/2 (gcs/repl/appl), group UUID = 286b1abc-ca94-11e1-0800-af44ddd69c3b
120710 17:35:47 [Note] WSREP: Flow-control interval: [8, 16]
120710 17:35:47 [Note] WSREP: Restored state OPEN -> JOINED (0)
120710 17:35:47 [Note] WSREP: Member 0 (web28) synced with group.
120710 17:35:47 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
120710 17:35:47 [Note] WSREP: New cluster view: global state: 286b1abc-ca94-11e1-0800-af44ddd69c3b:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 2
120710 17:36:15 [Note] WSREP: declaring 38659211-ca94-11e1-0800-4aa997371ac8 stable
120710 17:36:15 [Note] WSREP: view(view_id(PRIM,286ac061-ca94-11e1-0800-4872d6f9a2d8,2) memb { 286ac061-ca94-11e1-0800-4872d6f9a2d8, 38659211-ca94-11e1-0800-4aa997371ac8,} joined {} left {} partitioned {})
120710 17:36:15 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
120710 17:36:15 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 38ae23a6-ca94-11e1-0800-32a24d285f26
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: sent state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: got state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26 from 0 (web28)
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: got state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26 from 1 (web90)
120710 17:36:15 [Note] WSREP: Quorum results: version = 2, component = PRIMARY, conf_id = 1, members = 1/2 (joined/total), act_id = 0, last_appl. = 0, protocols = 0/4/2 (gcs/repl/appl), group UUID = 286b1abc-ca94-11e1-0800-af44ddd69c3b
120710 17:36:15 [Note] WSREP: Flow-control interval: [12, 23]
120710 17:36:17 [Note] WSREP: Node 1 (web90) requested state transfer from 'any'. Selected 0 (web28)(SYNCED) as donor.
120710 17:36:17 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 0)
And the second node’s looks like this:
120710 17:36:14 [Note] WSREP: Read nil XID from storage engines, skipping position init
120710 17:36:14 [Note] WSREP: wsrep_load(): loading provider library '/opt/mysql/lib/libgalera_smm.so'
120710 17:36:14 [Note] WSREP: wsrep_load(): Galera 2.1(r113) by Codership Oy <info@codership.com> loaded succesfully.
120710 17:36:14 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
120710 17:36:14 [Note] WSREP: Reusing existing '/var/db/mysql/galera2//galera.cache'.
120710 17:36:14 [Note] WSREP: Passing config to GCS: “gmcast.listen_addr = tcp://192.168.0.93:10102”; base_host = xx.xxx.xxx.xx; gcache.dir = /var/db/mysql/galera2/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/db/mysql/galera2//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
120710 17:36:14 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
120710 17:36:14 [Note] WSREP: wsrep_sst_grab()
120710 17:36:14 [Note] WSREP: Start replication
120710 17:36:14 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
120710 17:36:14 [Note] WSREP: protonet asio version 0
120710 17:36:14 [Note] WSREP: backend: asio
120710 17:36:14 [Note] WSREP: GMCast version 0
120710 17:36:14 [Note] WSREP: (38659211-ca94-11e1-0800-4aa997371ac8, 'tcp://0.0.0.0:10101') listening at tcp://0.0.0.0:10101
120710 17:36:14 [Note] WSREP: (38659211-ca94-11e1-0800-4aa997371ac8, 'tcp://0.0.0.0:10101') multicast: , ttl: 1
120710 17:36:14 [Note] WSREP: EVS version 0
120710 17:36:14 [Note] WSREP: PC version 0
120710 17:36:14 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '192.168.0.29:10101'
120710 17:36:15 [Note] WSREP: declaring 286ac061-ca94-11e1-0800-4872d6f9a2d8 stable
120710 17:36:15 [Note] WSREP: view(view_id(PRIM,286ac061-ca94-11e1-0800-4872d6f9a2d8,2) memb { 286ac061-ca94-11e1-0800-4872d6f9a2d8, 38659211-ca94-11e1-0800-4aa997371ac8,} joined {} left {} partitioned {})
120710 17:36:15 [Note] WSREP: gcomm: connected
120710 17:36:15 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
120710 17:36:15 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
120710 17:36:15 [Note] WSREP: Opened channel 'my_wsrep_cluster'
120710 17:36:15 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
120710 17:36:15 [Note] WSREP: Waiting for SST to complete.
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: sent state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: got state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26 from 0 (web28)
120710 17:36:15 [Note] WSREP: STATE EXCHANGE: got state msg: 38ae23a6-ca94-11e1-0800-32a24d285f26 from 1 (web90)
120710 17:36:15 [Note] WSREP: Quorum results: version = 2, component = PRIMARY, conf_id = 1, members = 1/2 (joined/total), act_id = 0, last_appl. = -1, protocols = 0/4/2 (gcs/repl/appl), group UUID = 286b1abc-ca94-11e1-0800-af44ddd69c3b
120710 17:36:15 [Note] WSREP: Flow-control interval: [12, 23]
120710 17:36:15 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 0)
120710 17:36:15 [Note] WSREP: State transfer required: Group state: 286b1abc-ca94-11e1-0800-af44ddd69c3b:0 Local state: 00000000-0000-0000-0000-000000000000:-1
120710 17:36:15 [Note] WSREP: New cluster view: global state: 286b1abc-ca94-11e1-0800-af44ddd69c3b:0, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 2
120710 17:36:15 [Warning] WSREP: Gap in state sequence. Need state transfer.
120710 17:36:17 [Note] WSREP: Running: 'wsrep_sst_xtrabackup 'joiner' '192.168.0.93:10152' '' '/var/db/mysql/galera2/' '/etc/mysql/galera2.cnf' '2944' 2>sst.err'
120710 17:36:17 [Note] WSREP: Prepared SST request: xtrabackup|192.168.0.93:10152/xtrabackup_sst
120710 17:36:17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120710 17:36:17 [Note] WSREP: Assign initial position for certification: 0, protocol version: 2
120710 17:36:17 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (286b1abc-ca94-11e1-0800-af44ddd69c3b): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():439. IST will be unavailable.
120710 17:36:17 [Note] WSREP: Node 1 (web90) requested state transfer from 'any'. Selected 0 (web28)(SYNCED) as donor.
120710 17:36:17 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 0)
120710 17:36:17 [Note] WSREP: Requesting state transfer: success, donor: 0
At this point, nc is running on the second node but receives nothing. I have to kill it with kill -9, and after that the second mysql instance goes down because the transfer failed. The sst.log on the second node looks fine up to that point.
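Next I plan to test raw nc streaming between the boxes by hand, something like this (the port is the one from wsrep_sst_receive_address; /tmp/nc-test.out is just a scratch file):

# on web90 (joiner side), listen the same way the modified script does
nc -l -p 10152 > /tmp/nc-test.out
# on web28 (donor side), push a few bytes at the joiner's SST address
echo hello | nc 192.168.0.93 10152
# if /tmp/nc-test.out ends up containing "hello", the raw transfer path works,
# and the problem is that the donor never launches wsrep_sst_xtrabackup at all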
I should also mention that the first mysql instance can’t shut down cleanly after the failed transfer; I have to kill it with kill -9 as well. That’s the reason for the "crash recovery" entries in the log. I don’t know whether this is connected to the transfer problem in any way.
If I don’t kill the first instance and just start the second one again, the result is different:
120710 16:44:20 [Warning] WSREP: Gap in state sequence. Need state transfer.
120710 16:44:22 [Note] WSREP: Running: 'wsrep_sst_xtrabackup 'joiner' '192.168.0.93:10152' '' '/var/db/mysql/galera2/' '/etc/mysql/galera2.cnf' '1517' 2>sst.err'
120710 16:44:22 [Note] WSREP: Prepared SST request: xtrabackup|192.168.0.93:10152/xtrabackup_sst
120710 16:44:22 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120710 16:44:22 [Note] WSREP: Assign initial position for certification: 0, protocol version: 2
120710 16:44:22 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (f04716a1-ca8a-11e1-0800-83a47ff7bdc5): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():439. IST will be unavailable.
120710 16:44:22 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
120710 17:01:18 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup 'joiner' '192.168.0.93:10152' '' '/var/db/mysql/galera2/' '/etc/mysql/galera2.cnf' '1517' 2>sst.err: 32 (Broken pipe)
120710 17:01:18 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
120710 17:01:18 [ERROR] WSREP: SST failed: 32 (Broken pipe)
120710 17:01:18 [ERROR] Aborting
I don’t know whether this is related to nc not receiving the transfer. I’d be glad to solve the transfer problem first.