Managing a MySQL Fabric setup requires making new replicants fairly frequently. As I set up for production with a 300G database, mysqldump is not a very practical method for creating replicants because, even on 20 cores & 96GB RAM, it takes basically for ever - about 16 hours - to make a new replicant (not to mention the additional time that gets tacked on to that while the new slave catches up to a master that has been updating aggressively for those 16 hours). On the other hand, if I use innobackupex snapshots to create replicants and then add them to the Fabric group, it only takes me about 1 hour total. Pretty sweet.
The problem I’m having is that when the databases are configured for Fabric (which mostly means with sync-binlog = 1 and GTIDs for everything), I can only create a new slave by taking an innobackupex snapshot from the master. It seems to me that slave snapshots should be functionally identical, and are more desirable for obvious practical reasons.
So here, for jrivera, who generously offered to have a look at the problem when I posted about it on Reddit, are some details:
- All Percona servers in this configuration are Linodes of varying sizes set up as Ubuntu 14.04LTS with Percona 72.1 R 0503478.
- they all run the same my.cnf (enclosed) with the innodb buffer pool changed to match the particular server, and the server id changed.
- I’m taking all snapshots using the same script, enclosed.
- When I set up a new server, I simply scp -rp the entire snapshot into a clean /var/lib/mysql, then chown -R mysql:mysql /var/lib/mysql, start the server, and once running I reset master on the new slave and then set gtid_purged to the appropriate value (from xtrabackup_info).
Then, I go to my Fabric controller and add the server. If the snapshot was taken form a slave, I get error 1872: Slave failed to initialize relay log info structure from the repository. On the fabric controller, this error is reported as:
ServerError: Error trying to configure server (0773b58e-f44e-11e4-8143-f23c91185b4c) as slave: Command (START SLAVE , ()) failed accessing (mynewslave:3306). 1872 (HY000): Slave failed to initialize relay log info structure from the repository…
So naturally I try logging on to mynewslave from the Fabric controller with the fabric mysql account, and it works fine. Then I go to mynewslave, repeat the reset master & set grid_purged operation, restart the slave, locally, and get the exact same error.
By contrast, if I do the exact same operation, except taking the innobackupex snapshot from the master, the whole procedure goes off without a hitch:
$mysqlfabric group add mygroup mynewslave
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
uuid finished success result
52602f60-eae6-42c6-b89a-48377124c75a 1 1 1
state success when description
3 2 1430962750.62 Triggered by <mysql.fabric.events.Event object at 0x7f8037995fd0>.
4 2 1430962750.62 Executing action (_add_server).
5 2 1430962751.72 Executed action (_add_server).
Oh - and, in case it helps, when the new slave fails, this is what its status looks like:
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: myoldmaster
Master_User: fabric
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: relay-bin.000010
Relay_Log_Pos: 603177759
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1872
Last_Error: Slave failed to initialize relay log info structure from the repository
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 0
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1872
Last_SQL_Error: Slave failed to initialize relay log info structure from the repository
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID: f43a8a6b-c136-11e4-b41c-f23c9133c356
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 150506 19:18:58
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: f43a8a6b-c136-11e4-b41c-f23c9133c356:1-150207153
Auto_Position: 1
===============
The attachment tool (paper clip) does not appear to be working in Safari 7.5.1 under OSX 10.9.5, so here’s the relevant configurations in-line:
#========= my.cnf:=========#
[mysql]
CLIENT
port = 3306
socket = /var/lib/mysql/mysql.sock
[mysqld]
GENERAL
user = mysql
default-storage-engine = InnoDB
socket = /var/lib/mysql/mysql.sock
pid-file = /var/lib/mysql/mysql.pid
MyISAM
key-buffer-size = 32M
myisam-recover = FORCE,BACKUP
SAFETY
max-allowed-packet = 16M
max-connect-errors = 1000000
sysdate-is-now = 1
innodb = FORCE
DATA STORAGE
datadir = /var/lib/mysql/
BINARY LOGGING
log-bin = /var/lib/mysql/mysql-bin
expire-logs-days = 14
sync-binlog = 1
REPLICATION
server-id = 33
read-only = 0
#skip-slave-start = 1
log-slave-updates = 1
relay-log = /var/lib/mysql/relay-bin
slave-net-timeout = 60
sync-master-info = 1
sync-relay-log = 1
sync-relay-log-info = 1
#binlog-do-db = serpsalpha
binlog-format = ‘ROW’
#Fabric:
#skip-slave-start (=0), binlog-do-db (empty), & binlog-format (ROW) above are changed for Fabric.
#additionally, sync-master-info must be on for Fabric, and the following:
gtid-mode = ‘ON’
enforce-gtid-consistency = ‘ON’
simplified_binlog_gtid_recovery = ‘ON’ #Note: this variable has been changing name a lot. It can fail under special circumstances. See details at http://v.gd/2XcOhf
master-info-repository = TABLE
relay-log-info-repository = TABLE
#report-host= might want to set this for fabric
#report-port #no need to set, as it’s reported correctly as of 5.5 or 5.6…
CACHES AND LIMITS
tmp-table-size = 32M
max-heap-table-size = 32M
query-cache-type = 0
query-cache-size = 0
max-connections = 500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 1024
table-open-cache = 55
INNODB
innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 512M
innodb-flush-log-at-trx-commit = 2
innodb-file-per-table = 1
innodb-buffer-pool-size = 14G
innodb-rollback-on-timeout = ON
transaction-isolation = READ-COMMITTED
low-priority-updates = 1
LOGGING
log-error = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes = 0
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
long-query-time = 10
#========= snapshot.sh:=========#
#!/bin/bash
if [ $SHLVL -eq 1 ]
then
echo “Please run $0 from within a screen… it just makes sense.” >&2
exit 1
fi
cd /var/backups/mysql/daily_snapshot
echo
echo “Please enter mysql root password:”
read -s p
echo
mysql -h 127.0.0.1 -u root -p$p -ABNe “select NULL;” 2>/dev/null >/dev/null
if [ $? -ne 0 ]
then
echo “Wrong root password.” >&2
exit 2
fi
echo
echo “Starting at $(date)”
#snapshot
SECONDS=0
sdate=$(date +%Y-%m-%d_%H)
date | tee -a /var/log/daily_snapshot_$sdate.log
innobackupex --host=127.0.0.1 --user=root --password=$p
–rsync --safe-slave-backup --slave-info /var/backups/mysql/daily_snapshot
2>&1 | egrep “innobackupex|xtrabackup” | tee -a /var/log/daily_snapshot_$sdate.log
sdir=$(ls -1 | egrep “^$sdate”)
innobackupex --host=127.0.0.1 --user=root --password=$p
–apply-log /var/backups/mysql/daily_snapshot/$( ls -1 /var/backups/mysql/daily_snapshot/ | grep $sdir)
2>&1 | egrep “innobackupex|xtrabackup” | tee -a /var/log/daily_snapshot_$sdate.log
echo “daily snapshot complete at $(date) after $SECONDS sec” | tee -a /var/log/daily_snapshot_$sdate.log