Background:
We have a five-server Percona Server group, using ROW-based binlog replication to keep four standby servers up to date with changes on our master. In the process of chasing down a bug, we noticed that our replicas had fallen out of sync with the master somewhere along the line, and used pt-table-checksum and pt-table-sync to bring them back into sync.
There are a number of databases whose names follow the pattern “liveNN” for some number. In each database is a table named “user”, with identical structure but differing data. These tables all have the schema
Create Table: CREATE TABLE `user` (
`userid` int(11) NOT NULL AUTO_INCREMENT,
`systemid` int(11) NOT NULL,
`username` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`password` varbinary(50) DEFAULT NULL,
`level` smallint(6) NOT NULL DEFAULT '0',
`fname` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`lname` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`last_login` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`num_logins` int(11) NOT NULL DEFAULT '0',
`signup_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`email` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`ownerid` int(11) NOT NULL DEFAULT '0',
`organization` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`p_street` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`p_city` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`p_province` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`p_country` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`p_code` varchar(13) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`bus_phone` varchar(25) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`mob_phone` varchar(25) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`fax` varchar(25) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_street` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_city` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_province` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_country` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_code` varchar(13) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`active` int(1) NOT NULL DEFAULT '1',
`setup` tinyint(4) unsigned DEFAULT '0',
`notified` tinyint(3) unsigned DEFAULT '0',
`p_street2` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`s_street2` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`home_phone` varchar(25) COLLATE utf8_unicode_ci DEFAULT NULL,
`rate` decimal(25,8) DEFAULT NULL,
`note` mediumtext COLLATE utf8_unicode_ci,
`support_order` tinyint(4) NOT NULL DEFAULT '0',
`sort_department` varchar(30) COLLATE utf8_unicode_ci DEFAULT '',
`individual` tinyint(4) NOT NULL DEFAULT '0',
`archive` tinyint(4) DEFAULT '0',
`sort_gmail` varchar(30) COLLATE utf8_unicode_ci DEFAULT '',
`pref_email` tinyint(4) DEFAULT '1',
`pref_gmail` tinyint(4) DEFAULT '0',
`timesheetDate` varchar(4) COLLATE utf8_unicode_ci DEFAULT '2',
`ext_systemid` int(11) DEFAULT NULL,
`sort_contractors` varchar(30) COLLATE utf8_unicode_ci DEFAULT NULL,
`mapped_taskid` int(11) DEFAULT NULL,
`updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`api_token` char(32) COLLATE utf8_unicode_ci DEFAULT NULL,
`language` varchar(12) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'en',
`currency_code` char(3) COLLATE utf8_unicode_ci DEFAULT NULL,
`vat_name` varchar(30) COLLATE utf8_unicode_ci DEFAULT NULL,
`vat_number` varchar(30) COLLATE utf8_unicode_ci DEFAULT NULL,
`allow_late_notifications` tinyint(1) NOT NULL DEFAULT '1',
`allow_late_fees` tinyint(1) NOT NULL DEFAULT '1',
`company_size` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`company_industry` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`systemid`,`userid`),
UNIQUE KEY `username` (`systemid`,`username`),
KEY `level_idx` (`systemid`,`level`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
(Yes, yes, MyISAM, I know. There’s a plan for that. Bear with me.)
The weirdness:
After running
pt-table-sync \
--tables 'user' \
--databases 'live,live15,live16,live17,live18,live19,live2,live21,live3,live5,live6,live7,live8,live9' \
--user 'percona-toolkit' \
--replicate 'percona.checksums' \
--check-master \
--check-slave \
--wait 5 \
--execute \
--verbose \
--sync-to-master \
h=172.30.30.5
to bring the “user” table back into sync on one replica, we ran
pt-table-checksum \
--quiet \
--nocheck-replication-filters \
--no-check-binlog-format \
--tables 'user' \
--user percona-toolkit \
--host db-master
three times to validate that the affected tables were in sync. The results were, shall we say, surprising:
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
06-14T13:17:48 0 2 220253 8 0 5.868 live22.user
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
06-14T13:47:28 0 2 141113 6 0 9.036 live24.user
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
06-14T14:13:36 0 2 522900 17 0 12.887 live9.user
The question:
What the heck is going on here? Why are tables becoming out of sync, and then becoming in sync again without further action? Will using InnoDB help, leaving aside all of the other very good reasons to switch ASAP?