Hiya,
I’m having a problem with pt-table-checksum: it reports that the replica server is stopped when, as far as I can tell, the replica is actually running.
The following is the version of the toolkit being used:
[root@master]# rpm -qa | grep percona-toolkit
percona-toolkit-3.0.2-1.el7.x86_64
[root@master]# pt-table-checksum --version
pt-table-checksum 3.0.2
The MariaDB version on both the master and the slave:
[root@master]# rpm -qa | grep -i MariaDB-server
MariaDB-server-10.2.5-1.el7.centos.x86_64
The following is the config file for the tool on the master:
[root@master]# cat /etc/percona-toolkit/pt-table-checksum.conf
binary-index
no-check-binlog-format
recursion-method = processlist
replicate = percona.epa_sum
#replicate-check = TRUE
#replicate-check-only = TRUE
user = slave1
password = <slave1_password>
host = 127.0.0.1
ignore-databases = mysql,information_schema,performance_schema
databases = epa
engines = InnoDB
tables = Persons
The following is from the master server when executing the pt-table-checksum tool:
[root@master ~]# pt-table-checksum
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
:
^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
05-23T13:40:33      0      1        2       1       0 278.173 epa.Persons
The “Replica zasperdump.androgogic.local is stopped. Waiting.” message repeats indefinitely until I interrupt the process.
The following is the slave status for the master instance where the pt-table-checksum program was executed.
MariaDB [(none)]> show slave 'master2' status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.82.23
Master_User: slave1
Master_Port: 3306
Connect_Retry: 10
Master_Log_File: master2-bin.000010
Read_Master_Log_Pos: 3235061
Relay_Log_File: mariadb-relay-bin-master2.000002
Relay_Log_Pos: 2359544
Relay_Master_Log_File: master2-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: epa,percona
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: epa.%,percona.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 3235061
Relay_Log_Space: 2359863
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 101-1-7772,0-1-7,102-2-20215,30-3-7953
Replicate_Do_Domain_Ids: 102
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
1 row in set (0.00 sec)
As may be obvious, the master is a test instance with no traffic on it, and as far as I can discern there is no lag on the slave either.
From the master server:
MariaDB [(none)]> show master status;
+--------------------+----------+--------------+------------------+
| File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| master2-bin.000010 |  3235061 |              |                  |
+--------------------+----------+--------------+------------------+
1 row in set (0.00 sec)
MariaDB [(none)]> select binlog_gtid_pos('master2-bin.000010',3235061);
+-----------------------------------------------+
| binlog_gtid_pos('master2-bin.000010',3235061) |
+-----------------------------------------------+
| 102-2-20215                                   |
+-----------------------------------------------+
1 row in set (0.00 sec)
The Gtid_IO_Pos seems to be perfectly matched on both the master and the slave.
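For anyone who wants to cross-check the positions another way, here is a rough sketch of the queries I would use. Both are standard MariaDB GTID system variables; which host each one runs on is my assumption about how to compare them, and I have not included their output here.

```sql
-- On the master (master2): GTID of the last event group written to its binary log,
-- one entry per replication domain.
SELECT @@gtid_binlog_pos;

-- On the replica: GTID of the last event group applied by the slave threads,
-- one entry per replication domain.
SELECT @@gtid_slave_pos;
```

If replication is caught up, I would expect the entry for domain 102 to be the same in both results.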
After a bit of digging into the /bin/pt-table-checksum code, I found the excerpt below, which is where I think it is getting stuck:
my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;
while ( $oktorun->() && @lagged_slaves ) {
   PTDEBUG && _d('Checking slave lag');
   for my $i ( 0..$#lagged_slaves ) {
      my $lag = $get_lag->($lagged_slaves[$i]->{cxn});
      PTDEBUG && _d($lagged_slaves[$i]->{cxn}->name(),
         'slave lag:', $lag);
      if ( !defined $lag || $lag > $max_lag ) {
         $lagged_slaves[$i]->{lag} = $lag;
      }
      else {
         delete $lagged_slaves[$i];
      }
   }
If I changed the first line from
my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;
to
my @lagged_slaves = ();
the program immediately works and returns the expected results.
I’m not sure how the program determines the slave lag, but I suspect it’s missing something: in the excerpt above an undefined lag keeps the replica in the list, which would explain the endless “Replica zasperdump.androgogic.local is stopped. Waiting.” message.
Any assistance you can provide to get pt-table-checksum working properly on my setup without the code hack would be deeply appreciated.
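One guess I can test: my replication uses a named connection ('master2'), and in MariaDB a plain SHOW SLAVE STATUS only reports the default connection. If the tool reads the lag from the plain form (an assumption on my part, I have not confirmed it in the source), it would get nothing back and treat the lag as undefined. A minimal sketch of the comparison, to be run on the replica, assuming it uses the same connection name as my master does:

```sql
-- Run on the replica (zasperdump). 'master2' is the connection name from the
-- status output in this post; that the replica uses the same name is an assumption.

-- 1. Plain form: reports only the default connection, so with a named
--    connection this may return an empty set.
SHOW SLAVE STATUS\G

-- 2. Named form: should show Slave_IO_Running, Slave_SQL_Running and
--    Seconds_Behind_Master for that connection.
SHOW SLAVE 'master2' STATUS\G

-- 3. All connections at once, to see which names actually exist.
SHOW ALL SLAVES STATUS\G

-- 4. Make the named connection the default for this session only,
--    then retry the plain form.
SET @@default_master_connection = 'master2';
SHOW SLAVE STATUS\G
```

If step 1 returns an empty set while step 2 shows a healthy connection, that would at least be consistent with the behaviour I am seeing.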
Thanks in advance.