pt-table-checksum complains replica is stopped when replica is not.

Hiya,

I’m having a problem with pt-table-checksum: it reports that the replica server is stopped when, as far as I can tell, the replica is actually running.

The following is the version of the toolkit being used:

[root@master]# rpm -qa | grep percona-toolkit

percona-toolkit-3.0.2-1.el7.x86_64

[root@master]# pt-table-checksum --version

pt-table-checksum 3.0.2

MariaDB version for both master and slave:

[root@master]# rpm -qa | grep -i MariaDB-server

MariaDB-server-10.2.5-1.el7.centos.x86_64

The following is the config file for the tool on the master:

[root@master]# cat /etc/percona-toolkit/pt-table-checksum.conf

binary-index

no-check-binlog-format

recursion-method = processlist

replicate = percona.epa_sum

#replicate-check = TRUE

#replicate-check-only = TRUE

user = slave1

password = <slave1_password>

host = 127.0.0.1

ignore-databases = mysql,information_schema,performance_schema

databases = epa

engines = InnoDB

tables = Persons

The following is from the master server when executing the pt-table-checksum tool:

[root@master ~]# pt-table-checksum

Replica zasperdump.androgogic.local is stopped. Waiting.

Replica zasperdump.androgogic.local is stopped. Waiting.

Replica zasperdump.androgogic.local is stopped. Waiting.

Replica zasperdump.androgogic.local is stopped. Waiting.

Replica zasperdump.androgogic.local is stopped. Waiting.

:

^C# Caught SIGINT.

            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
05-23T13:40:33      0      1        2       1       0 278.173 epa.Persons

The “Replica zasperdump.androgogic.local is stopped. Waiting.” message repeats indefinitely until I interrupt the process.

The following is the slave status for the master instance where the pt-table-checksum program was executed.

MariaDB [(none)]> show slave 'master2' status\G;

*************************** 1. row ***************************

Slave_IO_State: Waiting for master to send event

Master_Host: 192.168.82.23

Master_User: slave1

Master_Port: 3306

Connect_Retry: 10

Master_Log_File: master2-bin.000010

Read_Master_Log_Pos: 3235061

Relay_Log_File: mariadb-relay-bin-master2.000002

Relay_Log_Pos: 2359544

Relay_Master_Log_File: master2-bin.000010

Slave_IO_Running: Yes

Slave_SQL_Running: Yes

Replicate_Do_DB: epa,percona

Replicate_Ignore_DB:

Replicate_Do_Table:

Replicate_Ignore_Table:

Replicate_Wild_Do_Table: epa.%,percona.%

Replicate_Wild_Ignore_Table:

Last_Errno: 0

Last_Error:

Skip_Counter: 0

Exec_Master_Log_Pos: 3235061

Relay_Log_Space: 2359863

Until_Condition: None

Until_Log_File:

Until_Log_Pos: 0

Master_SSL_Allowed: No

Master_SSL_CA_File:

Master_SSL_CA_Path:

Master_SSL_Cert:

Master_SSL_Cipher:

Master_SSL_Key:

Seconds_Behind_Master: 0

Master_SSL_Verify_Server_Cert: No

Last_IO_Errno: 0

Last_IO_Error:

Last_SQL_Errno: 0

Last_SQL_Error:

Replicate_Ignore_Server_Ids:

Master_Server_Id: 2

Master_SSL_Crl:

Master_SSL_Crlpath:

Using_Gtid: Current_Pos

Gtid_IO_Pos: 101-1-7772,0-1-7,102-2-20215,30-3-7953

Replicate_Do_Domain_Ids: 102

Replicate_Ignore_Domain_Ids:

Parallel_Mode: conservative

SQL_Delay: 0

SQL_Remaining_Delay: NULL

Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

1 row in set (0.00 sec)

As may be obvious, the master is a test instance with no traffic on it, and as far as I can discern there is no lag on the slave either.
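To make the "no lag" reading concrete, here is a minimal Python sketch of how a lag value could be pulled out of the vertical status output above. It assumes the \G output has been captured as plain text; the function name seconds_behind_master is made up for illustration and is not part of pt-table-checksum. Note that an empty result set or a NULL value comes back as None, which is a different thing from a lag of 0.

```python
def seconds_behind_master(status_text):
    """Return Seconds_Behind_Master as an int, or None if absent or NULL.

    Hypothetical helper: parses the text of SHOW SLAVE STATUS\\G output.
    """
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Seconds_Behind_Master":
            value = value.strip()
            # NULL (or anything non-numeric) means the lag is unknown.
            return int(value) if value.isdigit() else None
    # No such field at all, e.g. the statement returned no rows.
    return None


sample = """
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
"""
print(seconds_behind_master(sample))  # 0
print(seconds_behind_master(""))      # None
```

With the output I pasted, this gives 0, so whatever the tool is reading must differ from what I see here.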

From the master server:

MariaDB [(none)]> show master status;

+--------------------+----------+--------------+------------------+
| File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| master2-bin.000010 |  3235061 |              |                  |
+--------------------+----------+--------------+------------------+

1 row in set (0.00 sec)

MariaDB [(none)]> select binlog_gtid_pos('master2-bin.000010',3235061);

+-----------------------------------------------+
| binlog_gtid_pos('master2-bin.000010',3235061) |
+-----------------------------------------------+
| 102-2-20215                                   |
+-----------------------------------------------+

1 row in set (0.00 sec)

The Gtid_IO_Pos seems to be perfectly matched on both the master and the slave.
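The comparison I did by eye can be written down as a small Python sketch. It assumes MariaDB's domain-server-sequence GTID format and treats a position as reached when, for each domain in the binlog GTID, the Gtid_IO_Pos list holds the same domain with an equal or higher sequence number. The function names are mine, for illustration only.

```python
def parse_gtid_list(text):
    """Map replication domain -> (server_id, sequence) for a MariaDB GTID list."""
    positions = {}
    for gtid in text.split(","):
        domain, server_id, seq = (int(part) for part in gtid.strip().split("-"))
        positions[domain] = (server_id, seq)
    return positions


def position_reached(binlog_gtid, gtid_io_pos):
    """True if every domain in binlog_gtid appears in gtid_io_pos at >= sequence."""
    wanted = parse_gtid_list(binlog_gtid)
    seen = parse_gtid_list(gtid_io_pos)
    return all(
        domain in seen and seen[domain][1] >= seq
        for domain, (_server_id, seq) in wanted.items()
    )


# Values copied from the outputs above.
print(position_reached("102-2-20215",
                       "101-1-7772,0-1-7,102-2-20215,30-3-7953"))  # True
```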

After a bit of digging into the /bin/pt-table-checksum code, I found the section below, which is where I think it is failing:

my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;
while ( $oktorun->() && @lagged_slaves ) {
    PTDEBUG && _d('Checking slave lag');
    for my $i ( 0..$#lagged_slaves ) {
        my $lag = $get_lag->($lagged_slaves[$i]->{cxn});
        PTDEBUG && _d($lagged_slaves[$i]->{cxn}->name(),
            'slave lag:', $lag);
        if ( !defined $lag || $lag > $max_lag ) {
            $lagged_slaves[$i]->{lag} = $lag;
        }
        else {
            delete $lagged_slaves[$i];
        }
    }

If I changed the first line from

my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;

to

my @lagged_slaves = ();

the program immediately works and returns the expected results.

I’m not sure how the program determines the slave lag, but I suspect it’s missing something and hence printing the “Replica zasperdump.androgogic.local is stopped. Waiting.” message.
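To spell out why I think the lag value is the culprit, here is a simplified Python model of the waiting loop quoted above. It is not the tool's actual code: the function name, the max_checks cap (added so the demo terminates), and the threshold value are assumptions of mine. The only thing taken from the Perl excerpt is the condition that a replica stays in the list while its lag is undefined or above the maximum.

```python
def wait_for_replicas(replicas, get_lag, max_lag=1, max_checks=5):
    """Return the replicas still considered lagged after max_checks polls.

    Simplified model of the quoted loop; max_checks is a demo-only cap.
    """
    lagged = list(replicas)
    checks = 0
    while lagged and checks < max_checks:
        still_lagged = []
        for replica in lagged:
            lag = get_lag(replica)
            # Same test as the Perl: !defined $lag || $lag > $max_lag
            if lag is None or lag > max_lag:
                still_lagged.append(replica)
        lagged = still_lagged
        checks += 1
    return lagged


# A replica reporting lag 0 is released on the first check.
print(wait_for_replicas(["zasperdump"], lambda r: 0))     # []
# A replica whose lag cannot be determined is never released.
print(wait_for_replicas(["zasperdump"], lambda r: None))  # ['zasperdump']
```

If the lag lookup keeps returning an undefined value for my replica, the loop would wait forever, which matches what I see, and it would also explain why starting with an empty list makes the tool proceed.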

Any assistance you can provide to get pt-table-checksum to work properly on my setup without the code hack is deeply appreciated.

Thanks in advance.

Hi,

I am running into the same issue. Have you found a fix for this yet?

Thanks.