
pt-table-checksum complains replica is stopped when replica is not.

jzc
Hiya,

I'm having some problems with pt-table-checksum: it reports that the replica server is stopped when, as far as I can tell, the replica is actually running.

The following is the version of the toolkit being used:

[[email protected]]# rpm -qa | grep percona-toolkit
percona-toolkit-3.0.2-1.el7.x86_64

[[email protected]]# pt-table-checksum --version
pt-table-checksum 3.0.2

MariaDB version for both master and slave:

[[email protected]]# rpm -qa | grep -i MariaDB-server
MariaDB-server-10.2.5-1.el7.centos.x86_64

The following is the config file for the tool on the master:

[[email protected]]# cat /etc/percona-toolkit/pt-table-checksum.conf
binary-index
no-check-binlog-format
recursion-method = processlist
replicate = percona.epa_sum
#replicate-check = TRUE
#replicate-check-only = TRUE
user = slave1
password = <slave1_password>
host = 127.0.0.1
ignore-databases = mysql,information_schema,performance_schema
databases = epa
engines = InnoDB
tables = Persons

The following is from the master server when executing the pt-table-checksum tool:

[[email protected] ~]# pt-table-checksum
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
Replica zasperdump.androgogic.local is stopped. Waiting.
:
^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
05-23T13:40:33      0      1        2       1       0 278.173 epa.Persons

The "Replica zasperdump.androgogic.local is stopped. Waiting." message repeats indefinitely until I interrupt the process.

The following is the slave status for the 'master2' connection, i.e. the master instance where pt-table-checksum was executed.

MariaDB [(none)]> show slave 'master2' status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.82.23
Master_User: slave1
Master_Port: 3306
Connect_Retry: 10
Master_Log_File: master2-bin.000010
Read_Master_Log_Pos: 3235061
Relay_Log_File: mariadb-relay-bin-master2.000002
Relay_Log_Pos: 2359544
Relay_Master_Log_File: master2-bin.000010
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: epa,percona
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table: epa.%,percona.%
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 3235061
Relay_Log_Space: 2359863
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 101-1-7772,0-1-7,102-2-20215,30-3-7953
Replicate_Do_Domain_Ids: 102
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
1 row in set (0.00 sec)

As may be obvious, the master is a test instance with no traffic on it, and as far as I can discern there is no lag on the slave either.

From the master server:

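MariaDB [(none)]> show master status;
+--------------------+----------+--------------+------------------+
| File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+--------------------+----------+--------------+------------------+
| master2-bin.000010 |  3235061 |              |                  |
+--------------------+----------+--------------+------------------+
1 row in set (0.00 sec)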

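MariaDB [(none)]> select binlog_gtid_pos('master2-bin.000010',3235061);
+-----------------------------------------------+
| binlog_gtid_pos('master2-bin.000010',3235061) |
+-----------------------------------------------+
| 102-2-20215                                   |
+-----------------------------------------------+
1 row in set (0.00 sec)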

The Gtid_IO_Pos seems to be perfectly matched on both the master and the slave.
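
The match can also be checked mechanically. Below is a minimal Python sketch, assuming MariaDB's domain-server-sequence GTID format and using only the two values shown above; the helper name gtid_by_domain is made up for illustration.

```python
def gtid_by_domain(gtid_list):
    """Map replication domain id -> (server_id, sequence) for a MariaDB GTID list."""
    positions = {}
    for gtid in gtid_list.split(","):
        domain, server, seq = (int(part) for part in gtid.strip().split("-"))
        positions[domain] = (server, seq)
    return positions


# Values copied from the output above.
replica_io_pos = "101-1-7772,0-1-7,102-2-20215,30-3-7953"  # Gtid_IO_Pos on the replica
master_pos = "102-2-20215"                                 # binlog_gtid_pos() on the master

replica = gtid_by_domain(replica_io_pos)
master = gtid_by_domain(master_pos)

# The replica has caught up in a domain when its position equals the master's.
caught_up = all(replica.get(domain) == pos for domain, pos in master.items())
print(caught_up)
```

With the values from this post the check prints True: in domain 102 both sides are at server 2, sequence 20215.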




After a bit of digging into the /bin/pt-table-checksum code, I found the code below, which is where I think it is failing:

my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;
while ( $oktorun->() && @lagged_slaves ) {
   PTDEBUG && _d('Checking slave lag');
   for my $i ( 0..$#lagged_slaves ) {
      my $lag = $get_lag->($lagged_slaves[$i]->{cxn});
      PTDEBUG && _d($lagged_slaves[$i]->{cxn}->name(),
         'slave lag:', $lag);
      if ( !defined $lag || $lag > $max_lag ) {
         $lagged_slaves[$i]->{lag} = $lag;
      }
      else {
         delete $lagged_slaves[$i];
      }
   }

If I change the first line from

my @lagged_slaves = map { {cxn=>$_, lag=>undef} } @$slaves;

to

my @lagged_slaves = ();

the program immediately works and returns the expected results.
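
One reading of the quoted loop, sketched in Python rather than Perl for brevity: a replica stays in the wait list when its lag is undefined or above the maximum, so an empty list leaves nothing to wait on. The function name still_waiting and the max_lag value of 1 are assumptions for illustration, not taken from the tool.

```python
def still_waiting(lags, max_lag):
    """Return the replicas that the quoted condition would keep waiting on.

    lags maps a replica name to its measured lag in seconds, or to None
    when the lag could not be determined (Perl's undef).
    """
    return [name for name, lag in lags.items() if lag is None or lag > max_lag]


host = "zasperdump.androgogic.local"

print(still_waiting({host: 0}, max_lag=1))     # lag known and small: nothing to wait on
print(still_waiting({host: None}, max_lag=1))  # lag undefined: replica stays in the list
print(still_waiting({}, max_lag=1))            # empty list, as in the one-line change
```

If this reading is right, the one-line change works only because it skips the lag check entirely, so it would also hide a replica that really is stopped or lagging.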




I'm not sure how the program determines the slave lag, but I suspect it's missing something, hence the "Replica zasperdump.androgogic.local is stopped. Waiting." message.
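
How $get_lag obtains its value is not shown in the excerpt. A common way to measure lag is to read Seconds_Behind_Master from SHOW SLAVE STATUS, and the Python sketch below assumes that approach; the function name is hypothetical and this is not the tool's actual code. It illustrates how a status query that returns no row would yield an undefined lag even though replication is healthy. That case may be worth ruling out here, since the working query above names the 'master2' connection explicitly.

```python
def seconds_behind_master(status_text):
    """Extract Seconds_Behind_Master from vertical-format SHOW SLAVE STATUS text.

    Returns an int, or None when the field is absent or its value is NULL.
    """
    for line in status_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "Seconds_Behind_Master":
            value = value.strip()
            return None if value in ("", "NULL") else int(value)
    return None


# Fragment of the status shown above for the named connection.
named_connection = "Slave_IO_Running: Yes\nSeconds_Behind_Master: 0\n"
# Hypothetical case: the status query returned no row at all.
no_rows = ""

print(seconds_behind_master(named_connection))
print(seconds_behind_master(no_rows))
```

The first call yields 0 and the second yields None, which a caller comparing against a maximum lag would have to treat as "unknown" rather than "caught up".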




Any assistance you can provide to get pt-table-checksum working properly on my setup without the code hack would be deeply appreciated.

Thanks in advance.

Comments

  • adityakamana
    Hi,

    I am running into the same issue. Have you found a fix for this yet?

    Thanks.

MySQL, InnoDB, MariaDB and MongoDB are trademarks of their respective owners.
Copyright ©2005 - 2020 Percona LLC. All rights reserved.