Using pt-osc(online schema change) in Multi-Master(Active/Passive) environment

MySQL version : Percona Server for MySQL 5.7.40-43-log We have Master-Master(Active-Passive) Replication setup in our cluster. Only one Master(Active) is receiving the requests for writes while other one is Passive. Each Master then have 4 Replicas to it. There is bi-directional replica going on between Active and Passive Master.

Now, I need to perform an ALTER to a huge table that has ~600M rows and the table size is around : 250G. All the replicas including the Masters(Since they are replicas for each other anyways) have the following flag for replication : Replicate_Wild_Do_Table as data%.% so I am not concerned about the _new, _old table creation while actual execution of the pt-osc because all the tables(%) are included in replication.

Before running the pt-osc on the production, I would like to test it on a staging cluster so I have created a cluster like A → B → C where A is like my Active Master, B is my Passive Master and C is my Replica pointing to B. Please note that I have not set up Master-Master replication between A ↔ B. My main concern is with the replicas that are present in the production cluster as in while doing it on the Master, there shouldn’t be any issues with the replication and the replicas should also have the updated change to the table.

I have used the following flags in the pt-osc statement but I am getting error with recurse & recursion-method

pt-online-schema-change --alter “ADD COLUMN Cost decimal(18,9) NOT NULL DEFAULT ‘0.000000000’”
–alter-foreign-keys-method=auto
–check-slave-lag=50
–max-lag=50
–recurse
–recursion-method=processlist
–host=replica-015
–max-load=Threads_running=30
–critical-load=Threads_running=50
–chunk-time=0.5
–slave-user=repl
–slave-password=‘xxxxxxx’
–chunk-size=10000
–progress percentage,1
–dry-run
–user=test
–password=‘xxxxxxx’
D=avon,t=excel

I tried processlist, hosts but that didn’t worked and I got the following error :

I updated the pt-osc command and used --recurse=1 and --recursion-method=processlist and the message is gone but I don’t completely get it as to what recurse or recursion-method did here since in the out I don’t see any slaves detected vs in some other output I saw online that says no slave detected.

I not completely sure about the DSN table which is mentioned in the percona documentation : pt-online-schema-change — Percona Toolkit Documentation

Can someone help me here in understanding the replication related flag in pt-osc command and how should I proceed with the --dry-run or --execute in the staging cluster environment that I have created.

Thank you in advance.

Percona Toolkit version : percona-toolkit-3.2.0-1.el7.x86_64
MySQL version : Percona-Server-server-57-5.7.40-43.1.el7.x86_64

I have noticed that the --dry-run won’t show you the replica detection part(not sure if it is a bug in this version). I have created dsns table and added entries for the replicas.

I am facing another issue now, I can see that the tool has found 2 slaves but it won’t proceed ahead and gives another error :

I checked the source code and I see :

  • For Perl, I have the following message from the Master Server : Package perl-DBD-MySQL-4.023-6.el7.x86_64 already installed and latest version
  • For Character Set I have the following in the database :
    utf8 UTF-8 Unicode utf8_general_ci 3
  • Connection string does have Host, Database and Port :
    pt-online-schema-change --alter “ADD COLUMN sample decimal(18,9) NOT NULL DEFAULT ‘0.000000000’” \

–alter-foreign-keys-method=auto
D=avon,t=excel
–host=va-db-replica-015
–port=3306
–max-load=Threads_running=30
–check-slave-lag=50
–recursion-method=dsn=D=avon,t=dsns
–max-lag=50
–critical-load=Threads_running=50
–chunk-size=10000
–progress percentage,1
–execute
–user=ptosc
–password=‘xxxxx’

  • User does have required privileges.

Still not sure why am I getting this : Cannot connect to MySQL: DBI connect(‘;host=50;port=3306;mysql_read_default_group=client’,‘ptosc’,…) failed: Can’t connect to MySQL server on ‘50’ (110) at /bin/pt-online-schema-change line 2345.

Show full processlist from the Master does show the replica :
9 repl 10.xx.xx.x:47654 Binlog Dump 13738 Master has sent all binlog to slave; waiting for more updates 13737974 0 0

Show slave hosts does shows the replica:
121 va-db-staging-002@xxx 3306 577 bf3ab07b-a2f1-11ef-8145-002590fdeb10

Can someone please help in debugging this please? How do I fix this : Cannot connect to MySQL: DBI connect(‘;host=50;port=3306;mysql_read_default_group=client’,‘ptosc’,…) failed: Can’t connect to MySQL server on ‘50’ (110) at /bin/pt-online-schema-change line 2345.

Thank you in advance.

Fixed this issue by assigning the correct value for --check-slave-lag option. The check-slave-lag should have the hostname of the replicas where you want to monitor the lag.

However, there is an issue when you pass multiple replica to --check-slave-lag option. For example, I am using the following options :
–recursion-method=dsn=D=avon,t=dsns --check-slave-lag=db1,db2 --max-lag=50 and the output shows :
Found 2 slaves:
db1 → 10.xx.xx.x:3306
db2 → 10.xx.xx.xx:3306
Will check slave lag on:
db2 → 10.xx.xx.xx:3306

This is again an issue with pt-osc. I have mentioned 2 hosts on the --check-slave-lag option and the tool will check slave lag on only one.

Does anyone have any suggestion as to why is this behavior with pt-osc? Do I need to check --recurse option but that is used for the hierarchy( Number of levels to recurse in the hierarchy when discovering replicas.).

How will the pt-osc monitor the slave lag if check-slave-lag option has multiple servers to check lags.

Thank you in advance.

--check-slave-lag

This option overrides the normal behavior of finding and continually monitoring replication lag on ALL connected replicas. If you don’t want to monitor ALL replicas, but you want more than just one replica to be monitored, then use the DSN option to the --recursion-method option instead of this option.

What you are experiencing is precisely documented behavior.

You need to use the DSN table. Create the table then insert additional DSNs into the table for each replica you want to monitor.

INSERT INTO dsnstable (dsn) VALUES ("h=replica1"), ("h=replica2,u=user,p=pass")

I am using the DSN table method only that’s why I was able to see the replicas listed. I do have both the replicas entry in the dsns table.

As you can see above, I have added 2 replicas entry in the dsns table.

The issue is that the tool is monitoring slave lag on only 1 vs 2 as I mentioned in my previous post. How will the pt-osc monitor the slave lag if check-slave-lag option has multiple servers to check lags.

Also, do I need to use any other option to monitor replication lag apart from --max-lag because I purposedly stopped the replication to increase the lag and started the replication and the seconds behind master was ~180 seconds but still the pt-osc was running without pause or any message.