pt-online-schema-change changes IP of my host

Hello!

I am using pt-online-schema-change to rebuild a table on an AWS Aurora 2.x (MySQL 5.7-compatible) database. It seems to work against my staging database, but when I run the following command against my production database, it partially works and then fails with an error like this:

# trying to do a table repair
> pt-online-schema-change \
        --alter "FORCE" \
        D=$DATABASE,t=$TABLE \
        --host=$HOST \
        --user=$USER \
        --password=$PASSWORD \
        --execute

Cannot connect to h=10.XXX.XXX.XXX,p=...,u=awsmaster: DBI connect(';host=10.XXX.XXX.XXX;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on '10.XXX.XXX.XXX' (111) at /usr/bin/pt-online-schema-change line 2345.

What’s strange is that the IP doesn’t match my host. Even when I set --host to a hard-coded IP that I know is my database, the initial table creation works (which requires a connection), but the run fails at the row-copy step. Here’s more complete output of this operation, run from a bastion host that has privileges on my database:

Running pt-online-schema-change on myschema.mytable
Cannot connect to h=10.XXX.XXX.XXX,p=...,u=awsmaster: DBI connect(';host=10.XXX.XXX.XXX;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on '10.XXX.XXX.XXX' (111) at /usr/bin/pt-online-schema-change line 2345.

No slaves found.  See --recursion-method if host ip-XXX-XXX-XXX-XXX has slaves.
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Altering `myschema`.`mytable`...
Creating new table...
Created new table myschema._mytable_new OK.
Altering new table...
Altered `myschema`.`_mytable_new` OK.
2023-08-16T20:32:57 Creating triggers...
2023-08-16T20:32:57 Created triggers OK.
2023-08-16T20:32:57 Copying approximately 3205 rows...
Cannot connect to h=10.XXX.XXX.XXX,p=...,u=awsmaster: DBI connect(';host=10.XXX.XXX.XXX;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on '10.XXX.XXX.XXX' (111) at /usr/bin/pt-online-schema-change line 2345.

2023-08-16T20:32:57 Dropping triggers...
2023-08-16T20:32:57 Dropped triggers OK.
2023-08-16T20:32:57 Dropping new table...
2023-08-16T20:32:57 Dropped new table OK.
`myschema`.`mytable` was not altered.
        (in cleanup) 2023-08-16T20:32:57 Error copying rows from `myschema`.`mytable` to `myschema`.`_mytable_new`: Threads_running=18446744073665823305 exceeds its critical threshold 200
2023-08-16T20:32:57 Dropping triggers...
2023-08-16T20:32:57 Dropped triggers OK.
`myschema`.`mytable` was not altered.
Completed myschema.mytable
All tables processed!

My version of Percona Toolkit:

$ pt-online-schema-change --version
pt-online-schema-change 3.5.4

Has anyone else run into this?

Try running pt-osc with PTDEBUG=1 before it to produce debug output and see what is happening in more detail.
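
The debug output is long and is written to STDERR, so it can help to capture everything in one file and then pull out only the lines about connections and replica discovery. A minimal sketch, assuming a POSIX shell and grep; the file name ptosc-debug.log and the helper name filter_ptdebug are made up for illustration, and the patterns are just toolkit module names (MasterSlave, DSNParser, Cxn) plus the "Cannot connect" message from the error:

```shell
#!/bin/sh
# Capture everything (debug output goes to STDERR) into one file.
# Options are elided here; use the same ones as in your normal run.
#   PTDEBUG=1 pt-online-schema-change ... > ptosc-debug.log 2>&1

# filter_ptdebug FILE (hypothetical helper): print, with line numbers, only
# the lines of a captured log that concern connections and replica discovery.
filter_ptdebug() {
  grep -nE 'Cannot connect|MasterSlave:|DSNParser:|Cxn:' "$1"
}

# Usage: filter_ptdebug ptosc-debug.log
```

Debug output can include the connection password in clear text, so treat the captured file as sensitive.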

Thanks so much for your response! Really useful to know about the debug flag!

So this is the command I ran:

PTDEBUG=1 pt-online-schema-change --alter "FORCE" D=my_schema,t=test_ken --host=my_host.cluster-myclusterid.my-aws-region.rds.amazonaws.com --user=awsmaster --password=my_password --execute

Here are some highlights from the output:

L504 - noting this as one of the last places where I see the domain name before IPs start being used
#  Connection info: my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com via TCP/IP Character set info: $VAR1 = [

L540 - first instance of an AWS IP-based hostname being used instead of the original domain name
# Cxn:3993 17044 DBI::db=HASH(0x2458b60) hostname: ip-XXX-XX-X-69 154374235
# Cxn:3979 17044 DBI::db=HASH(0x2458b60) Connected dbh to ip-XXX-XX-X-69 h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com

L661 - finds "slaves" and fails to connect to them, but this seems OK?
# MasterSlave:4389 17044 Found 2 slaves
# MasterSlave:4366 17044 Recursing from D=my_schema,h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com,p=...,t=test_ken,u=awsmaster to h=XX.XXX.XX.134,p=...,u=awsmaster
# MasterSlave:4300 17044 Recursion methods: processlist hosts
# DSNParser:2294 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client   awsmaster   my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client   awsmaster   my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
...
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
No slaves found.  See --recursion-method if host ip-XXX-XX-X-69 has slaves.

L778 - so the original connection works
Created new table my_schema._test_ken_new OK.

L823 - creating triggers works too
2023-08-18T15:44:50 Created triggers OK.

L882 - now slaves are found by processlist
# MasterSlave:4384 17044 Finding slaves with _find_slaves_by_processlist
# MasterSlave:4455 17044 DBI::db=HASH(0x2458b60) SHOW GRANTS FOR CURRENT_USER()
# MasterSlave:4485 17044 DBI::db=HASH(0x2458b60) SHOW FULL PROCESSLIST
# DSNParser:2181 17044 Parsing h=XX.XXX.XX.134

L921
# MasterSlave:4389 17044 Found 2 slaves

L927 - but this time the connection failure surfaces in the output and appears to trigger the overall failure
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.

# MasterSlave:4366 17044 Recursing from D=my_schema,h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com,p=...,t=test_ken,u=awsmaster to h=XX.XXX.XX.134,p=...,u=awsmaster
# MasterSlave:4300 17044 Recursion methods: processlist hosts
# DSNParser:2294 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client   awsmaster   my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client   awsmaster   my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.

# ReplicaLagWaiter:5144 17044 All slaves caught up

L940 -- cleanup starts somewhere around here
# CleanupTask:6721 17044 Calling cleanup task CODE(0x25839b8

It seems like dealing with slaves is what causes the failure. I’m wondering why slave activity is needed at all. This is AWS Aurora, so I’d have expected replication to be handled by MySQL and the managed Aurora configuration. I’m not sure I’m interpreting this correctly.

I should note: if I simply hard-code my host name around line 2345 of /usr/bin/pt-online-schema-change (the line the error message references), things work.

(I also tried running with --skip-check-slave-lag and --nocheck-replication-filters, but I get the same connection errors.)

Most Percona Toolkit tools will attempt to monitor slave lag and will pause their activity if lag gets too high. None of the tools directly execute SQL onto slaves.

You need to use --recursion-method=NONE to disable this entirely.
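
For reference, the command from the question with replica discovery turned off might look like the sketch below. It assumes the same $DATABASE, $TABLE, $HOST and $USER shell variables as the original command, and it spells the value in lowercase, none, as the tool's documentation does. The ptosc_cmd helper is hypothetical: it only prints the command line so it can be reviewed before anything is run, and it swaps --password for --ask-pass so that no password ends up on screen.

```shell
#!/bin/sh
# ptosc_cmd (hypothetical helper): print the pt-online-schema-change command
# with replica discovery disabled. It prints only; it does not run the tool.
# --recursion-method=none tells the tool not to look for replicas at all.
# --ask-pass makes the tool prompt for the password when it is actually run.
ptosc_cmd() {
  printf '%s ' pt-online-schema-change \
    --alter FORCE \
    --recursion-method=none \
    "D=${DATABASE},t=${TABLE}" \
    --host="${HOST}" \
    --user="${USER}" \
    --ask-pass \
    --execute
  printf '\n'
}

ptosc_cmd
```

Printing first keeps the sketch harmless; once the line looks right, it can be pasted into the shell as-is.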

Thanks again, matthewb!
--recursion-method=NONE is a good tip. Thank you.

If I figure out why the slaves are not reachable, I’ll update this thread.
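
One possible starting point: the (111) in the error is the Linux errno for "connection refused", so a plain TCP check from the bastion host can show whether anything accepts connections at the address pt-osc tried. A minimal sketch, assuming bash and the coreutils timeout command are installed; can_connect is a hypothetical helper, and port 3306 is only the MySQL default, not something this thread confirms:

```shell
#!/bin/sh
# can_connect HOST [PORT] (hypothetical helper): succeed if a TCP connection
# to HOST:PORT can be opened within 3 seconds. Relies on bash's /dev/tcp
# redirection, so bash is invoked explicitly. PORT defaults to 3306, which
# is an assumption (the MySQL default).
can_connect() {
  timeout 3 bash -c ': < "/dev/tcp/$1/$2"' _ "$1" "${2:-3306}" 2>/dev/null
}

# Usage, with the address from the error message:
#   can_connect XX.XXX.XX.134 && echo reachable || echo "refused or timed out"
```

A failure here only says that the address does not accept connections from this host; it does not explain why pt-osc picked that address.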