Thanks so much for your response! Really useful to know about the debug flag!
This is the command I ran:
PTDEBUG=1 pt-online-schema-change --alter "FORCE" D=my_schema,t=test_ken --host=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com --user=awsmaster --password=my_password --execute
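For anyone reproducing this: PTDEBUG output is written to STDERR, so it has to be redirected to be captured. Below is a minimal sketch for pulling the slave-discovery and connection-failure lines out of a captured log together with their line numbers. The function name ptdebug_highlights and the file name ptosc_debug.log are made up for illustration, and the pattern only covers the message types quoted in this post.

```shell
# Hypothetical helper: print the lines of a captured PTDEBUG log that relate
# to slave discovery or connection failures, prefixed with their line numbers.
ptdebug_highlights() {
  grep -nE 'MasterSlave|Cannot connect|No slaves found' "$1"
}

# Usage (2>&1 is needed because the debug output goes to STDERR):
#   PTDEBUG=1 pt-online-schema-change ... --execute > ptosc_debug.log 2>&1
#   ptdebug_highlights ptosc_debug.log
```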
Here are some highlights from the output (the L-numbers are line numbers in the output):
L504 - noting this as one of the last places where I see the domain name before IPs start being used
# Connection info: my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com via TCP/IP Character set info: $VAR1 = [
L540 - first instance of an AWS IP-based hostname being used instead of the original domain name
# Cxn:3993 17044 DBI::db=HASH(0x2458b60) hostname: ip-XXX-XX-X-69 154374235
# Cxn:3979 17044 DBI::db=HASH(0x2458b60) Connected dbh to ip-XXX-XX-X-69 h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com
L661 - finds "slaves", fails to connect, but this seems ok?
# MasterSlave:4389 17044 Found 2 slaves
# MasterSlave:4366 17044 Recursing from D=my_schema,h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com,p=...,t=test_ken,u=awsmaster to h=XX.XXX.XX.134,p=...,u=awsmaster
# MasterSlave:4300 17044 Recursion methods: processlist hosts
# DSNParser:2294 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client awsmaster my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client awsmaster my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
...
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
No slaves found. See --recursion-method if host ip-XXX-XX-X-69 has slaves.
L778 - so the original connection works
Created new table my_schema._test_ken_new OK.
L823 - creating triggers works too
2023-08-18T15:44:50 Created triggers OK.
L882 - now slaves are found by processlist
# MasterSlave:4384 17044 Finding slaves with _find_slaves_by_processlist
# MasterSlave:4455 17044 DBI::db=HASH(0x2458b60) SHOW GRANTS FOR CURRENT_USER()
# MasterSlave:4485 17044 DBI::db=HASH(0x2458b60) SHOW FULL PROCESSLIST
# DSNParser:2181 17044 Parsing h=XX.XXX.XX.134
L921
# MasterSlave:4389 17044 Found 2 slaves
L927 - but this time the connection failure surfaces in the output and appears to trigger the overall failure
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
# MasterSlave:4366 17044 Recursing from D=my_schema,h=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com,p=...,t=test_ken,u=awsmaster to h=XX.XXX.XX.134,p=...,u=awsmaster
# MasterSlave:4300 17044 Recursion methods: processlist hosts
# DSNParser:2294 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client awsmaster my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
# DSNParser:2343 17044 DBI:mysql:;host=XX.XXX.XX.134;mysql_read_default_group=client awsmaster my_password mysql_enable_utf8=>0, ShowErrorStatement=>1, AutoCommit=>1, RaiseError=>1, PrintError=>0
Cannot connect to h=XX.XXX.XX.134,p=...,u=awsmaster: DBI connect(';host=XX.XXX.XX.134;mysql_read_default_group=client','awsmaster',...) failed: Can't connect to MySQL server on 'XX.XXX.XX.134' (111) at /usr/bin/pt-online-schema-change line 2345.
# ReplicaLagWaiter:5144 17044 All slaves caught up
L940 -- cleanup starts somewhere around here
# CleanupTask:6721 17044 Calling cleanup task CODE(0x25839b8
It seems like dealing with slaves is what causes the failure. I’m wondering why slave activity is needed at all. In our case this is AWS Aurora, so I’d have expected slave activity to be handled by MySQL plus the managed Aurora configuration. I'm not sure I’m interpreting this correctly.
I should note: if I simply hard-code my host name around line 2345 of /usr/bin/pt-online-schema-change (the line the error above references), things work.
(I also tried running with --skip-check-slave-lag and --nocheck-replication-filters, but I get the same connection errors.)
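The "No slaves found" message in the output points at --recursion-method. Here is a minimal, untested sketch of that variant, assuming it is acceptable to disable slave discovery entirely on this cluster. It reuses the placeholder host, user, and password from the log output; the wrapper function and the PT_RUNNER switch are made up for illustration so that the command is only printed by default.

```shell
# Untested sketch: the same ALTER, but with slave discovery turned off.
# --recursion-method=none tells the tool not to look for slaves at all.
run_without_slave_discovery() {
  # PT_RUNNER is a hypothetical switch: unset -> just print the command,
  # PT_RUNNER=env -> actually execute it.
  ${PT_RUNNER:-echo} pt-online-schema-change --alter "FORCE" \
    --recursion-method=none \
    D=my_schema,t=test_ken \
    --host=my_host_name.cluster-myclusterid.my-aws-region.rds.amazonaws.com \
    --user=awsmaster --password=my_password --execute
}

run_without_slave_discovery
```

With discovery off, the tool has no slaves to connect to, so it also cannot check replication lag on them. Whether that is safe depends on the setup.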