Pt-online-schema-change --execute fails to connect after MySQL 8 upgrade, but --dry-run still works

Good morning,

If this belongs in the polyglot section, I apologize, but my issue is specific to MySQL 8. We’ve recently upgraded from MySQL 5.7 to MySQL 8 (Running in RDS Aurora 3.05.1). On 5.7, pt-ost worked as expected. We need to use it now on 8 to alter a very large online table. When I do a --dry-run, it connects to the DB and everything looks fine (sensitive data redacted):

[jbuscemi@ip-172-17-20-32 ~]$ pt-online-schema-change --alter "ADD INDEX idx_uuid (uuid)" D=reviews,t=test_reviews,P=3306,h=k8s-qa-****.rds.amazonaws.com -u ***** -p ******* --dry-run --print
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Starting a dry run.  `reviews`.`test_reviews` will not be altered.  Specify --execute instead of --dry-run to alter the table.
.
.
.
DROP TABLE IF EXISTS `reviews`.`_test_reviews_new`;
2024-01-19T18:53:48 Dropped new table OK.
Dry run complete.  `reviews`.`test_reviews` was not altered.

But when I switch to --execute, pt-ost fails to connect, hangs, times out, and throws this error (sensitive data redacted):


Can't call method "selectrow_array" on an undefined value at /usr/bin/pt-online-schema-change line 4372.

When I check iftop, I see no connection to the DB and a stack trace shows it hanging:

[root@ip-172-17-20-32 ~]# strace -p 2352
strace: Process 2352 attached
connect(5, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("10.*.*.*")}, 16

If I run the same pt-ost command with --dry-run, iftop shows the expected connection and the stack trace shows activity.

I am running the Percona Toolkit in a fully-updated Amazon Linux 2 instance that I spun up yesterday to remove any variables besides pt, mysql, and the OS. The EPEL, mysql80-community-release-el7-5.noarch.rpm and percona-release-latest.noarch.rpm repositories are installed and the Percona Toolkit and mysql client were installed via yum:

5.10.205-195.804.amzn2.x86_64
[root@ip-172-17-20-32 yum.repos.d]# mysql --version
mysql  Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)
[root@ip-172-17-20-32 yum.repos.d]# pt-online-schema-change --version
pt-online-schema-change 3.5.7
[root@ip-172-17-20-32 yum.repos.d]# openssl version
OpenSSL 1.0.2k-fips  26 Jan 2017

Would anyone know why --execute fails to connect but --dry-run works and how I can fix the problem? Thank you.

One additional data point: I just tried creating an Ubuntu instance to test with. I get the same behavior and same errors.

Hi @JohnB ,
Welcome to the Percona community.

I tried replicating the issue with the same versions of Aurora and pt-online-schema-change but I was not able to.

However, I looked at the source code and noticed that pt-online-schema-change fails at this query:

SELECT @@SERVER_ID

Could you try running this on Aurora server and see what is the output that might be causing this issue ?

Hi,
We have a similar problem, but different settings. All instances are AWS RDS MySQL 5.7 AZ. If the instance does NOT have a read replica instance - it works. But for instances with extra read replica, we get the same error :
Can’t call method “selectrow_array” on an undefined value at /usr/bin/pt-online-schema-change line 4372, line 1.
Dry-run works ok without errors.

Hi, i’m getting the same error on a set of primary+replica on AWS RDS, v8.0.35. Same error and everything.

Hi,
I found around for this :

  • pt-online-schema-change is PERL script and you can edit it
  • see line 2340, after “my $tries = 2;”
  • something is wrong with $cxn_string - it include strange IP address. I don’t know where this is getting IP, because it is not nslookup of RDS URL (also not of RDS replica)
  • first I add “print “\nConnect string before : $cxn_string\n”;” after “my $tries = 2;” (line 2341)
  • after getting this value, I tool IP address and replace it with my primary RDS hostname
  • example : “$cxn_string =~ s/10.1.1.110/mkprod.REMOVEKEY.eu-central-1.rds.amazonaws.com/;”
    And then it is working. I have successful extended 2 RDS instances with primary+replica.

Similar issue: Pt-online-schema-change changes IP of my host - #3 by Ken_Shih
seems like adding --recursion-method=NONE to your command does the trick. It fixed my issues, at least.

My thanks for the replies and my apoligies for not being able to get back to everyone sooner.

@smit.arora

mysql> SELECT @@SERVER_ID
    -> ;
+-------------+
| @@SERVER_ID |
+-------------+
|  1171820925 |
+-------------+
1 row in set (0.05 sec)

mysql>

@Matic_Petek and @Charles_Guertin
I was running this test on an Aurora clone with identical settings and the same instance type; but with only the writer and no readers. I will create another clone this week and try both of your suggestions.

Again, many thanks.

Hi @JohnB,

The query you’re failing at is coming from sub recurse_to_slaves. It is making sense that your mysql instance has no replica and hence the failure… Regardless of possible improvement in the tool, you should specifically mention --recursion-method=NONE as discovering replicas is not something you want. Dry run will not validate the replica.

The responses from @Matic_Petek and @Charles_Guertin conveying the same.

Thanks,
K