Pt-online-schema-change --execute fails to connect after MySQL 8 upgrade, but --dry-run still works

Good morning,

If this belongs in the polyglot section, I apologize, but my issue is specific to MySQL 8. We’ve recently upgraded from MySQL 5.7 to MySQL 8 (Running in RDS Aurora 3.05.1). On 5.7, pt-ost worked as expected. We need to use it now on 8 to alter a very large online table. When I do a --dry-run, it connects to the DB and everything looks fine (sensitive data redacted):

[jbuscemi@ip-172-17-20-32 ~]$ pt-online-schema-change --alter "ADD INDEX idx_uuid (uuid)" D=reviews,t=test_reviews,P=3306,h=k8s-qa-****.rds.amazonaws.com -u ***** -p ******* --dry-run --print
Operation, tries, wait:
  analyze_table, 10, 1
  copy_rows, 10, 0.25
  create_triggers, 10, 1
  drop_triggers, 10, 1
  swap_tables, 10, 1
  update_foreign_keys, 10, 1
Starting a dry run.  `reviews`.`test_reviews` will not be altered.  Specify --execute instead of --dry-run to alter the table.
.
.
.
DROP TABLE IF EXISTS `reviews`.`_test_reviews_new`;
2024-01-19T18:53:48 Dropped new table OK.
Dry run complete.  `reviews`.`test_reviews` was not altered.

But when I switch to --execute, pt-ost fails to connect, hangs, times out, and throws this error (sensitive data redacted):


Can't call method "selectrow_array" on an undefined value at /usr/bin/pt-online-schema-change line 4372.

When I check iftop, I see no connection to the DB and a stack trace shows it hanging:

[root@ip-172-17-20-32 ~]# strace -p 2352
strace: Process 2352 attached
connect(5, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("10.*.*.*")}, 16

If I run the same pt-ost command with --dry-run, iftop shows the expected connection and the stack trace shows activity.

I am running the Percona Toolkit in a fully-updated Amazon Linux 2 instance that I spun up yesterday to remove any variables besides pt, mysql, and the OS. The EPEL, mysql80-community-release-el7-5.noarch.rpm and percona-release-latest.noarch.rpm repositories are installed and the Percona Toolkit and mysql client were installed via yum:

5.10.205-195.804.amzn2.x86_64
[root@ip-172-17-20-32 yum.repos.d]# mysql --version
mysql  Ver 8.0.36 for Linux on x86_64 (MySQL Community Server - GPL)
[root@ip-172-17-20-32 yum.repos.d]# pt-online-schema-change --version
pt-online-schema-change 3.5.7
[root@ip-172-17-20-32 yum.repos.d]# openssl version
OpenSSL 1.0.2k-fips  26 Jan 2017

Would anyone know why --execute fails to connect but --dry-run works and how I can fix the problem? Thank you.

One additional data point: I just tried creating an Ubuntu instance to test with. I get the same behavior and same errors.

Hi @JohnB ,
Welcome to the Percona community.

I tried replicating the issue with the same versions of Aurora and pt-online-schema-change but I was not able to.

However, I looked at the source code and noticed that pt-online-schema-change fails at this query:

SELECT @@SERVER_ID

Could you try running this on Aurora server and see what is the output that might be causing this issue ?

Hi,
We have a similar problem, but different settings. All instances are AWS RDS MySQL 5.7 AZ. If the instance does NOT have a read replica instance - it works. But for instances with extra read replica, we get the same error :
Can’t call method “selectrow_array” on an undefined value at /usr/bin/pt-online-schema-change line 4372, line 1.
Dry-run works ok without errors.

Hi, i’m getting the same error on a set of primary+replica on AWS RDS, v8.0.35. Same error and everything.

Hi,
I found around for this :

  • pt-online-schema-change is PERL script and you can edit it
  • see line 2340, after “my $tries = 2;”
  • something is wrong with $cxn_string - it include strange IP address. I don’t know where this is getting IP, because it is not nslookup of RDS URL (also not of RDS replica)
  • first I add “print “\nConnect string before : $cxn_string\n”;” after “my $tries = 2;” (line 2341)
  • after getting this value, I tool IP address and replace it with my primary RDS hostname
  • example : “$cxn_string =~ s/10.1.1.110/mkprod.REMOVEKEY.eu-central-1.rds.amazonaws.com/;”
    And then it is working. I have successful extended 2 RDS instances with primary+replica.

Similar issue: Pt-online-schema-change changes IP of my host - #3 by Ken_Shih
seems like adding --recursion-method=NONE to your command does the trick. It fixed my issues, at least.

1 Like

My thanks for the replies and my apoligies for not being able to get back to everyone sooner.

@smit.arora

mysql> SELECT @@SERVER_ID
    -> ;
+-------------+
| @@SERVER_ID |
+-------------+
|  1171820925 |
+-------------+
1 row in set (0.05 sec)

mysql>

@Matic_Petek and @Charles_Guertin
I was running this test on an Aurora clone with identical settings and the same instance type; but with only the writer and no readers. I will create another clone this week and try both of your suggestions.

Again, many thanks.

Hi @JohnB,

The query you’re failing at is coming from sub recurse_to_slaves. It is making sense that your mysql instance has no replica and hence the failure… Regardless of possible improvement in the tool, you should specifically mention --recursion-method=NONE as discovering replicas is not something you want. Dry run will not validate the replica.

The responses from @Matic_Petek and @Charles_Guertin conveying the same.

Thanks,
K

I was having this issue as well, using AWS Aurora 3 (MySQL 8).

Although adding the --recursion-method=NONE worked, I do actually have replicas. What is the risk associated with executing this without the process being replica-aware?