Running pt-osc on RDS instance to alter a table with 1.9 billion records

I have an RDS instance running MySQL 5.5.46 with a table whose primary key is an INT. The table is currently at 1.9 billion records, approaching the ~2.1 billion signed INT limit, and is ~425GB in size. I'm attempting to use pt-online-schema-change to alter the column to BIGINT.
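For reference, the invocation is roughly the following (endpoint, schema, table, and user are placeholders for the real ones; I pass `--recursion-method=none` since there are no slaves to check for lag):

```shell
# Placeholder host/schema/table/user; prompts for the password.
pt-online-schema-change \
  --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
  --recursion-method=none \
  --ask-pass \
  --execute \
  h=myinstance.xxxxxxxx.us-east-1.rds.amazonaws.com,P=3306,D=mydb,t=mytable,u=admin
```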

I was able to successfully test the change on a test server (m3.2xlarge) and, while it took about 7 days to complete, it did finish successfully. This test server was under no additional load. (Side note: 7 days seemed like a LONG time).

For the production environment, there is no replication/slave present (but there is Multi-AZ) and, to help with resource contention and speed things up, I’m using an r3.8xlarge instance type.

After two attempts, the production migration would get to about 50% with roughly one day remaining, and then RDS would seemingly stop accepting connections, forcing pt-osc to roll back or fail outright both times because the instance needed to be rebooted.

I don’t see anything obvious in the RDS console or logs to help indicate why this happened, and I feel like the instance type should be able to handle a lot of connections/load.

Looking at the CloudWatch metrics during my (now third) attempt, the database server itself doesn't seem to be under much load: ~5% CPU, 59 DB connections, 45GB freeable memory, and 2200-2500 write IOPS.
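In case it's relevant, this is how I'm watching the live connection count against the configured ceiling during this attempt (endpoint and user are placeholders; on RDS, max_connections defaults to a formula based on instance memory and can be raised via the DB parameter group):

```shell
# Compare current connections to the instance's configured limit.
# Placeholder endpoint/user; prompts for the password.
mysql -h myinstance.xxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p -e "
  SHOW GLOBAL STATUS    LIKE 'Threads_connected';
  SHOW GLOBAL VARIABLES LIKE 'max_connections';"
```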

Wondering if anyone has run into this situation and, if so, what helped with the connection issue?

If anyone has suggestions on how to speed up the process in general, I'd love to hear them. I was considering trying a larger chunk size and running during off-hours, but wasn't sure how that would end up affecting the application.
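For context, these are the throttling-related knobs I'd be adjusting; the values below are illustrative, not tested. pt-osc defaults to `--chunk-size 1000` and throttles the copy based on `Threads_running` via `--max-load` (default 25) and `--critical-load` (default 50):

```shell
# Illustrative values only: larger chunks mean fewer round trips,
# but each copy transaction holds locks longer and writes more binlog.
pt-online-schema-change \
  --chunk-size 5000 \
  --max-load Threads_running=50 \
  --critical-load Threads_running=100 \
  --alter "MODIFY id BIGINT NOT NULL AUTO_INCREMENT" \
  --execute \
  h=myinstance.xxxxxxxx.us-east-1.rds.amazonaws.com,P=3306,D=mydb,t=mytable,u=admin
```

With `--max-load`, pt-osc pauses copying while the threshold is exceeded and resumes when load drops; `--critical-load` aborts the tool entirely, so I'd rather trip the former than the latter.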