Error / Disconnections with pt-online-schema-change

Farhan_Umer · September 28, 2023, 6:13am

Hi,

I am running pt-online-schema-change on a huge table that will complete around 5 days but i am seeing some reconnections maybe as i am not sure i have never seen this warning / error before. This cannot connect error i am talking about, please see below for details. Will it miss some rows during the process not copy all the data or this is just a reconnection and data integrity and consistency with not be damaged. Should i cancel the process, its already bin 5 days its running.

Copying database.<table-name>: 61% 2+00:11:32 remain
Cannot connect to h=X.X.X.X,u=root: DBI connect(‘;host=X.X.X.X;mysql_read_default_group=client’,‘root’,…) failed: Can’t connect to MySQL server on ‘X.X.X.X:3306’ (110) at /usr/bin/pt-online-schema-change line 2345.

Copying database.<table-name>: 64% 1+20:37:43 remain
Cannot connect to h=X.X.X.X,u=root: DBI connect(‘;host=X.X.X.X;mysql_read_default_group=client’,‘root’,…) failed: Can’t connect to MySQL server on ‘X.X.X.X:3306’ (110) at /usr/bin/pt-online-schema-change line 2345.

Copying database.<table-name>: 67% 1+16:59:18 remain
Cannot connect to h=X.X.X.X,u=root: DBI connect(‘;host=X.X.X.X;mysql_read_default_group=client’,‘root’,…) failed: Can’t connect to MySQL server on ‘X.X.X.X:3306’ (110) at /usr/bin/pt-online-schema-change line 2345.

matthewb · September 28, 2023, 5:50pm

pt-osc makes many connections to your server. One connection is watching load, and other variables, and other connection is handling the data copy. We can’t tell from the error message which one is failing. Even if it was the data copy thread, as long as pt-osc continues running, the thread knows what was the last successful row copied to the new table and resumes. The triggers placed on the source table catch other changes by the applications during reconnects. You should be fine to continue.
May I ask how large is this table? 5 days is very very very odd. How much buffer pool is configured? SSDs?

Farhan_Umer · September 30, 2023, 6:15am

The table size is around 4.5 TB and yes it is SSD. We ran this 3 times on a cloned instance to verify mysql8 upgrade but never got this disconnections error. It is at 95% right now. Seems like it is going to copy table and swap any time today. The upgrade on the cloned instances worked fine. The table ALTER was required with no addition of columns or anything. i guess mysql8 upgrade requires this.

matthewb · October 2, 2023, 8:14pm

Does this table have any DATETIME/TIMESTAMP columns created in MySQL 5.6? I’m surprised the upgrade process is forcing this table rebuild. I’ve done multi-TB upgrades from 5.7 to 8 and never had MySQL force a rebuild of a table during the upgrade process.
Did you run the MySQL 8 Upgrade Checker utility on 5.7? It might have informed you as to why a rebuild would be necessary.

Farhan_Umer · October 3, 2023, 12:32pm

i checked the upgrade using mysqlsh and all it reports is

“id”: “checkTableOutput”,
“title”: “Issues reported by ‘check table x for upgrade’ command”,
“status”: “OK”,
“detectedProblems”: [
{
“level”: “Error”,
“dbObject”: “data.table”,
“description”: “Table rebuild required. Please do "ALTER TABLE table FORCE" or dump/reload to fix it!”
}

Farhan_Umer · October 3, 2023, 12:37pm

and unfortunately it failed after 9 days with following error:

2023-10-02T08:23:30 Error copying rows from database.table to database._table_new: 2023-10-02T08:23:

30 DBD::mysql::st execute failed: Lock wait timeout exceeded; try restarting transaction [for Statement "INSERT LOW_PRIORITY IGNORE INTO

database._table_new () LOCK IN SHARE MOD

E /pt-online-schema-change 405868 copy nibble/" with ParamValues: 0=3439513469, 1=3539674433] at /usr/bin/pt-online-schema-change line

11789

matthewb · October 3, 2023, 2:17pm

That’s unfortunate but it can happen as pt-online-schema-change is just another connection that is competing with your application for MySQL resources, locks, etc.

I would recommend that you take the approach of upgrading/ALTERing a replica, then promote that replica to become the new source. I would also recommend reviewing your business policies on why a single table needs to be 4.5TB in size. As you can see, it is quite difficult to manage such a table. Perhaps 1/2 the data could be archived to another MySQL server, or offloaded to S3 bucket.

Farhan_Umer · October 12, 2023, 10:13am

Hi Matt,

Thanks for your help, can you please suggest chunk size, we have approx 3515676474 row in this table. Should i set it to 100000 or 10000000, considering IOPs don’t get overloaded towards the end of the execution, it completed the table 98% and then aws is showing high values for following which slows down last 2% which is like 10 millions rows.

wait/io/aurora_redo_log_flush
wait/synch/cond/innodb/row_lock_wait_cond

I am thinking it of keeping it 100000 so that IOPs keep low during the execution of pt-osc. What is your advise.

Thanks

Farhan_Umer · October 12, 2023, 10:16am

and can you please advise if any parameter can help with huge table alter statement

perl /usr/bin/pt-online-schema-change --host=host.mysql.amazon.aurora.com --user=my-user --alter FORCE; D=database,t=table --set-vars lock_wait_timeout=10800 --preserve-triggers --alter-foreign-keys-method=drop_swap --max-load Threads_running=2000 --critical-load Threads_running=4000 --chunk-size 100000000 --execute

Farhan_Umer · October 14, 2023, 7:31pm

@matthewb It would really appreciate if you can comment on my couple of notes above. Thanks

matthewb · October 15, 2023, 2:14pm

A chunk-size of 1000000000 means “INSERT INTO X SELECT * FROM Y WHERE PK BETWEEN 1 AND 1000000000”. Is that what you are intending? To have single transactions that large? My personal preference is to leave chunk-size at the default, because pt-osc will automatically increase the chunk size as the copy progresses. chunk-size is auto-sized based on --chunk-time which will attempt to do the maximum amount of rows per time interval. I would instead increase the time to 5s.

I would remove your set-vars lock_wait_timeout and instead set a higher retry count using --tries. pt-osc sets a low timeout on itself so that it doesn’t impact the application.

I’m also going to mention something I said above which is, you may need to consider doing your mysql 8 upgrade as a replica promotion instead of a direct upgrade. If this table is too large and too loaded with application traffic, you may just need to upgrade a replica and swap it. Or use AWS’s blue/green thing to upgrade.

Farhan_Umer · October 23, 2023, 3:57pm

@matthewb I tired the blue green method and ran pt-online-schema-change on the large table. Thanks for your advise i was able to alter table but what happened is AWS blue green replicates data through logical replication between blue and green environments and replication broke when table was swapped after pt-online-schema-change altered the table successfully.

I am seeing this error now:

Could not execute Update_rows event on table database.table; Can’t find record in ‘table’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log mysql-bin-changelog.510406, end_log_pos 112011915

           Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
              Master_Host: 172.17.15.0
              Master_User: rdsrepladmin
              Master_Port: 3306
            Connect_Retry: 60
          Master_Log_File: mysql-bin-changelog.510413
      Read_Master_Log_Pos: 60461721
           Relay_Log_File: relaylog.008334
            Relay_Log_Pos: 112011480
    Relay_Master_Log_File: mysql-bin-changelog.510406
         Slave_IO_Running: Yes
        Slave_SQL_Running: No
          Replicate_Do_DB: 
      Replicate_Ignore_DB: 
       Replicate_Do_Table: 
   Replicate_Ignore_Table: mysql.rds_replication_status,mysql.rds_monitor,mysql.rds_sysinfo,mysql.rds_configuration,mysql.rds_history
  Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:
Last_Errno: 1032
Last_Error: Could not execute Update_rows event on table database.table; Can’t find record in ‘table’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event’s master log mysql-bin-changelog.510406, end_log_pos 112011915

matthewb · October 23, 2023, 4:26pm

Hmm. I’m not sure. It might be worth it to manually stop replication (CALL mysql.rds_stop_replication()) for the duration of the pt-osc execution, then re-enable it after the table swap.
pt-osc has various hooks where you could get more granular and have pt-osc keep repl running and stop it just before the swap and enable after the drop/rename.

Yuanlin_Wu · November 9, 2023, 11:18pm

Hi,

We are facing similar use case for this. Could you please share how you managed to complete it at last?

Thanks!

Farhan_Umer · November 10, 2023, 7:14am

@Yuanlin_Wu You mean table size is huge and table alter is not successful even after running for long time?

Yuanlin_Wu · November 10, 2023, 10:01pm

Yes. How did you manage to alter the table at last? Did you use pt-osc with blue/green cluster?

Farhan_Umer · November 12, 2023, 5:42pm

It has not worked for me yet but i am trying blue green with no swapping of tables and do a manual table swap with replication stopped while manual swap and then resume the replication. I will let you know how it goes.

Farhan_Umer · November 14, 2023, 11:01am

@Yuanlin_Wu did any strategy worked for you? We have table with 3.5 billion rows, take 7 days to reach 99% with pt-osc and when it swap the table, break the replication. manual table swap is not an option, it has to be automatic with stopped replication at 99% in blue green deployment and resumes after table is successfully altered.

MahendraB · November 21, 2024, 2:43am

Can you share the hooks for pt-osc in a situation wherein the replication is running and we need to stop it just before the swap and enable after the drop/rename?

Reason why I am asking is because our ALTER got completed successfully, however, it broke the replication after swap.

Thank you in advance.

matthewb · November 22, 2024, 10:46pm

You’ll have to write the code to manage this, in perl.

Topic		Replies	Views
pt-online-schema-change execute very slowly General Questions troubleshooting , percona	1	2008	September 17, 2020
Running pt-osc on RDS instance to alter a table with 1.9 billion records Percona Toolkit	1	1315	February 6, 2023
Pt-osc Weird Drop Table Result percona	2	1038	November 10, 2022
Percona Toolkit "Not dropping triggers because the tool was interrupted" Percona Toolkit	3	1358	November 17, 2022
Pt-online-schema-change slow down Other MySQL® Questions	1	1138	November 4, 2021

Error / Disconnections with pt-online-schema-change

Related topics