How to simulate short lock due to connection exhaustion in PXC during pt-online-schema-change?

SQLCaesar · October 29, 2025, 10:17am

Hi all,

I’m trying to simulate a scenario in Percona XtraDB Cluster (PXC) where a long-running pt-online-schema-change operation encounters a short lock or stall near completion - caused by temporary connection exhaustion - Too many connections.

The goal is to better understand how the cluster behaves under this kind of stress and how to mitigate such stalls in production.

What would be a good approach to reproduce this scenario in a test environment?
Specifically, how to combine:

a prolonged pt-online-schema-change run, and
a temporary overload of connections or transactions that leads to a brief lock period?

Any insights or recommended methods for safely reproducing and analyzing this behavior would be greatly appreciated.

Thanks!

matthewb · October 29, 2025, 4:56pm

@SQLCaesar,
PTOSC will acquire a connection, but will not let it go until the operation is completed. Thus, PTOSC should never receive a “Too many connections” error, because it never disconnects (at least, not on purpose). PTOSC will attempt to reconnect, and resume automatically if this happens. Any locks that may happen during the table swap will be detected by PTOSC and re-tried until successful. You can use the --no-swap-tables option:

Using --no-swap-tables will run the whole process, it will create the new table, it will copy all rows but at the end it will drop the new table. It is intended to run a more realistic –dry-run.

SQLCaesar · October 29, 2025, 7:44pm

Hi @matthewb - thank you so much for your valuable insight.

Indeed, our colleagues were reporting an issue with “Too many connections,” which appeared to be caused by a 2-minute lock at the end of a 3.5-hour table alteration using pt-online-schema-change (pt-osc). However, if I understand correctly, you’re saying that pt-online-schema-change itself should not be able to cause a “Too many connections” problem — is that right?

Thank you for pointing me to the --no-swap-tables command — I’ll try it out and share my experience here.

matthewb · October 29, 2025, 7:51pm

Yes, your app might experience ‘too many connections’ because in order to swap the old and new tables, PTOSC must acquire an exclusive lock on the table. Once PTOSC requests this lock (known as a metadata lock or MDL), all queries that come after will be blocked. PTOSC must wait for all queries currently running against the table to finish. Then PTOSC can swap, and release the lock. The 2m lock you are experiencing is PTOSC waiting for the table to be available.

SQLCaesar · November 3, 2025, 6:53pm

Yes, your app might experience ‘too many connections’ because in order to swap the old and new tables, PTOSC must acquire an exclusive lock on the table. Once PTOSC requests this lock (known as a metadata lock or MDL), all queries that come after will be blocked. PTOSC must wait for all queries currently running against the table to finish. Then PTOSC can swap, and release the lock. The 2m lock you are experiencing is PTOSC waiting for the table to be available.

Hi @matthewb , thank you very much for the detailed explanation. I would just like to ask — is there any possible workaround to prevent the “Too many connections” issue from occurring in this case? Is there a way to address or mitigate it? I may be missing something due to limited hands-on experience…

Regarding testing pt-online-schema-change, I tried to reproduce a heavier workload, but I wasn’t able to generate any noticeable load on the database. My test table has the following size:

My test table has following size:

±-----------±--------+
| TABLE_ROWS | size_mb |
±-----------±--------+
| 2619974 | 121.19 |
±-----------±--------+

I used the following command:

pt-online-schema-change --alter “ADD COLUMN test_col INT” --no-swap-tables --execute D=employees,t=salaries --user=root --password=my_password. --host=127.0.0.1

However, I didn’t observe any measurable workload:

mysql> SHOW STATUS LIKE ‘wsrep_flow_control_paused%’;
±-----------------------------±------+
| Variable_name | Value |
±-----------------------------±------+
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
±-----------------------------±------+

Is there any way to put more load on the database so it becomes noticeable?

matthewb · November 4, 2025, 12:50am

Use sysbench to slam 1 node with a 1000 threads. sysbench typically uses 1 table, sbtest1. Use this table with your ptosc.

Also, in MySQL 8, adding a column to the end of the table is “instant”. You shouldn’t need PTOSC for doing something simple like this.

SQLCaesar · November 5, 2025, 8:32pm

Hi @matthewb, thank you for the lesson about adding a column to the end of the table—I didn’t realize that it creates a very poor load on the database.

So finally, I was able to manage to create some load and measure it. To do this, I did the following:

On the first node, I ran these sysbench commands:

Preparing the database:

sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=my_password
–mysql-db=sbtest
–tables=1
–table-size=1000000
prepare

Running the workload:

sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=my_password
–mysql-db=sbtest
–tables=1
–threads=400
–time=60
run

On the second node, I ran the pt-online-schema-change (PTOSC) command:

pt-online-schema-change
–alter “DROP INDEX k_1, ADD INDEX k_1 (k)”
–no-swap-tables
–execute
D=sbtest,t=sbtest1
–user=root
–password=my_password
–host=127.0.0.1

Then, I was able to measure the following load on the database:

However, the sysbench run command, which should have lasted only one minute, lasted more than one hour. Even though I stopped it with Ctrl + C, the flow control paused metric is still increasing. So, I assume that those threads have to continue until they are finished? Is there a proper and safe way to prematurely stop testing load commands?

SQLCaesar · November 6, 2025, 7:30pm

Hi there, I would like to enrich this thread about my another finding.

If I run the almost exact commands again, I was able to simulate a deadlock situation that prevented the load command from continuing.

I executed the commands simultaneously (the test database was already prepared with the sysbench prepare command):

Node 1: Sysbench Write Load

sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=removed
–mysql-db=sbtest
–tables=1
–threads=20
–time=60
run

Node 2: pt-online-schema-change (PTOSC)

pt-online-schema-change
–alter “DROP INDEX k_1, ADD INDEX k_1 (k)”
–no-swap-tables
–execute
D=sbtest,t=sbtest1
–user=root
–password=removed
–host=127.0.0.1

One of these commands subsequently failed with the following error message:

Error copying rows from sbtest.sbtest1 to sbtest._sbtest1_new:
2025-11-06T14:22:17 DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction
[for Statement “INSERT LOW_PRIORITY IGNORE INTO sbtest._sbtest1_new (id, k, c, pad) SELECT id, k, c, pad FROM sbtest.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) AND ((id <= ?)) LOCK IN SHARE MODE /pt-online-schema-change 34506 copy nibble/” with ParamValues: 0=3121, 1=21643]
at /usr/bin/pt-online-schema-change line 12267.

The immediate question is what can be done to mitigate this deadlock situation, either in this specific case or in general, assuming the load remains the same? Specifically, which settings best address these deadlocks in a PXC environment?

matthewb · November 6, 2025, 7:48pm

Well, in this particular case, you should not be using PTOSC at all. DROP INDEX and ADD INDEX are both online, background, non-blocking actions. Don’t use PTOSC where it is not needed/required.

SQLCaesar · November 10, 2025, 8:47pm

@matthewb Thank you for another valuable piece of information regarding databases.

Let me sum up this threat:

For a deeper understanding of Percona databases and locking mechanisms, this explanation was very beneficial: the pt-online-schema-change (PTOSC) tool holds connections until the process is complete and should not fail due to connection limits. The actual lock occurs during the Metadata Lock (MDL) phase, specifically during the table swap operation.

Additionally, I was able to simulate load on the database by using the PTOSC command with the --no-swap-tables option, which prevents the final table exchange. I ran the sysbench command simultaneously with the PTOSC command. It should be kept in mind, however, that actions like DROP INDEX and ADD INDEX are generally online, non-blocking, and background operations, so they shouldn’t be used where they aren’t necessary. Furthermore, using PTOSC for adding a column to the end of the table is useless, as this operation is practically instantaneous.

Topic		Replies	Views
Pt online schema change is very slow on prod server General Questions	8	2286	January 27, 2022
Deadlocks when running pt-online-schema-change on XtraDB cluster Percona XtraDB Cluster 5.x	3	1604	February 27, 2014
Pt-online-schema-change Running simultaneously for multiple databases Percona Toolkit mysql , percona	4	619	May 28, 2024
While executing pt-osc i can see lot of deadlock errors	10	2185	February 25, 2022
pt-online-schema-change crashed mysql within 5 minutes of starting job Percona Toolkit	1	2440	November 17, 2015

How to simulate short lock due to connection exhaustion in PXC during pt-online-schema-change?

Related topics