Hi all,
I’m trying to simulate a scenario in Percona XtraDB Cluster (PXC) where a long-running pt-online-schema-change operation encounters a short lock or stall near completion - caused by temporary connection exhaustion - Too many connections.
The goal is to better understand how the cluster behaves under this kind of stress and how to mitigate such stalls in production.
What would be a good approach to reproduce this scenario in a test environment?
Specifically, how to combine:
a prolonged pt-online-schema-change run, and
a temporary overload of connections or transactions that leads to a brief lock period?
Any insights or recommended methods for safely reproducing and analyzing this behavior would be greatly appreciated.
Thanks!
@SQLCaesar,
PTOSC will acquire a connection, but will not let it go until the operation is completed. Thus, PTOSC should never receive a “Too many connections” error, because it never disconnects (at least, not on purpose). PTOSC will attempt to reconnect, and resume automatically if this happens. Any locks that may happen during the table swap will be detected by PTOSC and re-tried until successful. You can use the --no-swap-tables option:
Using --no-swap-tables will run the whole process, it will create the new table, it will copy all rows but at the end it will drop the new table. It is intended to run a more realistic –dry-run.
1 Like
Hi @matthewb - thank you so much for your valuable insight.
Indeed, our colleagues were reporting an issue with “Too many connections,” which appeared to be caused by a 2-minute lock at the end of a 3.5-hour table alteration using pt-online-schema-change (pt-osc). However, if I understand correctly, you’re saying that pt-online-schema-change itself should not be able to cause a “Too many connections” problem — is that right?
Thank you for pointing me to the --no-swap-tables command — I’ll try it out and share my experience here.
Yes, your app might experience ‘too many connections’ because in order to swap the old and new tables, PTOSC must acquire an exclusive lock on the table. Once PTOSC requests this lock (known as a metadata lock or MDL), all queries that come after will be blocked. PTOSC must wait for all queries currently running against the table to finish. Then PTOSC can swap, and release the lock. The 2m lock you are experiencing is PTOSC waiting for the table to be available.
1 Like
Yes, your app might experience ‘too many connections’ because in order to swap the old and new tables, PTOSC must acquire an exclusive lock on the table. Once PTOSC requests this lock (known as a metadata lock or MDL), all queries that come after will be blocked. PTOSC must wait for all queries currently running against the table to finish. Then PTOSC can swap, and release the lock. The 2m lock you are experiencing is PTOSC waiting for the table to be available.
Hi @matthewb , thank you very much for the detailed explanation. I would just like to ask — is there any possible workaround to prevent the “Too many connections” issue from occurring in this case? Is there a way to address or mitigate it? I may be missing something due to limited hands-on experience…
Regarding testing pt-online-schema-change, I tried to reproduce a heavier workload, but I wasn’t able to generate any noticeable load on the database. My test table has the following size:
My test table has following size:
±-----------±--------+
| TABLE_ROWS | size_mb |
±-----------±--------+
| 2619974 | 121.19 |
±-----------±--------+
I used the following command:
pt-online-schema-change --alter “ADD COLUMN test_col INT” --no-swap-tables --execute D=employees,t=salaries --user=root --password=my_password. --host=127.0.0.1
However, I didn’t observe any measurable workload:
mysql> SHOW STATUS LIKE ‘wsrep_flow_control_paused%’;
±-----------------------------±------+
| Variable_name | Value |
±-----------------------------±------+
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
±-----------------------------±------+
Is there any way to put more load on the database so it becomes noticeable?
Use sysbench to slam 1 node with a 1000 threads. sysbench typically uses 1 table, sbtest1. Use this table with your ptosc.
Also, in MySQL 8, adding a column to the end of the table is “instant”. You shouldn’t need PTOSC for doing something simple like this.
1 Like
Hi @matthewb, thank you for the lesson about adding a column to the end of the table—I didn’t realize that it creates a very poor load on the database.
So finally, I was able to manage to create some load and measure it. To do this, I did the following:
On the first node, I ran these sysbench commands:
Preparing the database:
sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=my_password
–mysql-db=sbtest
–tables=1
–table-size=1000000
prepare
Running the workload:
sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=my_password
–mysql-db=sbtest
–tables=1
–threads=400
–time=60
run
On the second node, I ran the pt-online-schema-change (PTOSC) command:
pt-online-schema-change
–alter “DROP INDEX k_1, ADD INDEX k_1 (k)”
–no-swap-tables
–execute
D=sbtest,t=sbtest1
–user=root
–password=my_password
–host=127.0.0.1
Then, I was able to measure the following load on the database:
mysql> SHOW GLOBAL STATUS LIKE ‘wsrep_flow_control_paused%’;
±-----------------------------±--------------+
| Variable_name | Value |
±-----------------------------±--------------+
| wsrep_flow_control_paused_ns | 4394783604165 |
| wsrep_flow_control_paused | 0.863075 |
±-----------------------------±--------------+
2 rows in set (0.00 sec)
However, the sysbench run command, which should have lasted only one minute, lasted more than one hour. Even though I stopped it with Ctrl + C, the flow control paused metric is still increasing. So, I assume that those threads have to continue until they are finished? Is there a proper and safe way to prematurely stop testing load commands?
Hi there, I would like to enrich this thread about my another finding.
If I run the almost exact commands again, I was able to simulate a deadlock situation that prevented the load command from continuing.
I executed the commands simultaneously (the test database was already prepared with the sysbench prepare command):
Node 1: Sysbench Write Load
sysbench /usr/share/sysbench/oltp_read_write.lua
–mysql-host=127.0.0.1
–mysql-user=root
–mysql-password=removed
–mysql-db=sbtest
–tables=1
–threads=20
–time=60
run
Node 2: pt-online-schema-change (PTOSC)
pt-online-schema-change
–alter “DROP INDEX k_1, ADD INDEX k_1 (k)”
–no-swap-tables
–execute
D=sbtest,t=sbtest1
–user=root
–password=removed
–host=127.0.0.1
One of these commands subsequently failed with the following error message:
Error copying rows from sbtest.sbtest1 to sbtest._sbtest1_new:
2025-11-06T14:22:17 DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction
[for Statement “INSERT LOW_PRIORITY IGNORE INTO sbtest._sbtest1_new (id, k, c, pad) SELECT id, k, c, pad FROM sbtest.sbtest1 FORCE INDEX(PRIMARY) WHERE ((id >= ?)) AND ((id <= ?)) LOCK IN SHARE MODE /pt-online-schema-change 34506 copy nibble/” with ParamValues: 0=3121, 1=21643]
at /usr/bin/pt-online-schema-change line 12267.
The immediate question is what can be done to mitigate this deadlock situation, either in this specific case or in general, assuming the load remains the same? Specifically, which settings best address these deadlocks in a PXC environment?
Well, in this particular case, you should not be using PTOSC at all. DROP INDEX and ADD INDEX are both online, background, non-blocking actions. Don’t use PTOSC where it is not needed/required.
1 Like