Percona 8 statements "hang" in "waiting for handler commit" state

We have been fighting an issue where, when our database server is under what I would consider a small amount of stress, simple queries get “hung up” in the “waiting for handler commit” state for several minutes or more.

An example of a time when we are seeing this:

One connection is loading a table with around 500k entries using a Load Data Infile statement:
LOAD DATA INFILE 'xxxx' INTO TABLE yyy CHARACTER SET latin1 FIELDS TERMINATED BY ',' (column1,column2,column3,etc);

While that is running we have ALTER statements on 3 small tables, a CREATE TABLE ... LIKE another table, and a TRUNCATE statement running.

With those 5 statements running, all other “simple” deletes or updates to a processing table (using an indexed field for row identification) just sit in a “waiting for handler commit” state. Once the larger alters/loads complete, the system normally goes back to its happy self like nothing ever happened.

Initial investigation has found:
Sorry, I'm unable to post links due to being a new user here.

An article on Stack Overflow, “very-slow-writes-on-mysql-8-waiting-for-handler-commit”.
It suggested a few changes to IO capacity and a few other adjustments; these had no noticeable impact. Sorry, I can't put the link here due to the limit on new users.

Suggestions:
innodb_lru_scan_depth=100 # from 1024 to conserve 90% CPU cycles used for function
innodb_io_capacity=1900 # from 200 to allow more IOPSecond to your storage device
innodb_flush_neighbors=2 # from 0 to expedite writing to current extent
innodb_max_dirty_pages_pct_lwm=1 # from 10 percent to expedite writes
innodb_max_dirty_pages_pct=1 # from 90 percent to reduce innodb_buffer_pool_pages_dirty count
innodb_change_buffer_max_size=50 # from 25 percent to expedite your high volume activity
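
For what it's worth, all of the variables in that list are dynamic in MySQL 8, so they can be tested at runtime before committing anything to my.cnf. A minimal sketch using the article's values (which, again, made no noticeable difference for us):

SET GLOBAL innodb_lru_scan_depth = 100;
SET GLOBAL innodb_io_capacity = 1900;
SET GLOBAL innodb_flush_neighbors = 2;
SET GLOBAL innodb_max_dirty_pages_pct_lwm = 1;
SET GLOBAL innodb_max_dirty_pages_pct = 1;
SET GLOBAL innodb_change_buffer_max_size = 50;
-- SET GLOBAL changes do not survive a restart, so reverting is just a matter of
-- setting the previous values back (or restarting the server).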

The second one was from itectec, “mysql-v8-replication-lagging-waiting-for-handler-commit”.
It suggests that this is related to the adaptive hash index; disabling innodb_adaptive_hash_index did seem to have a minimal impact, but the problem still persists.
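
For reference, that setting is also dynamic, so it can be toggled without a restart while testing (sketch):

SET GLOBAL innodb_adaptive_hash_index = OFF;  -- disable the adaptive hash index
SET GLOBAL innodb_adaptive_hash_index = ON;   -- put it back if it makes no difference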

I am not able to readily reproduce this issue, but it does seem like when it happens the longest-running items on that server are LOAD DATA statements or ALTER/UPDATE statements against larger tables.

I have attempted to double/triple the RAM allocated to this and to adjust the buffer pool, with minimal success. I have verified with the data center that we are not hitting any sort of IO limit (we are typically in the 1500 IOPS range).

Server information:

The particular server that we are working with is sitting in a private cloud on enterprise-grade SSD, and has 16 vCPUs and 32GB of RAM. This server is set up to be a primary replication server with a single replica. We are currently using Percona Server 8.0.26.

SHOW GLOBAL VARIABLES - Global Variables - Pastebin.com
SHOW GLOBAL STATUS - GLOBAL STATUS - Pastebin.com

The only other thing that comes to mind worth mentioning is that this all started happening when we upgraded from Percona 5.7 to Percona 8, and it does seem to happen in all of our shared environments.

Any suggestions on where to look next?

As always, appreciate any and all help on this!

Hi!

What MySQL 8 version are you on?
Are the updates/deletes sitting in “waiting for handler commit” related to any of the other tables that are being modified?
Is the server dedicated to MySQL, or does it share resources/storage with other apps?
What storage type do you have? And are you sure it's not saturated?

All the operations that you describe are IO bound, and it makes sense that, if the IO subsystem is saturated, other write processes need to wait; but from the message provided alone it cannot be inferred what is causing the processes to block.
To truly know what is happening, some troubleshooting should be done with SEIS (SHOW ENGINE INNODB STATUS) to check for contention, with flame graphs (Profiling Software Using perf and Flame Graphs - Percona Database Performance Blog), or with pt-stalk, checking how MySQL status variables change (you should be familiar with the internal status variables to identify what is wrong).
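
A minimal sketch of a starting point for those checks (plain MySQL commands; which counters matter depends on the workload, so treat the LIKE patterns below as examples only):

-- lock and IO waits show up in the TRANSACTIONS and LOG sections
SHOW ENGINE INNODB STATUS\G
-- redo log and dirty-page activity; sample these while a stall is happening and again a few seconds later
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_%';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
SHOW GLOBAL STATUS LIKE 'Innodb_data_pending%';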

I would say that you should try to avoid running busy/expensive queries like DDLs (ALTERs) or data loads during business hours, or at least not run them all simultaneously.

Regards

Appreciate the response.

What MySQL 8 version are you on?
We are using Percona Server 8.0.26

Are the updates/deletes sitting in “waiting for handler commit” related to any of the other tables that are being modified?
These are related to other tables, not the ones directly being altered or updated.

Is the server dedicated to MySQL, or does it share resources/storage with other apps?
This is a dedicated MySQL machine in a private cloud; no other items are sharing these resources.

What storage type do you have? And are you sure it's not saturated?
This is sitting in a private cloud environment with enterprise-grade SSD. I have been told by the data center that our entire environment pulls around 15K IOPS and that the infrastructure is capable of far higher. They proved this to me by showing some of the back-end monitoring reports. I am still skeptical that this is true based on what we keep seeing.

I will look into some of the troubleshooting options you have mentioned and see if anything else comes up when this occurs next time. Thanks for this information.

I would say that you should try to avoid running busy/expensive queries like DDLs (ALTERs) or data loads during business hours, or at least not run them all simultaneously.

These DDL operations are part of the data file loads that are needed throughout the day. I.e., an end user uploads a batch file that requires processing, and we use the data loads to insert the data from CSV files, or from other SELECT ... INTO OUTFILE results, in order to process their data. We did seem to notice that since we moved from Percona 5.7 to Percona 8, LOAD DATA seems to be more system intensive or to run as a higher-priority operation; this is purely based on observation, with nothing backing it up.

Do all tables involved have an int-based primary key?
Can you try innodb_flush_log_at_trx_commit=0, innodb_log_file_size=1G, and innodb_log_buffer_size=64M?
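
Something like this in my.cnf would be the starting point (a sketch only; innodb_log_file_size is not dynamic on 8.0.26, so the redo log change needs a restart, and innodb_flush_log_at_trx_commit=0 trades durability for speed, since up to roughly a second of committed transactions can be lost on a crash):

[mysqld]
innodb_flush_log_at_trx_commit = 0    # flush/sync the redo log about once per second instead of at every commit
innodb_log_file_size           = 1G   # requires a restart to take effect
innodb_log_buffer_size         = 64M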

Hello Matthew,
Thanks for the response.

Do all tables involved have an int-based primary key?
I reviewed the tables that were in this particular instance, and they all seem to have primary keys with an int type. We do have a few tables that are using a varchar(30) as the primary key. I am curious as to the reasoning behind the question. This key has been used for a while now, and it has always bothered me that it was a varchar(30) and not something like an integer, but it pre-dates me here. Is this something we should look into changing?

Can you try innodb_flush_log_at_trx_commit=0, innodb_log_file_size=1G, and innodb_log_buffer_size=64M?

I will give these changes a shot on my test DB and see what the outcome is.

As always, appreciate the help and insight you are able to give.

In that article, Michael uses a CHAR(36) as the PK. Same effect as a VARCHAR(30): random IO and bad disk performance on writes.
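
A purely hypothetical illustration of the difference (table and column names made up): with an auto-increment integer PK new rows append to the end of the clustered index, while a random string PK scatters inserts across the index and turns them into random IO.

CREATE TABLE t_int_pk (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- inserts go to the end of the index
  payload VARCHAR(255)
) ENGINE=InnoDB;

CREATE TABLE t_varchar_pk (
  id      VARCHAR(30) NOT NULL PRIMARY KEY,  -- random tokens/UUID-like values land all over the index
  payload VARCHAR(255)
) ENGINE=InnoDB;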

Thanks for this information, it is very helpful!

I was able to attempt this in our test environment. It does seem to have had some positive impact, but we still seem to get these hang-ups randomly. We already had the log file size set to the suggested value, so we only modified innodb_flush_log_at_trx_commit and innodb_log_buffer_size in our case. We will continue to see what else we can find. Again, appreciate the information.

Hi again,

innodb_log_file_size should be tuned to hold around 1 hour's worth of write traffic at peak load, as explained here: How to calculate a good InnoDB log file size - Percona Database Performance Blog.
Depending on how big your ETL/data loads are, you might require a redo log bigger than 1 GB.
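
One way to estimate the hourly redo write rate (a sketch of the idea, not necessarily the exact method in the linked post; the 60-second window is arbitrary):

-- sample redo bytes written, wait, then sample again
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
SELECT SLEEP(60);
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- (second sample - first sample) * 60 gives an estimate of redo bytes written per hour;
-- the combined redo log (innodb_log_file_size * innodb_log_files_in_group) should cover roughly that amount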

If you are on a cloud provider, it is possible that after running at peak load you run out of burst credits: Understand Amazon EFS burst credits.
Double-check that the write load is not spiking/saturating suddenly and for a long period.

If the dataset/operations increase monotonically over time, then after a certain threshold you reach some limit/boundary and performance degrades.
It is strongly suggested to set up a monitoring system (such as PMM, which is free and open source and which you can take a look at here: Grafana) so that you can check how MySQL behaves and where the bottlenecked/saturated subsystem is.

I want to thank @matthewb and @CTutte for all the help in this process. While we are still seeing some performance issues at times, we have been able to stabilize the environment for the most part.
In our case we have found:

  • There was an artificial limit that seems to have been put on our environment’s storage policy
  • Right-sizing the InnoDB log file based on what we learned within PMM; in our case a few of our environments were writing 10-20GB per hour
    • We ended up finding that, in our case, setting innodb_log_file_size = 5G and innodb_log_files_in_group = 4 to get us to 20GB seems to have worked best (see the config fragment after this list)
  • The last adjustment that we ended up making was adding an additional 32GB of RAM to increase the buffer pool
  • I completely agree with @CTutte about getting PMM installed; it has been a great help in this process
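
For reference, the redo log change above corresponds to roughly this my.cnf fragment (the values are from our case; innodb_buffer_pool_size is left out on purpose since it depends on your RAM):

[mysqld]
innodb_log_file_size      = 5G   # size of each redo log file
innodb_log_files_in_group = 4    # 4 x 5G = 20G of redo, sized from the hourly write volume seen in PMM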

One thing that we are still noticing is that when a larger LOAD DATA statement is running, say with 200k entries, the system is slowed down. Upon investigating this further, the disk setup on this machine has the temp location for these files and the DB running on the same disk. Moving the temp location to a separate disk in my test environment seems to have helped alleviate this.

That will certainly hurt. I would go grab sysbench and run some basic disk benchmarks for rndwr (random write) and see what sort of sustained throughput (1-3 hr test duration) you can get.
