I have just put a new server into production (48x Xeon Silver 4310 CPU, 128GiB RAM, NVMe disks) with the (nearly) latest Percona release (Server version: 8.0.30-22 Percona Server (GPL), Release '22', Revision '7e301439b65').
It replaces an 8-year-old server (40x Xeon E5-2670, 128GiB RAM, SSD disks) running Percona 8.0.27.
The data directories on both servers are on ext4 partitions.
On paper, the new machine should run rings around the old one (and fio shows that disk throughput on the NVMe disks is way above that of the SSD disks). Except… it doesn’t.
Running a specific complex query on the old machine averages about 11 seconds. The same query on the same dataset on the new machine averages about 75 seconds.
I can paste the MySQL configuration if needed, but consider it already tuned pretty well (innodb_buffer_pool_size around 75% of RAM, innodb_flush_neighbors = 0, innodb_numa_interleave enabled, innodb_flush_method = O_DIRECT_NO_FSYNC, and so on).
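For comparison, the relevant settings can be pulled on both servers in one go; these are all standard InnoDB system variables, and the query is just a sketch of that check:

-- Compare the InnoDB tuning mentioned above between the old and new server
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib,
       @@innodb_flush_neighbors AS flush_neighbors,
       @@innodb_flush_method AS flush_method,
       @@innodb_numa_interleave AS numa_interleave;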
The other difference is that the old machine is on Ubuntu Bionic, the new one is running Ubuntu Focal.
I’ve tried a few different things to improve performance (or even diagnose why it’s so slow) but I’m running out of ideas.
I’m looking for suggestions of things to look at, or settings in the MySQL configuration that should be examined closely. To my mind, the biggest change is SSD versus NVMe, and I’m wondering if there is some bottleneck, either at the OS level or in MySQL, that needs to be addressed.
Hi, first of all check that the query plan is the same on both servers using EXPLAIN. Also, what method did you use to copy the database to the new server? Did you consider the effects of a cold buffer pool on the new server? All of those things might affect the query time.
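A rough way to check the cold-buffer theory with standard InnoDB status counters is to run the query and then look at:

-- Innodb_buffer_pool_reads         = logical reads that had to go to disk
-- Innodb_buffer_pool_read_requests = logical reads served from the buffer pool
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

If Innodb_buffer_pool_reads keeps climbing on every run, the data still isn’t resident in the buffer pool.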
As to the cold buffer, yes, of course. That’s what I meant by “averages about” x seconds. The query was run consecutively on each server multiple times.
I completely overlooked the idea of checking the execution plans… indeed they are different.
On the “fast” machine:
explain select count(*), max_rank from (select max(`rank`) as max_rank from collection_has_video group by collection_id) X group by max_rank order by max_rank;
+----+-------------+----------------------+------------+-------+------------------+---------+---------+------+----------+----------+---------------------------------+
| id | select_type | table                | partitions | type  | possible_keys    | key     | key_len | ref  | rows     | filtered | Extra                           |
+----+-------------+----------------------+------------+-------+------------------+---------+---------+------+----------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | NULL       | ALL   | NULL             | NULL    | NULL    | NULL | 33669057 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED     | collection_has_video | NULL       | index | PRIMARY,video_id | PRIMARY | 12      | NULL | 33669057 |   100.00 | NULL                            |
+----+-------------+----------------------+------------+-------+------------------+---------+---------+------+----------+----------+---------------------------------+
On the slow machine:
explain select count(*), max_rank from (select max(`rank`) as max_rank from collection_has_video group by collection_id) X group by max_rank order by max_rank;
+----+-------------+----------------------+------------+------+------------------+------+---------+------+----------+----------+---------------------------------+
| id | select_type | table                | partitions | type | possible_keys    | key  | key_len | ref  | rows     | filtered | Extra                           |
+----+-------------+----------------------+------------+------+------------------+------+---------+------+----------+----------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | NULL       | ALL  | NULL             | NULL | NULL    | NULL | 34039263 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED     | collection_has_video | NULL       | ALL  | PRIMARY,video_id | NULL | NULL    | NULL | 34039266 |   100.00 | Using temporary                 |
+----+-------------+----------------------+------------+------+------------------+------+---------+------+----------+----------+---------------------------------+
So it looks like something funny is going on with the derived table. I diffed the optimizer variables and they are identical, but at least this gives me something to look at. Thanks!
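In case it helps anyone else doing the same comparison: diffing the variables is just a matter of dumping them on each server and diffing the two result sets, along the lines of:

-- dump every global variable so the output can be diffed between servers
SELECT variable_name, variable_value
FROM performance_schema.global_variables
ORDER BY variable_name;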
@dla,
Consider turning on optimizer tracing and comparing all of the considered plans between the old and new server. Or, more simply, use the newer EXPLAIN ANALYZE to get a more in-depth view of the execution path and compare old vs. new.
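Roughly like this, using the query from your post (standard MySQL 8.0 syntax):

-- capture the optimizer's decision process for the problem query
SET optimizer_trace = 'enabled=on';
SELECT count(*), max_rank FROM (SELECT max(`rank`) AS max_rank FROM collection_has_video GROUP BY collection_id) X GROUP BY max_rank ORDER BY max_rank;
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';

-- and/or get per-step row counts and timings of the actual execution
EXPLAIN ANALYZE SELECT count(*), max_rank FROM (SELECT max(`rank`) AS max_rank FROM collection_has_video GROUP BY collection_id) X GROUP BY max_rank ORDER BY max_rank;

Running both on the old and the new server should show exactly where the plans diverge.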
For some forgotten reason, the server had optimizer_switch = 'prefer_ordering_index=off' set, and on the old servers this had no particular impact on performance.
Some change between 8.0.27 and 8.0.28 means that this setting is acted upon in a way that is much more visible in the query plans produced.
Since I can find no trace in the ticketing system or the config repo as to why this was done in the first place, I reverted it and that brought the performance back in line. I have also removed it from a couple of current servers and see no negative effects.
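For anyone hitting the same thing: the runtime revert is just flipping the flag back (prefer_ordering_index is a standard optimizer_switch flag in 8.0), plus removing the override from the config file so the change survives a restart. Something like:

-- re-enable the default behaviour; only affects new connections
SET GLOBAL optimizer_switch = 'prefer_ordering_index=on';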
I’m kind of kicking myself for not having thought to look at the execution plans earlier.
Anyway, all is good now, thanks for the suggestions.