MySQL showing intermittently poor performance with a nested loop antijoin

I have a very simple query which is randomly performing poorly in production. I have the EXPLAIN ANALYZE outputs for both the good case and the bad case.

MySQL version is 8.0.31.

This is the query:

SELECT COUNT(DISTINCT u.Id)
    FROM Users AS u
    LEFT JOIN Bots ON u.Id = Bots.UserId
    WHERE Bots.UserId IS NULL;

Essentially, Bots is a separate table of user IDs with some additional bot-specific columns. The query finds out how many users are not bots.

Schema:

CREATE TABLE `Bots` (
  `UserId` varchar(26) NOT NULL,
  `Description` text,
  `OwnerId` varchar(190) DEFAULT NULL,
  `LastIconUpdate` bigint DEFAULT NULL,
  ... truncated
  PRIMARY KEY (`UserId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

CREATE TABLE `Users` (
  `Id` varchar(26) NOT NULL,
  `CreateAt` bigint DEFAULT NULL,
  `UpdateAt` bigint DEFAULT NULL,
  `DeleteAt` bigint DEFAULT NULL,
  `Username` varchar(64) DEFAULT NULL,
  `Password` varchar(128) DEFAULT NULL,
  `AuthData` varchar(128) DEFAULT NULL,
  `AuthService` varchar(32) DEFAULT NULL,
  `Email` varchar(128) DEFAULT NULL,
  `EmailVerified` tinyint(1) DEFAULT NULL,
  `Nickname` varchar(64) DEFAULT NULL,
  `FirstName` varchar(64) DEFAULT NULL,
  `LastName` varchar(64) DEFAULT NULL,
  `Position` varchar(128) DEFAULT NULL,
   ... truncated
  PRIMARY KEY (`Id`),
  UNIQUE KEY `Username` (`Username`),
  UNIQUE KEY `AuthData` (`AuthData`),
  UNIQUE KEY `Email` (`Email`),
  KEY `idx_users_update_at` (`UpdateAt`),
  KEY `idx_users_create_at` (`CreateAt`),
  KEY `idx_users_delete_at` (`DeleteAt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

Users.Id is a varchar(26) primary key, and Bots.UserId is a primary key of the same varchar(26) type. Ignore the DISTINCT in the query for now; we don’t actually need it, and the problem at hand is unrelated to it.

With that out of the way, let’s get into the problem. The system has around 100K rows in Users and 10 rows in Bots. Most of the time, the query finishes in a couple of milliseconds.

Here is the EXPLAIN ANALYZE output for the good case:

Aggregate: count(distinct u.Id) (cost=71155.80 rows=1) (actual time=463.397..463.397 rows=1 loops=1)
    -> Filter: (bots.UserId is null) (cost=58476.70 rows=126791) (actual time=0.051..234.996 rows=131491 loops=1)
        -> Nested loop antijoin (cost=58476.70 rows=126791) (actual time=0.050..223.681 rows=131491 loops=1)
            -> Covering index scan on u using idx_users_update_at (cost=14099.85 rows=126791) (actual time=0.038..34.614 rows=131496 loops=1)
            -> Single-row covering index lookup on Bots using PRIMARY (UserId=u.Id) (cost=0.25 rows=1) (actual time=0.001..0.001 rows=0 loops=131496)

You can see that MySQL converts the LEFT JOIN with IS NULL into an antijoin. The query takes around 400ms, of which around 200ms goes into calculating the DISTINCT. All good here.

But here is the bad execution:

 Aggregate: count(distinct u.Id) (cost=71154.70 rows=1) (actual time=10298.826..10298.826 rows=1 loops=1)
    -> Filter: (bots.UserId is null) (cost=58475.80 rows=126789) (actual time=0.045..9853.564 rows=131489 loops=1) 
        -> Nested loop antijoin (cost=58475.80 rows=126789) (actual time=0.044..9840.607 rows=131489 loops=1) 
            -> Covering index scan on u using idx_users_update_at (cost=14099.65 rows=126789) (actual time=0.035..151.337 rows=131494 loops=1) 
            -> Single-row covering index lookup on Bots using PRIMARY (UserId=u.Id) (cost=0.25 rows=1) (actual time=0.074..0.074 rows=0 loops=131494) 

You can see that the execution plan is exactly the same, but somehow the nested loop antijoin now takes 9.8s rather than ~200ms: actual time=0.044..9840.607. The per-loop numbers show where the time goes: each single-row lookup on Bots now averages 0.074ms instead of ~0.001ms, and 131,494 loops × 0.074ms ≈ 9.7s, which accounts for essentially the entire regression.

I have been scratching my head to find an explanation but unable to.

And even more interestingly, this is the query that actually fixes the problem:

EXPLAIN ANALYZE SELECT COUNT(DISTINCT u.Id)
    FROM Users AS u
    WHERE u.Id NOT IN (SELECT UserId FROM Bots);

| -> Aggregate: count(distinct u.Id)  (cost=203038.61 rows=1) (actual time=339.482..339.482 rows=1 loops=1)
    -> Nested loop antijoin  (cost=119428.11 rows=836105) (actual time=0.170..103.207 rows=154346 loops=1)
        -> Covering index scan on u using idx_users_update_at  (cost=19095.51 rows=167221) (actual time=0.092..30.890 rows=154351 loops=1)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (UserId=u.Id)  (actual time=0.000..0.000 rows=0 loops=154351)
            -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.056..0.056 rows=5 loops=1)
                -> Filter: (Bots.UserId is not null)  (cost=0.75 rows=5) (actual time=0.019..0.030 rows=5 loops=1)
                    -> Covering index scan on Bots using PRIMARY  (cost=0.75 rows=5) (actual time=0.017..0.027 rows=5 loops=1)

Effectively, this is being translated into the same thing, isn’t it? The subquery gets materialized into a table, which is then antijoined against Users. But with this query, the runtime always stays within the 300-400ms range and never regresses. What gives?
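For completeness, there is a third equivalent form I could try, a correlated NOT EXISTS. This is just a sketch I haven’t run against this database, but MySQL 8.0 can transform it into an antijoin as well, and since Id is the primary key, COUNT(*) is equivalent to COUNT(DISTINCT u.Id) here:

```sql
-- Equivalent antijoin phrasing via NOT EXISTS.
-- COUNT(*) is safe because u.Id is the primary key,
-- so the outer rows are already distinct.
SELECT COUNT(*)
FROM Users AS u
WHERE NOT EXISTS (SELECT 1 FROM Bots AS b WHERE b.UserId = u.Id);
```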

Is this a known quirk of MySQL that I am unaware of?

Hi @Agniva_De_Sarker

Could a massive update/insert have been executed sometime before (or during) that query? I see that the query uses a secondary index on the UpdateAt column. If a significant number of rows were modified, we could have a problem with the change buffer.

https://dev.mysql.com/doc/refman/5.7/en/innodb-change-buffer.html

We could also have to read through many old row versions in the rollback segments if our query runs with an old snapshot of the data because of a large uncommitted transaction.
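A quick way to check both hypotheses is to look at the InnoDB counters (a sketch; some entries in innodb_metrics are disabled by default and may need innodb_monitor_enable):

```sql
-- Change buffer activity: counters from the change_buffer subsystem.
SELECT name, count
FROM information_schema.innodb_metrics
WHERE subsystem = 'change_buffer';

-- MVCC backlog: the history list length grows when purge falls behind,
-- e.g. because a long-running transaction holds an old read view.
SELECT count
FROM information_schema.innodb_metrics
WHERE name = 'trx_rsegs_history_len';

-- Both also appear in the raw status output, under
-- "INSERT BUFFER AND ADAPTIVE HASH INDEX" and "History list length".
SHOW ENGINE INNODB STATUS;
```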

Pep

Thanks for looking into this Pep.

Could a massive update/insert be executed sometime before (or during) that query?

I can’t say definitively, but it’s unlikely that something like that is happening, especially because this happens all the time.

Moreover, it doesn’t explain why everything is fine once the query is changed to the NOT IN variant.

It’s just this single query, and there are no other queries in this transaction.

We need more information to diagnose the issue once we discard the usual suspects. Are you using PMM to monitor the database? Look for spikes in any InnoDB metric.

Maybe you were hit by a regression introduced in 8.0.31.

Unfortunately, this is happening on a client’s database, and we haven’t been able to reproduce it in-house. So it’s difficult to ask them now for metrics for that time range, given that the issue is resolved by using the new query.

But I should add that I have seen a similar issue, with the same LEFT JOIN .. IS NULL query structure, at another customer as well. There too, converting it to the NOT IN version resolved the problem.

That’s why this feels like a core MySQL query performance issue. I just don’t understand how.