Users cannot connect to the database when the number of threads exceeds thread_pool_size * (thread_pool_oversubscribe + 1)

I found that users cannot connect to the database when the current thread count exceeds thread_pool_size * (thread_pool_oversubscribe + 1), which I think is unreasonable.

Server version (debug build from source code, commit id 646a7c):

8.0.37-29-debug

Steps to reproduce:

Step 1: table and data

mysql> use testdb
Database changed

mysql> show create table t\G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.00 sec)

mysql> select * from t;
+----+
| id |
+----+
|  1 |
|  2 |
|  3 |
|  4 |
|  5 |
+----+
5 rows in set (0.00 sec)

Step 2: variables

mysql> show variables like '%thread_pool%';
+-------------------------------+--------------+
| Variable_name                 | Value        |
+-------------------------------+--------------+
| thread_pool_high_prio_mode    | transactions |
| thread_pool_high_prio_tickets | 4294967295   |
| thread_pool_idle_timeout      | 60           |
| thread_pool_max_threads       | 100          |
| thread_pool_oversubscribe     | 10           |
| thread_pool_size              | 1            |
| thread_pool_stall_limit       | 30           |
+-------------------------------+--------------+
7 rows in set (0.01 sec)

mysql> show variables like '%thread_hand%';
+-----------------+-----------------+
| Variable_name   | Value           |
+-----------------+-----------------+
| thread_handling | pool-of-threads |
+-----------------+-----------------+
1 row in set (0.00 sec)
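With the settings above, the busy-thread limit this report is about can be computed directly. A minimal sketch (the helper name is mine, only the formula comes from this report):

```cpp
#include <cassert>

// Busy-thread limit referenced throughout this report:
// thread_pool_size * (thread_pool_oversubscribe + 1).
// Helper name is illustrative, not a real server function.
int oversubscribe_limit(int thread_pool_size, int thread_pool_oversubscribe) {
  return thread_pool_size * (thread_pool_oversubscribe + 1);
}
```

With thread_pool_size = 1 and thread_pool_oversubscribe = 10 (the values shown above), the limit is 1 * (10 + 1) = 11, which matches the 11 test_user connections that get to run the SELECT in step 5 while the rest remain stuck at login.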

Step 3: log in as root and lock table testdb.t:

$mysql -c -A -uroot -S /u01/my3333/data/tmp/mysql.sock
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 9
Server version: 8.0.37-29-debug Source distribution

Copyright (c) 2009-2024 Percona LLC and/or its affiliates
Copyright (c) 2000, 2024, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use testdb
Database changed
mysql> lock table t write;
Query OK, 0 rows affected (0.02 sec)

Step 4: run the following script to create 30 connections and run a SELECT on table t

$cat test.sh
#!/bin/bash

for i in `seq 1 30`
do
    sleep 0.2
    nohup mysql -A -utest_user -pabctest -h 127.0.0.1 -P3333 -e "select id from testdb.t" > $i.log &
done

$sh test.sh
nohup: redirecting stderr to stdout
nohup: redirecting stderr to stdout
nohup: redirecting stderr to stdout
...
...
...

Step 5: SHOW PROCESSLIST. We can see that after 11 connections are running the SELECT and waiting for the table metadata lock, the next 19 connections cannot connect, because the thread pool's thread count has exceeded thread_pool_size * (thread_pool_oversubscribe + 1).

mysql> show processlist;
+----+----------------------+-----------------+--------+---------+------+---------------------------------+-------------------------+---------+-----------+---------------+
| Id | User                 | Host            | db     | Command | Time | State                           | Info                    | Time_ms | Rows_sent | Rows_examined |
+----+----------------------+-----------------+--------+---------+------+---------------------------------+-------------------------+---------+-----------+---------------+
|  9 | root                 | 127.0.0.1:29864 | testdb | Query   |    0 | init                            | show processlist        |       0 |         0 |             0 |
| 10 | test_user            | 127.0.0.1:29954 | NULL   | Query   |   20 | Waiting for table metadata lock | select id from testdb.t |   19635 |         0 |             0 |
| 11 | test_user            | 127.0.0.1:29956 | NULL   | Query   |   20 | Waiting for table metadata lock | select id from testdb.t |   19436 |         0 |             0 |
| 12 | test_user            | 127.0.0.1:29958 | NULL   | Query   |   19 | Waiting for table metadata lock | select id from testdb.t |   19234 |         0 |             0 |
| 13 | test_user            | 127.0.0.1:29960 | NULL   | Query   |   19 | Waiting for table metadata lock | select id from testdb.t |   19031 |         0 |             0 |
| 14 | test_user            | 127.0.0.1:29962 | NULL   | Query   |   19 | Waiting for table metadata lock | select id from testdb.t |   18830 |         0 |             0 |
| 15 | test_user            | 127.0.0.1:29966 | NULL   | Query   |   19 | Waiting for table metadata lock | select id from testdb.t |   18627 |         0 |             0 |
| 16 | test_user            | 127.0.0.1:29968 | NULL   | Query   |   19 | Waiting for table metadata lock | select id from testdb.t |   18426 |         0 |             0 |
| 17 | test_user            | 127.0.0.1:29970 | NULL   | Query   |   18 | Waiting for table metadata lock | select id from testdb.t |   18226 |         0 |             0 |
| 18 | test_user            | 127.0.0.1:29972 | NULL   | Query   |   18 | Waiting for table metadata lock | select id from testdb.t |   18007 |         0 |             0 |
| 19 | test_user            | 127.0.0.1:29974 | NULL   | Query   |   18 | Waiting for table metadata lock | select id from testdb.t |   17816 |         0 |             0 |
| 20 | test_user            | 127.0.0.1:29978 | NULL   | Query   |   18 | Waiting for table metadata lock | select id from testdb.t |   17619 |         0 |             0 |
| 21 | unauthenticated user | connecting host | NULL   | Connect |   18 | login                           | NULL                    |   17419 |         0 |             0 |
| 22 | unauthenticated user | connecting host | NULL   | Connect |   17 | login                           | NULL                    |   17218 |         0 |             0 |
| 23 | unauthenticated user | connecting host | NULL   | Connect |   17 | login                           | NULL                    |   17013 |         0 |             0 |
| 24 | unauthenticated user | connecting host | NULL   | Connect |   17 | login                           | NULL                    |   16812 |         0 |             0 |
| 25 | unauthenticated user | connecting host | NULL   | Connect |   17 | login                           | NULL                    |   16609 |         0 |             0 |
| 26 | unauthenticated user | connecting host | NULL   | Connect |   17 | login                           | NULL                    |   16408 |         0 |             0 |
| 27 | unauthenticated user | connecting host | NULL   | Connect |   16 | login                           | NULL                    |   16207 |         0 |             0 |
| 28 | unauthenticated user | connecting host | NULL   | Connect |   16 | login                           | NULL                    |   16006 |         0 |             0 |
| 29 | unauthenticated user | connecting host | NULL   | Connect |   16 | login                           | NULL                    |   15802 |         0 |             0 |
| 30 | unauthenticated user | connecting host | NULL   | Connect |   16 | login                           | NULL                    |   15603 |         0 |             0 |
| 31 | unauthenticated user | connecting host | NULL   | Connect |   16 | login                           | NULL                    |   15384 |         0 |             0 |
| 32 | unauthenticated user | connecting host | NULL   | Connect |   15 | login                           | NULL                    |   15193 |         0 |             0 |
| 33 | unauthenticated user | connecting host | NULL   | Connect |   15 | login                           | NULL                    |   14992 |         0 |             0 |
| 34 | unauthenticated user | connecting host | NULL   | Connect |   15 | login                           | NULL                    |   14791 |         0 |             0 |
| 35 | unauthenticated user | connecting host | NULL   | Connect |   15 | login                           | NULL                    |   14588 |         0 |             0 |
| 36 | unauthenticated user | connecting host | NULL   | Connect |   15 | login                           | NULL                    |   14388 |         0 |             0 |
| 37 | unauthenticated user | connecting host | NULL   | Connect |   14 | login                           | NULL                    |   14187 |         0 |             0 |
| 38 | unauthenticated user | connecting host | NULL   | Connect |   14 | login                           | NULL                    |   13986 |         0 |             0 |
| 39 | unauthenticated user | connecting host | NULL   | Connect |   14 | login                           | NULL                    |   13780 |         0 |             0 |
+----+----------------------+-----------------+--------+---------+------+---------------------------------+-------------------------+---------+-----------+---------------+
31 rows in set, 1 warning (0.00 sec)

The root cause of the above issue is that a connection request in the thread pool is always put into the normal-priority queue; however, when both of the following conditions hold, requests in the normal-priority queue are never handled:

  • Condition 1: thread_pool_high_prio_mode is set to transactions
  • Condition 2: too many busy threads are running, i.e. the thread count exceeds thread_pool_size * (thread_pool_oversubscribe + 1)

The gdb session below shows a new connection being queued via queue_put:
(gdb) b queue_put
Breakpoint 1 at 0x39f0d22: file /home/admin/percona-code/percona-server/sql/threadpool_unix.cc, line 983.
(gdb) c
Continuing.
[Thread 0x7fc08d5dc640 (LWP 10557) exited]
[New Thread 0x7fc08d5dc640 (LWP 13850)]

Thread 1 "mysqld" hit Breakpoint 1, queue_put (thread_group=0x8d81c00 <all_groups>, connection=0x9d635a0) at /home/admin/percona-code/percona-server/sql/threadpool_unix.cc:983
983   DBUG_ENTER("queue_put");
(gdb) bt
#0  queue_put (thread_group=0x8d81c00 <all_groups>, connection=0x9d635a0) at /home/admin/percona-code/percona-server/sql/threadpool_unix.cc:983
#1  0x00000000039f160f in Thread_pool_connection_handler::add_connection (this=0x98fb110, channel_info=0x9ddb7e0)
    at /home/admin/percona-code/percona-server/sql/threadpool_unix.cc:1236
#2  0x0000000003b881d3 in Connection_handler_manager::process_new_connection (this=0x990eed0, channel_info=0x9ddb7e0)
    at /home/admin/percona-code/percona-server/sql/conn_handler/connection_handler_manager.cc:281
#3  0x000000000360d04a in Connection_acceptor<Mysqld_socket_listener>::connection_event_loop (this=0x9a658c0)
    at /home/admin/percona-code/percona-server/sql/conn_handler/connection_acceptor.h:66
#4  0x00000000036000b4 in mysqld_main (argc=414, argv=0x987f280) at /home/admin/percona-code/percona-server/sql/mysqld.cc:8693
#5  0x000000000337efc2 in main (argc=11, argv=0x7ffda25654b8) at /home/admin/percona-code/percona-server/sql/main.cc:26
/*
  Add work to the queue. Maybe wake a worker if they all sleep.

  Currently, this function is only used when new connections need to
  perform login (this is done in worker threads).

*/
static void queue_put(thread_group_t *thread_group, connection_t *connection) {
  DBUG_ENTER("queue_put");

  mysql_mutex_lock(&thread_group->mutex);
  connection->tickets = connection->thd->variables.threadpool_high_prio_tickets;
  thread_group->queue.push_back(connection);

  if (thread_group->active_thread_count == 0)
    wake_or_create_thread(thread_group, connection->thd->is_admin_connection());

  mysql_mutex_unlock(&thread_group->mutex);

  DBUG_VOID_RETURN;
}
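To make the starvation concrete, here is a deliberately simplified model of the worker-side dequeue policy. The structures and function below are illustrative stand-ins, not the actual Percona code (the real logic lives in sql/threadpool_unix.cc); they only model the behavior described above: high-priority work is always served, while the normal queue is skipped whenever the group is oversubscribed.

```cpp
#include <cassert>
#include <deque>

// Simplified stand-ins for the real server structures (illustrative only).
struct connection_t { int id; };

struct thread_group_t {
  std::deque<connection_t *> high_prio_queue;  // served unconditionally
  std::deque<connection_t *> queue;            // normal priority; logins land here
  int active_thread_count = 0;
  int pool_size = 1;       // thread_pool_size
  int oversubscribe = 10;  // thread_pool_oversubscribe
};

// Sketch of the dequeue decision: high-priority work is always taken, but
// the normal queue is only served while the group is not oversubscribed,
// so queued login requests can wait indefinitely.
connection_t *queue_get(thread_group_t *g) {
  if (!g->high_prio_queue.empty()) {
    connection_t *c = g->high_prio_queue.front();
    g->high_prio_queue.pop_front();
    return c;
  }
  bool oversubscribed =
      g->active_thread_count > g->pool_size * (g->oversubscribe + 1);
  if (!oversubscribed && !g->queue.empty()) {
    connection_t *c = g->queue.front();
    g->queue.pop_front();
    return c;
  }
  return nullptr;  // nothing runnable: the normal queue starves
}
```

Under this model, once active_thread_count stays above the limit (because the 11 running threads are all blocked on the metadata lock, which does not count as "waiting" to the pool), a login sitting in the normal queue is never returned by queue_get.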

I don't understand why requests in the normal-priority queue should never be handled once the above two conditions are met. Client-side connections are normally maintained in a connection pool (e.g. Druid); if a connection cannot be obtained within some time (in JDBC, the connectTimeout connection URL parameter), the client gives up on that connection and tries to open a new one. However, the abandoned connection still counts against the database's connection limit: Connection_handler_manager::process_new_connection first increments the connection count and only then adds the connection request to the normal-priority queue. If the client connection pool keeps retrying like this, the database's connections are eventually exhausted and it reports "Too many connections". Once that happens, the database can only be recovered by admin users connecting through the admin address, which is troublesome and affects our users' business.

void Connection_handler_manager::process_new_connection(
    Channel_info *channel_info) {
  if (connection_events_loop_aborted() ||
      !check_and_incr_conn_count(channel_info->is_admin_connection())) {
    channel_info->send_error_and_close_channel(ER_CON_COUNT_ERROR, 0, true);
    sql_print_warning("%s", ER_DEFAULT(ER_CON_COUNT_ERROR));
    delete channel_info;
    return;
  }

  if (m_connection_handler->add_connection(channel_info)) {
    inc_aborted_connects();
    delete channel_info;
  }
}
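The exhaustion loop described above can be modeled in a few lines. This is a hypothetical simulation, not server code; it only captures the ordering that matters: the connection count is incremented (as check_and_incr_conn_count does) before the login is queued, and the queued login is never processed, so retries make the count grow monotonically until the limit is hit.

```cpp
#include <cassert>

// Hypothetical model of the server's connection accounting. The field
// names and the max_connections default are illustrative.
struct server_model {
  int connection_count = 0;
  int max_connections = 151;  // hypothetical default

  // Returns true if the connection is admitted (and, in the scenario in
  // this report, then queued forever); false models the server replying
  // with ER_CON_COUNT_ERROR ("Too many connections").
  bool try_connect() {
    if (connection_count >= max_connections) return false;
    ++connection_count;  // incremented before the login is ever handled
    return true;
  }
};
```

Because nothing in the starved state ever decrements connection_count, a client pool that retries on connectTimeout will drive the count to max_connections, after which every further attempt, including ordinary application connections, fails.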

I think the connection request should always be put into the high-priority queue (or a new queue, which could be called the connection queue). With the configuration thread_pool_high_prio_mode = statements we never run out of database connections, because all SQL requests are put into the high-priority queue and can be handled by threads even when the thread count has exceeded thread_pool_size * (thread_pool_oversubscribe + 1).
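The proposed change can be sketched as a small variation on queue_put. This is purely illustrative: the real queue_put signature and fields are in sql/threadpool_unix.cc, and the structures below are simplified stand-ins.

```cpp
#include <cassert>
#include <deque>

// Simplified stand-ins for the real server structures (illustrative only).
struct connection_t { bool is_login; };

struct thread_group_t {
  std::deque<connection_t *> high_prio_queue;  // served even when oversubscribed
  std::deque<connection_t *> queue;            // normal priority
};

// Proposed behavior: route login requests to the high-priority queue (or a
// dedicated connection queue) so they cannot be starved by oversubscription;
// everything else keeps its current placement.
void queue_put_proposed(thread_group_t *g, connection_t *c) {
  if (c->is_login)
    g->high_prio_queue.push_back(c);
  else
    g->queue.push_back(c);
}
```

Since workers drain the high-priority queue unconditionally, logins would then complete even while all pool threads are blocked, and the "Too many connections" pile-up described above could not occur.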

Hey @trikker,
Since you have a repeatable test case, I suggest you open a bug at https://jira.percona.com/ Our internals engineers can then take a look.

OK, created, thanks. Link: [PS-9452] - Percona JIRA

@matthewb No one has replied to the issue [PS-9452] - Percona JIRA yet; could anyone please help me? Thanks.

Hey @trikker,
Percona is not a 1000-person company. Our engineering team looks at tickets in order of priority, and our QA team also needs to verify the issue. Be patient and someone will get to it. If you require more immediate resolution, you can engage in a support contract, which allows for bug escalation.