One node gets all queries stuck and collapses due to too many connections

jroctanio (Entrant, Current User Role: Beginner)
Environment:
3-node Percona XtraDB Cluster 5.7.19-29.22-3
Ubuntu 16.04 LTS

Every so often, one of the nodes gets all queries stuck and collapses due to too many connections. It is not always the same node and not always the same queries.

The time between these events varies, but we have seen anywhere between 6 and 10 hours.

The only solution for us is to kill the mysqld process and start it again because shutting down the server gracefully is not possible.
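
A snapshot along these lines, taken over the one connection MySQL reserves for a SUPER user, is about all that can be gathered before the kill (a sketch; it only helps if that reserved slot is not itself stuck):

    SHOW GLOBAL STATUS LIKE 'Threads_connected';         -- how many connections have piled up
    SHOW GLOBAL STATUS LIKE 'Threads_running';           -- how many are actually executing
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; -- Synced / Donor / etc. on this node
    SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'; -- fraction of time paused by flow control
    SHOW FULL PROCESSLIST;                               -- what the stuck queries are waiting on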

I have attached the relevant graphs captured during one of these events.

Anyone with the same problem or an idea of how this could be solved?

Kind Regards,
Jose

Comments

  • beautivile (Entrant, Current User Role: Beginner)
    You are not alone in this. There is another post (https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/11056-too-many-connections-brings-down-cluster/page2) that may relate to your issues. We have ended up having to set up monitoring to alert as soon as connections go over 125% of normal so we can restart the process or kill the processes and resync.
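    A check along these lines is enough for that alert (a sketch, assuming performance_schema is enabled, which is the default; the threshold is whatever corresponds to 125% of your normal connection count):

    -- alert when threads_connected crosses your chosen fraction of max_connections
    SELECT @@max_connections AS max_connections,
           (SELECT VARIABLE_VALUE
              FROM performance_schema.global_status
             WHERE VARIABLE_NAME = 'Threads_connected') AS threads_connected;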
  • jroctanio (Entrant, Current User Role: Beginner)
    At first we considered a solution like yours, but in the end we had to replace PXC because of this problem, as it is not acceptable in a production environment. We'll keep an eye on these threads for an eventual fix of this bug.

    Thank you for your answer, beautivile.
  • krunalbauskar (Advisor, Inactive User Role: Novice)
    I presume you are hitting the following issue, as the symptoms are the same:

    commit 35cee763032246f809a84b92cbe014dbfa081972
    Author: Krunal Bauskar <[email protected]>
    Date: Fri Oct 13 21:42:08 2017 +0530

    - PXC#877: PXC hangs on what looks like an internal deadlock

    - PXC protects wsrep_xxxx state variables through a native mutex
    named LOCK_wsrep_thd.

    - This mutex should be held only while checking the state
    and should be released immediately.

    - In the said bug, this mutex was being held during the complete
    cleanup. Another thread with a conflicting lock booted up
    at the same time and invoked the kill action on the said
    thread. To invoke the kill action it acquired the transaction mutex
    and LOCK_wsrep_thd, in that order. The said thread already
    held LOCK_wsrep_thd and tried acquiring the transaction mutex,
    leading to a deadlock.


    BTW, the issue was introduced in 5.7.17 and is present in 5.7.19 (it will be fixed in 5.7.20).
    If you are experiencing it on a release before 5.7.17, or on 5.6.x, then this is not the issue.
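    A quick way to confirm whether a node is in the affected range (a sketch; run it on each node):

    SELECT @@version AS server_version;               -- affected builds: 5.7.17 through 5.7.19
    SHOW GLOBAL STATUS LIKE 'wsrep_provider_version'; -- Galera provider loaded on this node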
  • krunalbauskar (Advisor, Inactive User Role: Novice)
    jroctanio,

    PXC will miss you. Maybe you can give it another try with 5.7.20.


    Also, we would love to hear about any other issues you may face in the future. We will try our best to resolve them.
  • szuta (Entrant, Current User Role: Beginner)
    Hi krunalbauskar,
    Do you have an idea under which circumstances this bug might occur? We ran into it in our production instance (5.7.19-29.22-3 on Ubuntu 16.04 LTS) with a 3-node cluster several times yesterday, with symptoms matching those of the original post. Our solution was to revert to 5.7.16-27.19.

    For reference, we'd see errors like these:
    2017-11-28T01:45:45.840545Z 0 [Warning] InnoDB: A long semaphore wait:
    --Thread 140037937301248 has waited at lock0lock.cc line 4427 for 365.00 seconds the semaphore:
    Mutex at 0x7f967c634068, Mutex LOCK_SYS created lock0lock.cc:451, lock var 1
    
    2017-11-28T01:45:45.840563Z 0 [Warning] InnoDB: A long semaphore wait:
    --Thread 140038020478720 has waited at lock0lock.cc line 4427 for 365.00 seconds the semaphore:
    Mutex at 0x7f967c634068, Mutex LOCK_SYS created lock0lock.cc:451, lock var 1
    
    --Thread 140038339815168 has waited at ibuf0ibuf.cc line 2531 for 365.00 seconds the semaphore:
    S-lock on RW-latch at 0x7f9663ac0870 created in file buf0buf.cc line 1456
    a writer (thread id 140032994010880) has reserved it in mode  exclusive
    number of readers 0, waiters flag 1, lock_word: 0
    Last time read locked in file ibuf0ibuf.cc line 4568
    Last time write locked in file /mnt/workspace/percona-xtradb-cluster-5.7-debian-binary/label_exp/ubuntu-xenial-64bit/percona-xtradb-cluster-5.7-5.7.19-29.22/storage/innobase/ibuf/ibuf0ibuf.cc line 4568
    

    I chose that excerpt because thread 140038339815168 was the main thread:
    Process ID=181443, Main thread ID=140038339815168, state: doing insert buffer merge
    Number of rows inserted 39760179, updated 55926900, deleted 38526, read 801630968
    693.77 inserts/s, 0.00 updates/s, 0.00 deletes/s, 702.83 reads/s
    

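    For anyone else looking at a node in this state, the same semaphore and transaction details can be pulled live with something like this (a sketch):

    -- the SEMAPHORES section lists the threads stuck waiting
    SHOW ENGINE INNODB STATUS\G
    -- oldest open transactions, the same view the pt-stalk output below comes from
    SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_query
      FROM information_schema.INNODB_TRX
     ORDER BY trx_started LIMIT 10\G
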
    I initially hypothesized that an increased load on the system after running for about a week in production brought to light some concurrency issues. Specifically, after upgrading from 5.6 to 5.7 we removed the innodb_locks_unsafe_for_binlog setting. When the problem was occurring, pt-stalk showed information like this:
    *************************** 811. row ***************************
                        trx_id: 38964706
                     trx_state: RUNNING
                   trx_started: 2017-11-27 22:41:45
         trx_requested_lock_id: NULL
              trx_wait_started: NULL
                    trx_weight: 2
           trx_mysql_thread_id: 16964
                     trx_query: INSERT INTO mdl_sessions (state,sid,sessdata,userid,timemodified,timecreated,lastip,firstip) VALUES('0','xxxxxxx',NULL,'0','1511844105','1511844105','xxx.xxx.xxx.xxx','xxx.xxx.xxx.xxx')
           trx_operation_state: NULL
             trx_tables_in_use: 1
             trx_tables_locked: 1
              trx_lock_structs: 1
         trx_lock_memory_bytes: 1136
               trx_rows_locked: 0
             trx_rows_modified: 1
       trx_concurrency_tickets: 0
           trx_isolation_level: REPEATABLE READ
             trx_unique_checks: 1
        trx_foreign_key_checks: 1
    trx_last_foreign_key_error: NULL
     trx_adaptive_hash_latched: 0
     trx_adaptive_hash_timeout: 0
              trx_is_read_only: 0
    trx_autocommit_non_locking: 0
    *************************** 1. row ***************************
        lock_id: 38964255:7131:449:75
    lock_trx_id: 38964255
      lock_mode: X
      lock_type: RECORD
     lock_table: `moodle`.`mdl_sessions`
     lock_index: PRIMARY
     lock_space: 7131
      lock_page: 449
       lock_rec: 75
      lock_data: 224357301
    

    Since all of our writes (and most reads) go to one node, I wasn't expecting the isolation level to cause problems. I still changed it to READ-COMMITTED via
    transaction_isolation = READ-COMMITTED
    

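    The same change can also be made at runtime while testing (a sketch; it only applies to connections opened afterwards):

    -- on 5.7 releases before 5.7.20 the runtime variable is still tx_isolation
    SET GLOBAL tx_isolation = 'READ-COMMITTED';
    SELECT @@global.tx_isolation, @@session.tx_isolation;
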
    Since I made some changes before discovering this thread and reverting, our new configuration looks like this:
    [sst]
    inno-apply-opts = "--use-memory=40G"
    inno-backup-opts = "--parallel=20 --stream=xbstream --compress --compress-threads=20"
    [mysqld]
    user = mysql
    pid-file = /var/run/mysqld/mysqld.pid
    socket = /var/run/mysqld/mysqld.sock
    port = 3306
    basedir = /usr
    tmpdir = /tmp
    lc-messages-dir = /usr/share/mysql
    datadir = /var/lib/mysql/data
    innodb = FORCE
    innodb_use_native_aio = 0
    innodb_file_format = Barracuda
    innodb_file_per_table = 1
    innodb_large_prefix = ON
    character-set-server = utf8mb4
    collation-server = utf8mb4_unicode_ci
    skip-character-set-client-handshake = 1
    key-buffer-size = 32M
    max-heap-table-size = 48M
    max-allowed-packet = 16M
    tmp-table-size = 48M
    query-cache-type = 0
    query-cache-size = 0
    max-connections = 2000
    thread-cache-size = 50
    open-files-limit = 65535
    table-definition-cache = 4096
    table-open-cache = 8192
    innodb-log-files-in-group = 2
    innodb-log-file-size = 2G
    innodb-flush-log-at-trx-commit = 2
    innodb-file-per-table = 1
    innodb-buffer-pool-size = 216G
    innodb-buffer-pool-instances = 8
    innodb-io-capacity = 6000
    innodb-read-io-threads = 15
    innodb-write-io-threads = 15
    innodb_file_format = Barracuda
    innodb_doublewrite = 1
    innodb_adaptive_hash_index = False
    pxc_strict_mode = PERMISSIVE
    transaction_isolation = READ-COMMITTED
    innodb-thread-concurrency = 32
    innodb_flush_log_at_timeout = 2
    innodb-thread-concurrency = 0
    log-error = /var/log/mysql/mysql-error.log
    log-queries-not-using-indexes = 0
    slow-query-log = 0
    slow-query-log-file = /var/log/mysql/mysql-slow.log
    long-query-time = 2
    wsrep_provider = /usr/lib/libgalera_smm.so
    wsrep_node_address = 172.22.xy.10
    wsrep_cluster_name = prod
    wsrep_node_name = sql.ppsb.learn.illinois.edu
    wsrep_cluster_address = gcomm://172.22.xy.10,172.22.xx.10,172.22.xz.10
    wsrep_provider_options = "gcache.size=4G;"
    wsrep_retry_autocommit = 0
    wsrep_sst_method = xtrabackup-v2
    wsrep_slave_threads = 4
    wsrep_sst_auth = sstuser:xxxxxxx
    innodb-autoinc-lock-mode = 2
    log-bin = mysql-binlog
    binlog_format = ROW
    expire-logs-days = 2
    sync-binlog = 1
    innodb-flush-neighbors = 0
    innodb_checksum_algorithm = crc32
    innodb_log_group_home_dir = /var/lib/mysql/innodb-logs
    server-id = 7
    innodb_numa_interleave = 0
    [client]
    default-character-set = utf8mb4
    [mysql]
    default-character-set = utf8mb4
    

    Initially we didn't have a doublewrite buffer (running on ZFS), pxc_strict_mode was set to the default, and innodb_numa_interleave was set to 1.

    Our servers are PowerEdge R630s with 256GB of RAM and an SSD for /var/lib/mysql, with /var/lib/mysql/data having a ZFS recordsize of 16k and /var/lib/mysql/innodb-logs having a recordsize of 128k. Each server has two Intel Xeon E5-2630 v4 processors.
  • krunalbauskar (Advisor, Inactive User Role: Novice)
    I presume you are hitting the same issue (unfortunately it was due to a missing mutex unlock). There is another issue related to innodb-thread-concurrency, but I see you already set it to 0 after setting it to 32:
    innodb-thread-concurrency = 32
    innodb_flush_log_at_timeout = 2
    innodb-thread-concurrency = 0
  • szuta (Entrant, Current User Role: Beginner)
    Oops! My bad. You're right that my current configuration specifies both in that order. It sounds like in order to prevent the bug from occurring in 5.7.16 I should have it set to 0?
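    For what it's worth, since the last occurrence of an option in my.cnf wins, the effective value can at least be confirmed (and changed on the fly, as it is a dynamic variable):

    SHOW GLOBAL VARIABLES LIKE 'innodb_thread_concurrency';
    -- dynamic, so it can also be changed without a restart
    SET GLOBAL innodb_thread_concurrency = 0;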
  • dmitry.pronyaev (Entrant, Inactive User Role: Beginner)
    Hello! We have the same issue on Debian 9.2 with percona-xtradb-cluster 5.7.19.22-3, compiled with the fix from https://jira.percona.com/browse/PXC-847. It worked fine for about a week and then we hit "Too many connections" once again (without any high load, at night). How can we solve this problem?