Hello all,
Can someone shed some light on the problem with the MDL conflict causing MySQL to shutdown node in the cluster?
Our enviroment is 3 PXC nodes with loadbalancing via ProxySQL:
node1 read-write
node2 read-only
node3 read-only
8.0.26-16.1 Percona XtraDB Cluster (GPL), Release rel16, Revision b141904, WSREP version 26.4.3
When a problem occurs:
When we execute mass ALTER tables (adding columns/keys/foreign keys) on read-write node1, other nodes may crash because of MDL conflicts.
node1 my.cnf
[mysqld]
datadir                 = /var/lib/mysql
tmpdir                  = /srv/mysql_tmpdir
internal_tmp_mem_storage_engine = MEMORY
user                    = mysql
default_storage_engine  = InnoDB
binlog_format           = ROW
skip-name-resolve	= 1
sql_mode                = ""
collation_server        = utf8_unicode_ci
character_set_server    = utf8
# WSREP #
wsrep_provider          = /usr/lib/galera4/libgalera_smm.so
wsrep_provider_options  = "gcache.size=1G;gcache.recover=yes;gcs.fc_limit=160;gcs.fc_factor=0.8;pc.weight=1;cert.log_conflicts=yes;"
wsrep_cluster_address   = gcomm://10.0.1.10,10.0.2.10,10.0.3.10
wsrep_node_address      = 10.0.1.10
wsrep_node_name         = pxc-node1
wsrep_sst_donor         = pxc-node3
wsrep_cluster_name      = pxc-cluster
wsrep_sst_method        = xtrabackup-v2
wsrep_retry_autocommit  = 10
wsrep_applier_threads     = 32
wsrep_certify_nonPK     = 1
wsrep_applier_FK_checks	= 0
wsrep_certification_rules = optimized
pxc_strict_mode         = PERMISSIVE
pxc-encrypt-cluster-traffic = OFF
wsrep_ignore_apply_errors = 7
lock_wait_timeout	= 7200
# LOGGING #
skip-log-bin
general_log             = ON
general_log_file        = /var/log/mysql/general.log
wsrep_log_conflicts     = ON
log_error               = /var/log/mysql/error.log
slow_query_log          = 1
slow_query_log_file     = /var/log/mysql/slow.log
long_query_time         = 1
# CACHES AND LIMITS #
tmp_table_size          = 32M
max_heap_table_size     = 32M
max_connections         = 1000
open_files_limit        = 240010
table_definition_cache  = 262144
table_open_cache        = 4096
wait_timeout            = 14400
sort_buffer_size        = 2M
thread_cache_size       = 80
thread_pool_size        = 16
performance_schema_digests_size = 1048576
# INNODB #
innodb_flush_method             = O_DIRECT
innodb_log_files_in_group       = 2
innodb_log_file_size            = 2G
innodb_flush_log_at_trx_commit  = 2
innodb_file_per_table           = 1
innodb_buffer_pool_size         = 64G
innodb_thread_concurrency       = 0
innodb_autoinc_lock_mode        = 2
innodb_buffer_pool_instances    = 64
innodb_read_io_threads          = 16
innodb_write_io_threads         = 16
innodb_io_capacity		= 200
innodb_open_files               = 262144
innodb_print_all_deadlocks      = 1
[sst]
progress = 1
compressor='zstd -2 -T6'
decompressor='zstd -d -T6'
backup_threads=6
[xtrabackup]
parallel=6
pxc-node2 mysql-error.log
2022-03-16T17:34:26.082428Z 196328 [Note] [MY-000000] [WSREP] MDL conflict db=xxx_dbname table=xxx_client ticket=10 solved by abort
2022-03-16T17:34:27.256059Z 199044 [Note] [MY-000000] [WSREP] MDL conflict db=xxx_dbname table=xxx_client ticket=10 solved by abort
2022-03-16T17:34:29.284658Z 14 [Note] [MY-000000] [WSREP] MDL BF-BF conflict
schema:  xxx_dbname
request: (thd-tid:14    seqno:119068119         exec-mode:toi, query-state:exec, conflict-state:committed)
          cmd-code:3 3  query:ALTER TABLE `xxx_common_client` ADD `phone_hash` varchar(255) UNIQUE)
granted: (thd-tid:18    seqno:119068125         exec-mode:high priority, query-state:exec, conflict-state:committing)
          cmd-code:3 167        query:(null))
2022-03-16T17:34:29.284706Z 14 [Note] [MY-000000] [WSREP] MDL ticket: type: shared write, space: TABLE, db: xxx_dbname, name: xxx_client
2022-03-16T17:34:29.284741Z 14 [ERROR] [MY-010119] [Server] Aborting
2022-03-16T17:34:29.284755Z 14 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2022-03-16T17:34:31.290194Z 14 [Warning] [MY-000000] [Server] /usr/sbin/mysqld: Forcing close of thread 210453  user: 'xxx_dbname'
2022-03-16T17:34:31.290722Z 2 [Note] [MY-000000] [WSREP] rollbacker thread exiting 2
2022-03-16T17:34:31.290801Z 14 [Note] [MY-000000] [WSREP] Server status change synced -> disconnecting
2022-03-16T17:34:31.290829Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-03-16T17:34:31.290899Z 14 [Note] [MY-000000] [Galera] Closing send monitor...
2022-03-16T17:34:31.290921Z 14 [Note] [MY-000000] [Galera] Closed send monitor.
2022-03-16T17:34:31.290984Z 14 [Note] [MY-000000] [Galera] gcomm: terminating thread
2022-03-16T17:34:31.291028Z 14 [Note] [MY-000000] [Galera] gcomm: joining thread
2022-03-16T17:34:31.291822Z 14 [Note] [MY-000000] [Galera] gcomm: closing backend
2022-03-16T17:34:31.293414Z 14 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,a8594bf1-864a,11)
memb {
        d57e7da2-86e4,0
        }
joined {
        }
left {
        }
partitioned {
        a8594bf1-864a,0
        ea59cd1a-a3b1,0
        }
)
2022-03-16T17:34:31.293492Z 14 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2022-03-16T17:34:31.293519Z 14 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2022-03-16T17:34:31.293782Z 14 [Note] [MY-000000] [Galera] gcomm: closed
2022-03-16T17:34:31.293951Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-03-16T17:34:31.294090Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [128, 160]
2022-03-16T17:34:31.294113Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2022-03-16T17:34:31.294130Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 119068160)
2022-03-16T17:34:31.294163Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2022-03-16T17:34:31.294204Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2022-03-16T17:34:31.294226Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2022-03-16T17:34:31.294244Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 119068160)
2022-03-16T17:34:31.294265Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2022-03-16T17:34:31.294731Z 14 [Note] [MY-000000] [Galera] recv_thread() joined.
2022-03-16T17:34:31.294804Z 14 [Note] [MY-000000] [Galera] Closing replication queue.
2022-03-16T17:34:31.294828Z 14 [Note] [MY-000000] [Galera] Closing slave action queue.
Maybe there is some magic wsrep option that will prevent a nodes from dropping out of a cluster during mass ALTER and MDL conflicts? Or are we wishing for a weird and contradictory ACID?
Or maybe we need some tuning wsrep flow control?
Thank you all in advance for your advice and tips.
