Hello all,
Can someone shed some light on the problem with the MDL conflict causing MySQL to shutdown node in the cluster?
Our enviroment is 3 PXC nodes with loadbalancing via ProxySQL:
node1 read-write
node2 read-only
node3 read-only
8.0.26-16.1 Percona XtraDB Cluster (GPL), Release rel16, Revision b141904, WSREP version 26.4.3
When a problem occurs:
When we execute mass ALTER tables (adding columns/keys/foreign keys) on read-write node1, other nodes may crash because of MDL conflicts.
node1 my.cnf
[mysqld]
datadir = /var/lib/mysql
tmpdir = /srv/mysql_tmpdir
internal_tmp_mem_storage_engine = MEMORY
user = mysql
default_storage_engine = InnoDB
binlog_format = ROW
skip-name-resolve = 1
sql_mode = ""
collation_server = utf8_unicode_ci
character_set_server = utf8
# WSREP #
wsrep_provider = /usr/lib/galera4/libgalera_smm.so
wsrep_provider_options = "gcache.size=1G;gcache.recover=yes;gcs.fc_limit=160;gcs.fc_factor=0.8;pc.weight=1;cert.log_conflicts=yes;"
wsrep_cluster_address = gcomm://10.0.1.10,10.0.2.10,10.0.3.10
wsrep_node_address = 10.0.1.10
wsrep_node_name = pxc-node1
wsrep_sst_donor = pxc-node3
wsrep_cluster_name = pxc-cluster
wsrep_sst_method = xtrabackup-v2
wsrep_retry_autocommit = 10
wsrep_applier_threads = 32
wsrep_certify_nonPK = 1
wsrep_applier_FK_checks = 0
wsrep_certification_rules = optimized
pxc_strict_mode = PERMISSIVE
pxc-encrypt-cluster-traffic = OFF
wsrep_ignore_apply_errors = 7
lock_wait_timeout = 7200
# LOGGING #
skip-log-bin
general_log = ON
general_log_file = /var/log/mysql/general.log
wsrep_log_conflicts = ON
log_error = /var/log/mysql/error.log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 1
# CACHES AND LIMITS #
tmp_table_size = 32M
max_heap_table_size = 32M
max_connections = 1000
open_files_limit = 240010
table_definition_cache = 262144
table_open_cache = 4096
wait_timeout = 14400
sort_buffer_size = 2M
thread_cache_size = 80
thread_pool_size = 16
performance_schema_digests_size = 1048576
# INNODB #
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table = 1
innodb_buffer_pool_size = 64G
innodb_thread_concurrency = 0
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_instances = 64
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_io_capacity = 200
innodb_open_files = 262144
innodb_print_all_deadlocks = 1
[sst]
progress = 1
compressor='zstd -2 -T6'
decompressor='zstd -d -T6'
backup_threads=6
[xtrabackup]
parallel=6
pxc-node2 mysql-error.log
2022-03-16T17:34:26.082428Z 196328 [Note] [MY-000000] [WSREP] MDL conflict db=xxx_dbname table=xxx_client ticket=10 solved by abort
2022-03-16T17:34:27.256059Z 199044 [Note] [MY-000000] [WSREP] MDL conflict db=xxx_dbname table=xxx_client ticket=10 solved by abort
2022-03-16T17:34:29.284658Z 14 [Note] [MY-000000] [WSREP] MDL BF-BF conflict
schema: xxx_dbname
request: (thd-tid:14 seqno:119068119 exec-mode:toi, query-state:exec, conflict-state:committed)
cmd-code:3 3 query:ALTER TABLE `xxx_common_client` ADD `phone_hash` varchar(255) UNIQUE)
granted: (thd-tid:18 seqno:119068125 exec-mode:high priority, query-state:exec, conflict-state:committing)
cmd-code:3 167 query:(null))
2022-03-16T17:34:29.284706Z 14 [Note] [MY-000000] [WSREP] MDL ticket: type: shared write, space: TABLE, db: xxx_dbname, name: xxx_client
2022-03-16T17:34:29.284741Z 14 [ERROR] [MY-010119] [Server] Aborting
2022-03-16T17:34:29.284755Z 14 [Note] [MY-000000] [WSREP] Initiating SST cancellation
2022-03-16T17:34:31.290194Z 14 [Warning] [MY-000000] [Server] /usr/sbin/mysqld: Forcing close of thread 210453 user: 'xxx_dbname'
2022-03-16T17:34:31.290722Z 2 [Note] [MY-000000] [WSREP] rollbacker thread exiting 2
2022-03-16T17:34:31.290801Z 14 [Note] [MY-000000] [WSREP] Server status change synced -> disconnecting
2022-03-16T17:34:31.290829Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2022-03-16T17:34:31.290899Z 14 [Note] [MY-000000] [Galera] Closing send monitor...
2022-03-16T17:34:31.290921Z 14 [Note] [MY-000000] [Galera] Closed send monitor.
2022-03-16T17:34:31.290984Z 14 [Note] [MY-000000] [Galera] gcomm: terminating thread
2022-03-16T17:34:31.291028Z 14 [Note] [MY-000000] [Galera] gcomm: joining thread
2022-03-16T17:34:31.291822Z 14 [Note] [MY-000000] [Galera] gcomm: closing backend
2022-03-16T17:34:31.293414Z 14 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,a8594bf1-864a,11)
memb {
d57e7da2-86e4,0
}
joined {
}
left {
}
partitioned {
a8594bf1-864a,0
ea59cd1a-a3b1,0
}
)
2022-03-16T17:34:31.293492Z 14 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2022-03-16T17:34:31.293519Z 14 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2022-03-16T17:34:31.293782Z 14 [Note] [MY-000000] [Galera] gcomm: closed
2022-03-16T17:34:31.293951Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-03-16T17:34:31.294090Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [128, 160]
2022-03-16T17:34:31.294113Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2022-03-16T17:34:31.294130Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 119068160)
2022-03-16T17:34:31.294163Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2022-03-16T17:34:31.294204Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2022-03-16T17:34:31.294226Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2022-03-16T17:34:31.294244Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 119068160)
2022-03-16T17:34:31.294265Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2022-03-16T17:34:31.294731Z 14 [Note] [MY-000000] [Galera] recv_thread() joined.
2022-03-16T17:34:31.294804Z 14 [Note] [MY-000000] [Galera] Closing replication queue.
2022-03-16T17:34:31.294828Z 14 [Note] [MY-000000] [Galera] Closing slave action queue.
Maybe there is some magic wsrep option that will prevent a nodes from dropping out of a cluster during mass ALTER and MDL conflicts? Or are we wishing for a weird and contradictory ACID?
Or maybe we need some tuning wsrep flow control?
Thank you all in advance for your advice and tips.