Hi.
We are trying to deploy Percona XtraDB Cluster (PXC) and have run into a strange problem. The tests were conducted with a PXC cluster serving as the database backend for Zabbix.
Server configuration (3 nodes):
First node (10.10.92.3, neon):
CPU: 2 × Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 96 GB
Storage: 3 × Samsung 840 Pro 512 GB SSD (with LVM)
Second node (10.10.91.4, natrium):
CPU: 2 × Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 96 GB
Storage: 3 × Samsung 840 Pro 512 GB SSD (with LVM)
Third node (10.10.92.2, blackbird):
CPU: 2 × Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
Memory: 96 GB
Storage: 3 × Samsung 840 Pro 512 GB SSD (with LVM)
All nodes are connected directly via 10G network interfaces.
At the time of testing, Zabbix is connected only to the third node (no load balancing).
What’s going on:
Immediately after Zabbix is started, everything works correctly. After approximately 8-12 hours the problems begin: on one of the nodes (a different one each time) the CPU load (system time) rises sharply, to the point that it is not even possible to connect via SSH.
When this happens, disconnecting Zabbix and the other servers does not solve the problem; the CPU load does not go down.
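For reference, here is what we plan to capture the next time the problem starts, assuming the mysql client on the affected node is still responsive. These are standard Galera/wsrep status counters, nothing specific to our setup:

-- Capture Galera apply/flow-control state on the suspect node
SHOW GLOBAL STATUS WHERE Variable_name IN
('wsrep_local_state_comment',   -- node state (Synced, Donor/Desynced, ...)
 'wsrep_local_recv_queue',      -- writesets received but not yet applied
 'wsrep_local_send_queue',      -- writesets queued for sending
 'wsrep_flow_control_paused',   -- fraction of time replication was paused
 'wsrep_cert_deps_distance');   -- how much parallel apply is possible

A steadily growing wsrep_local_recv_queue, or wsrep_flow_control_paused close to 1, would point at the node falling behind on applying writesets.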
The last time, the problem occurred on the second node. The other nodes simply lose their connection to the problematic one. Here are the logs from the third node: see attachment ‘log_3_node.txt’.
Logs from the problematic node itself: see attachment ‘log_2_node.txt’.
We are running: Server version: 5.6.20-68.0-56-log Percona XtraDB Cluster (GPL), Release rel68.0, Revision 888, WSREP version 25.7, wsrep_25.7.r4126
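(The version string above is what the server reports; the Galera provider build can also be cross-checked on each node with the standard status variable:)

SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
-- reports the Galera provider build loaded on this node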
Config file my.cnf:
[mysqld_safe]
open-files-limit = 120000
[mysqld]
# ///// General
skip-name-resolve
event_scheduler = On
user = mysql
bind-address = 10.10.91.2
port = 3306
max_connections = 2048
datadir = /var/lib/mysql
socket = /var/lib/mysql/mysql.sock
tmpdir = /tmp/mysql
symbolic-links = 0
table_open_cache = 4096
table_definition_cache = 4096
thread_cache_size = 256
default_storage_engine = InnoDB
ft_min_word_len = 3
large-pages
# ///// innodb
innodb_buffer_pool_size = 64G
innodb_buffer_pool_instances = 16
innodb_log_file_size = 4G
innodb_log_buffer_size = 128M
innodb_log_group_home_dir = /var/lib/mysql_logs/innodb
innodb_data_file_path = ibdata1:64M:autoextend
innodb_open_files = 4096
innodb_file_per_table = 1
innodb_rollback_on_timeout = On
innodb_flush_log_at_trx_commit = 0
innodb_doublewrite = 0
innodb_flush_method = O_DIRECT
innodb_lock_wait_timeout = 300
innodb_flush_neighbors = 0
innodb_support_xa = 0
innodb_autoinc_lock_mode = 2 # Galera
innodb_locks_unsafe_for_binlog = 1 # Galera
innodb_io_capacity = 100
# ///// MyISAM
key_buffer_size = 128M
query_cache_size = 0
# ///// binlog \ relaylog
log-bin = /var/lib/mysql_logs/binary/binlog
max_binlog_size = 1024M
binlog_format = ROW
binlog_cache_size = 5M
expire_logs_days = 1
max_binlog_files = 10
sync_binlog = 0
relay_log = /var/lib/mysql_logs/relay/relaylog
slave_load_tmpdir = /tmp/mysql
log_slave_updates = On
# ///// BEGIN Replication
server-id = 10
slave_parallel_workers = 4
skip-slave-start = On
log_bin_trust_function_creators = ON
# ///// log
log_error = "/var/log/mysql/error.log"
# ///// galera
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_provider_options="gcache.size=32G; gcache.name = /var/lib/mysql_logs/galera/galera.cache;"
wsrep_cluster_address=gcomm://10.10.91.2,10.10.91.3,10.10.91.4
wsrep_node_address="10.10.91.2"
wsrep_cluster_name="PXC"
wsrep_node_name="blackbird"
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = sst_xtrabackup:passhere
wsrep_notify_cmd = '/usr/local/bin/wsrep_notify.sh'
wsrep_replicate_myisam=On
wsrep_forced_binlog_format = ROW
wsrep_log_conflicts = Off
wsrep_auto_increment_control = On
wsrep_retry_autocommit = 10
wsrep_slave_threads = 64
wsrep_convert_LOCK_to_trx = 1
The configuration is the same on all nodes; only wsrep_node_address, wsrep_node_name, and server-id differ.
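To rule out configuration drift between the nodes, we can compare the effective values on each of them (a generic consistency check; the variable list below is just an example selection):

-- Run on every node and diff the output; apart from the per-node
-- values listed above, everything should be identical.
SHOW GLOBAL VARIABLES WHERE Variable_name IN
('wsrep_cluster_address', 'wsrep_provider_options',
 'wsrep_slave_threads', 'innodb_buffer_pool_size');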
htop output after the problem occurs: (screenshot)
Does anyone have any ideas?
log_3_node.txt (13.8 KB)
log_2_node.txt (17.4 KB)