PXC stops working and overloads the server

Hi.
We are trying to implement PXC and have run into a strange problem. The tests were conducted on a PXC + Zabbix setup.

Server configuration (3 nodes):

First node (10.10.92.3, neon):
CPU: 2 X Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 96Gb
Storage: 3 SSD drives Samsung 840 Pro 512Gb (with LVM)

Second node (10.10.91.4, natrium):
CPU: 2 X Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Memory: 96Gb
Storage: 3 SSD drives Samsung 840 Pro 512Gb (with LVM)

Third node (10.10.92.2, blackbird):
CPU: 2 X Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
Memory: 96Gb
Storage: 3 SSD drives Samsung 840 Pro 512Gb (with LVM)

All nodes are connected directly via 10G network interfaces.
At the time of testing, Zabbix is connected only to the third node (no load balancing).

What’s going on:

Immediately after launching Zabbix, everything works correctly. After approximately 8-12 hours, the problems begin. On one of the nodes (a different one each time) the CPU load (system time) rises sharply, to the point where it is not even possible to connect via SSH.
When this happens, disconnecting Zabbix and the other servers does not solve the problem; the CPU load does not go down.
The last time, the problem was on the second server. The other nodes simply lose connection with the problematic one. Here are the logs from the third node: see attachment ‘log_3_node.txt’

Log from the problematic server: see attachment ‘log_2_node.txt’

We use: Server version: 5.6.20-68.0-56-log Percona XtraDB Cluster (GPL), Release rel68.0, Revision 888, WSREP version 25.7, wsrep_25.7.r4126

Config file my.cnf:


[mysqld_safe]

open-files-limit = 120000


[mysqld]

# ///// General

skip-name-resolve

event_scheduler = On

user = mysql
bind-address = 10.10.91.2
port = 3306
max_connections = 2048

datadir = /var/lib/mysql
socket = /var/lib/mysql/mysql.sock
tmpdir = /tmp/mysql

symbolic-links = 0

table_open_cache = 4096
table_definition_cache = 4096

thread_cache_size = 256

default_storage_engine = InnoDB
ft_min_word_len = 3

large-pages




# ///// innodb

innodb_buffer_pool_size = 64G
innodb_buffer_pool_instances = 16

innodb_log_file_size = 4G
innodb_log_buffer_size = 128M
innodb_log_group_home_dir = /var/lib/mysql_logs/innodb
innodb_data_file_path = /ibdata1:64M:autoextend

innodb_open_files = 4096
innodb_file_per_table = 1
innodb_rollback_on_timeout = On
innodb_flush_log_at_trx_commit = 0
innodb_doublewrite = 0
innodb_flush_method = O_DIRECT
innodb_lock_wait_timeout = 300
innodb_flush_neighbors = 0

innodb_support_xa = 0

innodb_autoinc_lock_mode = 2 # Galera
innodb_locks_unsafe_for_binlog = 1 # Galera

innodb_io_capacity = 100



# ///// MyISAM

key_buffer_size = 128M
query_cache_size = 0


# ///// binlog \ relaylog

log-bin = /var/lib/mysql_logs/binary/binlog
max_binlog_size = 1024M
binlog_format = ROW
binlog_cache_size = 5M
expire_logs_days = 1
max_binlog_files = 10
sync_binlog = 0

relay_log = /var/lib/mysql_logs/relay/relaylog
slave_load_tmpdir = /tmp/mysql
log_slave_updates = On

# ///// BEGIN Replication

server-id = 10

slave_parallel_workers = 4
skip-slave-start = On
log_bin_trust_function_creators = ON


# ///// log

log_error = "/var/log/mysql/error.log"


# ///// galera

wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_provider_options="gcache.size=32G; gcache.name = /var/lib/mysql_logs/galera/galera.cache;"

wsrep_cluster_address=gcomm://10.10.91.2,10.10.91.3,10.10.91.4

wsrep_node_address="10.10.91.2"

wsrep_cluster_name="PXC"
wsrep_node_name="blackbird"

wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = sst_xtrabackup:passhere

wsrep_notify_cmd = '/usr/local/bin/wsrep_notify.sh'
wsrep_replicate_myisam=On
wsrep_forced_binlog_format = ROW
wsrep_log_conflicts = Off
wsrep_auto_increment_control = On
wsrep_retry_autocommit = 10
wsrep_slave_threads = 64
wsrep_convert_LOCK_to_trx = 1


All nodes have the same configuration; only wsrep_node_address, wsrep_node_name, and server-id differ.
htop output after the problem occurs:

Does anyone have any ideas?

log_3_node.txt (13.8 KB)

log_2_node.txt (17.4 KB)

That kind of CPU saturation probably means there is some kind of contention issue.
So you put the stress test on the 3rd node, which is then replicated to the 1st and 2nd nodes. But there is quite a significant difference between the Xeon E5-2660 and the E5620! The master machine is much more powerful in terms of CPU than the other two, which can lead to a situation where the weaker slave nodes cannot keep up with the traffic. This is, IMHO, not an optimal situation for a Galera-based cluster.
Anyway, if you cannot SSH to the stalled machine, I think you should set up pt-stalk in daemon mode to execute status collection during the CPU overload.
This way you should be able to gather a lot of useful data about why the system is so overloaded.
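
A minimal pt-stalk sketch for that, assuming the Percona Toolkit defaults (the destination path and threshold below are only examples; Threads_running, the default trigger variable, usually spikes during this kind of stall):

# Collect diagnostics when Threads_running stays above 100 for 5 consecutive checks;
# the collected data lands under /var/lib/pt-stalk.
pt-stalk --daemonize --dest /var/lib/pt-stalk \
         --variable Threads_running --threshold 100 --cycles 5

Connection credentials can go into the usual ~/.my.cnf, since pt-stalk calls the regular mysql/mysqladmin clients.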

przemek, thank you for your reply.
As far as I understand, this is controlled by the “Flow Control” feature. Could you please tell me, in the case of multi-master replication, what values would be best for the ‘fc’ settings so that transactions are applied at a rate the whole cluster can sustain? In other words, if one server is slower than the others, the entire cluster should run at that server’s speed (we will upgrade the hardware, but until then I would like to avoid such errors).
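
For reference, these are the provider options I mean, as they would sit next to our current gcache settings; the numbers below are only placeholders to show the syntax, not values we have tested (gcs.fc_limit is the receive-queue length at which a node asks the cluster to pause, gcs.fc_factor the fraction of that limit at which replication resumes):

wsrep_provider_options="gcache.size=32G; gcache.name=/var/lib/mysql_logs/galera/galera.cache; gcs.fc_limit=256; gcs.fc_factor=0.9;"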
Thank you very much.

This particular problem does not quite look like Flow Control being triggered. If you could not even connect to the server over SSH, it must be something more. Flow control pauses all writes when one of the nodes can’t keep up with replication. I do wonder, though, what happens when the server is paused and perhaps all connections get used up (max_connections = 2048). Did you see any increased memory usage? Maybe the server was swapping, and that made it hard to SSH in.
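
If you can still get a client connection during the stall, a quick way to check whether flow control is actually involved is to look at the relevant counters on each node, e.g.:

-- Flow control counters (fraction of time paused, FC messages sent/received):
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';
-- Depth of the replication receive queue on this node:
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
-- How close you are to max_connections:
SHOW GLOBAL STATUS LIKE 'Threads_connected';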

Hi.
przemek, swap is disabled on all the servers.
For test purposes, I switched Zabbix to work with node 1 (10.10.92.3, neon).
The zabbix-server had been working for more than a day, and when I started using the web interface the problem appeared again. I stopped zabbix-server, and after a few seconds the cluster recovered. However, just 1-2 seconds later I again saw one core fully loaded (system time). I ran “show processlist” and saw the following entry:

| 69582 | root | localhost | | Query | 36 | Writing to net | show /*!50002 GLOBAL */ status | 455 | 455 |

This query comes from the scripts in the “percona-zabbix-templates” package. The query execution time is 36 seconds, and the state is “Writing to net”. Having seen this, I started zabbix-server again and tried to reproduce the error. I succeeded, but the console simply stopped responding. I stopped zabbix-server once more and quickly ran “show full processlist”, and this is what I got:

[root@neon absokolov]# cat ./full_proc | grep -v 'system user' | grep -v 'Rows_examined'
72733 __zabbix 194.50.85.240:49668 __zabbix Query 22 Writing to net select distinct t.triggerid, t.description, t.expression, t.error, t.priority, t.type, t.value, t.state, t.lastchange from hosts h, items i, functions f, triggers t where h.hostid = i.hostid and i.itemid = f.itemid and f.triggerid = t.triggerid and h.status = 0 and i.status = 0 and t.status = 0 and t.flags <> 2 45964 0
72747 __zabbix 194.50.85.240:49672 __zabbix Query 13 wsrep in pre-commit stage COMMIT 0 0
72748 unauthenticated user 194.50.85.240:49673 NULL Connect NULL Writing to net NULL 0 0
72749 unauthenticated user 194.50.85.240:49674 NULL Connect NULL Writing to net NULL 0 0
72751 unauthenticated user 194.50.85.240:49676 NULL Connect NULL Writing to net NULL 0 0
72755 root localhost NULL Query 0 init show full processlist 0 0

Again, the processes are in the “Writing to net” state.
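
“Writing to net” means the thread is blocked sending its result set to the client socket, so the next thing worth ruling out is the network path to the Zabbix host. A rough check on the node during the stall (looking only at port 3306 connections):

# A persistently large Send-Q here would point at the network path to the
# Zabbix server rather than at mysqld itself.
ss -tnp | grep 3306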

‘iperf’ result from node 1 to node 2 (10G direct physical link):

[root@neon /]# iperf -c 10.10.91.4
------------------------------------------------------------
Client connecting to 10.10.91.4, TCP port 5001
TCP window size: 19.3 KByte (default)
------------------------------------------------------------
[ 3] local 10.10.91.3 port 38371 connected with 10.10.91.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 10.9 GBytes 9.38 Gbits/sec


‘iperf’ result from node 1 to zabbix-server (KVM virtual machine):

[root@neon /]# iperf -c 194.50.85.240
------------------------------------------------------------
Client connecting to 194.50.85.240, TCP port 5001
TCP window size: 19.3 KByte (default)
------------------------------------------------------------
[ 3] local 194.50.85.10 port 37159 connected with 194.50.85.240 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 942 Mbits/sec


On plain Percona Server (without clustering), everything works very well.

The number of connections:

mysql> show status like 'Max_used%';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Max_used_connections | 118 |
+----------------------+-------+
1 row in set (0,00 sec)


show /*!50002 GLOBAL */ status is nothing unusual; any monitoring software for MySQL reads the status variables to get server metrics. The question is why it stalls for 36 seconds in the “Writing to net” state.
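
One way to narrow that down would be to time the same query locally over the socket and then over TCP from the Zabbix VM, to see whether the stall is server-side or on the network path (socket path and host are taken from your earlier posts; add your usual credentials):

# On the node itself, over the local socket:
time mysql --socket=/var/lib/mysql/mysql.sock -e 'SHOW /*!50002 GLOBAL */ STATUS' > /dev/null
# From the Zabbix VM, over TCP:
time mysql --host=10.10.92.3 --port=3306 -e 'SHOW /*!50002 GLOBAL */ STATUS' > /dev/null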

Hi.
I did a lot of tests with different settings and got some interesting results.
We have found that the problem depends on the “gcache.size” setting. Recent tests:

  1. “gcache.size = 32G” - this is the value I used initially.
  2. “gcache.size = 10G” - nothing changed, the problem persists.
  3. “gcache.size = 2G” - the cluster is much more stable. Nodes disconnected several times, but then recovered (though not always).
  4. “gcache.size = 1G” - the cluster has been stable for several days. Periodically I see messages about additional pages “gcache.page.0000xx” being created and deleted (the corresponding provider-options line is sketched below this list).
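
For reference, with the 1G ring buffer the gcache-related part of wsrep_provider_options looks roughly like this (gcache.page_size controls the size of the on-demand page files from item 4; the value shown here is only illustrative, I have not changed it):

wsrep_provider_options="gcache.size=1G; gcache.name=/var/lib/mysql_logs/galera/galera.cache; gcache.page_size=128M;"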

The documentation says that gcache is only a write-set log that allows a node to rejoin the cluster via IST.
So how can it be related? Why does changing the value of “gcache.size” change the cluster’s behavior?

I also experimented with the parameters sync_binlog, innodb_flush_log_at_trx_commit, and innodb_log_file_size. I know these parameters are not related to gcache; I just wanted to vary the load on the I/O subsystem and watch for changes.

Does anyone have any ideas?
Thank you!

I’m curious, which version of Zabbix are you running?

Running Zabbix 2.2 and a slightly older (probably about 6-8 months) version of PXC worked great. As soon as we upgraded to Zabbix 2.4 and PXC 5.6.20-25.7.888, we started having this same issue.

I originally thought that it was something that changed in Zabbix 2.4 that caused the issue, but I’m starting to see something similar on a 5.6.20-25.7.888 cluster that isn’t running Zabbix. Thinking about downgrading now. I know 5.6.19-25.6.824 works.