Strange node hang

Hello,

We have the following setup:

  • 4 Percona XtraDB Cluster nodes running 5.5.31-23.7.5.438, x64 build
  • 1 master r/w node, with the others standing by as backups behind HAProxy

The issue happens occasionally at random (sometimes two days in a row, sometimes after a few weeks).
The active node just stops processing queries and the process list grows. There’s absolutely nothing in the error log (we have warnings enabled as well). By “nothing” I mean that even the usual warnings (the “update was ineffective” sort of thing) stop appearing during that period.

After that we issue a node restart and it starts up normally. I have zero log trace to analyze (at least in the logs I know about, such as the mysqld error log, syslog, etc.).
It’s not a memory/CPU issue; I monitor the server around the clock and there are no anomalies in the performance graphs.

What’s even worse, the clustercheck script sees the node as fully synced and operational. I can connect to the instance and issue a query; it gets added to the process list but waits indefinitely, as do all other queries. So we have backup nodes ready to kick in, but HAProxy never detects the outage.
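
(As far as I understand, clustercheck essentially just checks the wsrep sync state, roughly like the query below, so it has no way to catch this:)

SHOW GLOBAL STATUS LIKE 'wsrep_local_state';
-- clustercheck treats 4 (Synced) as healthy, which the hung node still reports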

The my.cnf buffer and memory parameters are the same as we had with a standalone Percona Server that never hung, and this server even has an extra 8 GB of RAM over the standalone instance (40 GB total).

Any ideas where to start the troubleshooting?

I have upped gcache.size from 128M to 4G and increased wsrep_slave_threads from 1 to 32.
What worries me is that wsrep_cert_deps_distance never gets larger than 1.
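
For reference, the relevant my.cnf lines now look roughly like this (a sketch; our actual wsrep_provider_options string contains other settings as well):

[mysqld]
wsrep_slave_threads = 32
# gcache.size is set inside the provider options string
wsrep_provider_options = "gcache.size=4G"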

I noticed (during yesterday’s hang) that the oldest query among those piling up was a REPLACE INTO on a MyISAM table (we have two MyISAM tables altogether).
The replication queue was empty (0) and cluster replication was not stalled.

Could this be the issue? Most of what I’ve read says that MyISAM replication is unreliable (regarding consistency), but nothing about it hanging a node.
I could switch MyISAM replication off in my.cnf, but I’m unwilling to do so until I’m certain it is what causes the lockups.
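
(If I do disable it, I assume it would just be something like the following in my.cnf, but I’d rather confirm the root cause first:)

[mysqld]
wsrep_replicate_myisam = OFF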

Can you paste a sample SHOW PROCESSLIST output from a moment when a node is hung? What are the queries waiting for? Also, I wonder whether the same issue happens after converting those tables to InnoDB.

I will post the full processlist as soon as it happens again. We have upgraded the nodes to 5.5.33; no hangups yet.

I don’t know what the queries are waiting for; they seem to be in various states of execution. There is also a certain number of processes in the ‘wsrep in pre-commit stage’ state.
You can execute SHOW STATUS, SHOW VARIABLES and the like, but no actual DB query ever gets executed.
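
To illustrate (some_table is just a placeholder here, not one of our real tables):

SHOW GLOBAL STATUS LIKE 'wsrep%';    -- returns instantly on the hung node
SELECT COUNT(*) FROM some_table;     -- never returns, just sits in the processlist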

As for InnoDB conversion, we have to keep one table on MyISAM because of the full-text search. We could turn MyISAM replication off on the cluster, but I want to be sure it is the root cause of the hangups.

We had two hangs yesterday. I haven’t been able to get a processlist since I wasn’t at a PC (I restarted via a Webmin script from my smartphone), but I did find this on every node except the master node (the others are backup/standby nodes):

WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping
131004 10:48:57 [Note] WSREP: (7569e389-2698-11e3-8693-139228b987aa, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid 7569e389-2698-11e3-8693-139228b987aa is blacklisted, skipping

The master node has absolutely nothing relevant logged until the restart. The entries above appeared during both hangs. What do these messages indicate? xx.xx.xx.xx is the IP of the master node.

Hello,

You asked me for the full processlist when the cluster hangs… can I PM it to you? I don’t want to obfuscate the queries, and I’d rather not have our DB queries showing up in Google searches. :smiley:
We get attacks now and then.

Thanks in advance.

Regards,

Marko

Sorry for the long delay, I forgot about this thread.
This message:

WSREP: (xxx, 'tcp://0.0.0.0:4567') address 'tcp://xx.xx.xx.xx:4567' pointing to uuid xxx is blacklisted, skipping

is usually seen on nodes trying to connect to the cluster when no working node is in Primary state. So it looks like the master node went into non-Primary state after this hang.
To allow the other nodes to re-join, you need to force Primary state on the surviving node like this:

SET GLOBAL wsrep_provider_options="pc.bootstrap=1";

and then starting MySQL on the other nodes should succeed.
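
Once that is done you can verify that the node reports Primary again:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';    -- should show 'Primary'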

Yes, you can send me the processlist via priv.

It seems you cannot receive private messages :slight_smile:

At the time of the hangup, the primary node (read/write; the others are only for backup) becomes unresponsive and I see the “blacklisted” messages on the other nodes, while the primary node has nothing in its error log and never recovers until restarted.

When I restart the primary node only, the rest of the cluster quickly syncs with the primary.

This is the wsrep status dump at the time of the hangup (note that you can connect to the primary node and execute variable and status queries, but no query on actual tables ever succeeds; it just waits indefinitely until the number of connections is saturated):

Variable_name Value
wsrep_local_state_uuid 4c3aae36-ff25-11e2-b3f2-f2c8c18d67ea
wsrep_protocol_version 4
wsrep_last_committed 201895717
wsrep_replicated 17129710
wsrep_replicated_bytes 17889140306
wsrep_received 35425
wsrep_received_bytes 982833
wsrep_local_commits 17127813
wsrep_local_cert_failures 0
wsrep_local_bf_aborts 0
wsrep_local_replays 0
wsrep_local_send_queue 0
wsrep_local_send_queue_avg 0.000000
wsrep_local_recv_queue 34
wsrep_local_recv_queue_avg 0.000000
wsrep_flow_control_paused 1.000000
wsrep_flow_control_sent 0
wsrep_flow_control_recv 0
wsrep_cert_deps_distance 193.730000
wsrep_apply_oooe 0.000000
wsrep_apply_oool 0.000000
wsrep_apply_window 0.000000
wsrep_commit_oooe 0.000000
wsrep_commit_oool 0.000000
wsrep_commit_window 0.000000
wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 368
wsrep_causal_reads 0
wsrep_incoming_addresses 10.42.71.90:3306,10.42.71.69:3306,10.42.71.68:3306,10.42.71.91:3306
wsrep_cluster_conf_id 121
wsrep_cluster_size 4
wsrep_cluster_state_uuid 4c3aae36-ff25-11e2-b3f2-f2c8c18d67ea
wsrep_cluster_status Primary
wsrep_connected ON
wsrep_local_index 2
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <info@codership.com>
wsrep_provider_version 2.7(r157)
wsrep_ready ON
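
(For reference, the dump above comes from a plain status query, something like:)

SHOW GLOBAL STATUS LIKE 'wsrep%';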