Wsrep_cert_deps_distance/wsrep_local_recv_queue_avg high values decreasing

Hi, I have a 3-node Percona XtraDB Cluster 5.7 (5.7.25-31). Unfortunately I lost one node and had to rebuild it completely through the WSREP layer (after SST the node became Synced with the others).

Since the recovery I have been seeing high values of wsrep_cert_deps_distance and wsrep_local_recv_queue_avg, although they are decreasing.

AFAIK these values should come down to around 0.x (as on the other nodes).

I would like to know whether it is safe to wait for these values to drop before putting the node back into service for the application.

Current status:

node 1: wsrep_cert_deps_distance   | 588.899176
        wsrep_local_recv_queue_avg | 59009.332682 (initial value after SST: 362700.476338)
node 2: wsrep_cert_deps_distance   | 52.880276
        wsrep_local_recv_queue_avg | 0.031238
node 3: wsrep_cert_deps_distance   | 52.870661
        wsrep_local_recv_queue_avg | 0.634849

All nodes are Synced, with no latency and no spikes on flow control.


From the official documentation:

There are two status variables you can use to find slow nodes in a cluster: wsrep_flow_control_sent and wsrep_local_recv_queue_avg. Check these status variables on each node in a cluster. The node that returns the highest value is the slowest one.

The wsrep_flow_control_sent variable provides the number of times a node sent a pause event due to flow control since the last status query. The wsrep_local_recv_queue_avg variable returns the average of the received queue length since the last status query. A node that returns a value much higher than 0.0 cannot apply write-sets as fast as it receives them and can trigger replication throttling.

Check these status variables on each node in your cluster. The node that returns the highest value is the slowest node. Lower values are preferable.
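In practice, that check boils down to running something like the following on every node from a mysql client session and comparing the outputs (a sketch of the check the documentation describes, not an official script); the node returning the highest values is the slowest:

-- run this on each node and compare the results across the cluster
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_sent', 'wsrep_local_recv_queue_avg');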


Hi @dclem thanks for posting to the Percona forums!
It appears that node 1 is unable to apply the volume of data changes in the cluster, as indicated by the high wsrep_local_recv_queue_avg. What makes this node unique in your deployment? Does it have fewer vCPUs than the other nodes, or slower disks?
If you use PMM, could you share with us the following graphs from the PXC/Galera Cluster Summary dashboard:

  • flow control paused time
  • flow control messages sent
  • writeset outbound traffic
  • average galera replication latency
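(If PMM is not collecting from that node yet, the raw counters behind those panels can be sampled directly with the mysql client; this is a rough equivalent of what the panels show, not the exact queries PMM runs.)

-- flow control: fraction of time paused, time paused in ns, pause events sent
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused', 'wsrep_flow_control_paused_ns', 'wsrep_flow_control_sent');
-- outbound write-set traffic replicated from this node, in bytes
SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
-- group-communication replication latency: min/avg/max/stddev/sample size
SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';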

No, nothing is different from the other nodes. As the node recovered from SST, wsrep_local_recv_queue_avg started out with high values, then kept decreasing and is still decreasing. Right now:

+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| wsrep_cert_deps_distance | 251.988942 |
+--------------------------+------------+
1 row in set (0.00 sec)

+----------------------------+--------------+
| Variable_name              | Value        |
+----------------------------+--------------+
| wsrep_local_recv_queue_avg | 20697.827244 |
+----------------------------+--------------+
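Side note on interpreting that number: wsrep_local_recv_queue_avg is an average over the whole sampling window (per the Galera manual, since the last FLUSH STATUS), so the huge post-SST backlog keeps inflating it long after the live queue has drained. To get a reading that reflects only recent traffic, something like this should work:

-- instantaneous receive queue length (not the average)
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
-- restart the sampling window for the *_avg counters
-- (note: FLUSH STATUS also resets various other status counters)
FLUSH STATUS;
-- after some normal traffic, re-check the average over the new window
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';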

A question 🙂

I've found these warnings; is this something I should be concerned about?

" 2021-08-25T10:32:08.051639+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 2: 71 (Protocol error)

2021-08-25T10:32:08.055746+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 4: 71 (Protocol error)

2021-08-25T10:32:08.057831+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 5: 71 (Protocol error)

2021-08-25T10:32:08.058535+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 6: 71 (Protocol error)

2021-08-25T10:32:08.351899+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 6: 71 (Protocol error)

2021-08-25T10:32:09.077429+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 3: 71 (Protocol error)"

On the PMM metrics:

  • flow control paused time ← sporadic and low
  • flow control messages sent ← only 2 events, max spike 0.02
  • writeset outbound traffic ← max 3M
  • average galera replication latency ← max spike on one node: 3.13 ms (looking back over the last 30 days)