Wsrep_cert_deps_distance/wsrep_local_recv_queue_avg high values decreasing

Hi, I have a 3-node Percona XtraDB Cluster 5.7 (5.7.25-31). Unfortunately I lost one node and had to rebuild it completely through the WSREP layer (after SST the node became Synced with the others).

Since the recovery I have been seeing high values of wsrep_cert_deps_distance and wsrep_local_recv_queue_avg, although they are decreasing.

AFAIK these values should come down to around 0.x (as on the other nodes).

I would like to know whether it is safe to wait for these values to drop before putting the node back into service for the application.

Current status:

node 1: wsrep_cert_deps_distance   | 588.899176
        wsrep_local_recv_queue_avg | 59009.332682 (initial value after SST: 362700.476338)
node 2: wsrep_cert_deps_distance   | 52.880276
        wsrep_local_recv_queue_avg | 0.031238
node 3: wsrep_cert_deps_distance   | 52.870661
        wsrep_local_recv_queue_avg | 0.634849

All nodes are Synced, with no latency and no spikes on flow control.


From the official documentation:

There are two status variables you can use to find slow nodes in a cluster: wsrep_flow_control_sent and wsrep_local_recv_queue_avg. Check these status variables on each node in a cluster. The node that returns the highest value is the slowest one.

The wsrep_flow_control_sent variable provides the number of times a node sent a pause event due to flow control since the last status query. The wsrep_local_recv_queue_avg variable returns the average of the received queue length since the last status query. A node that returns a value much higher than 0.0 cannot apply write-sets as fast as it receives them and can trigger replication throttling.

Check these status variables on each node in your cluster. The node that returns the highest value is the slowest node. Lower values are preferable.
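In practice, that check boils down to running something like the following on every node from a mysql client session and comparing the outputs (a sketch of the check the documentation describes, not an official script); the node returning the highest values is the slowest:

-- run this on each node and compare the results across the cluster
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_sent', 'wsrep_local_recv_queue_avg');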


Hi @dclem thanks for posting to the Percona forums!
It appears that node 1 is unable to apply the volume of data changes in the cluster, as indicated by the high wsrep_local_recv_queue_avg. What makes this node unique in your deployment? Does it have fewer vCPUs than the other nodes, or slower disks?
If you use PMM, could you share with us the following graphs from the PXC/Galera Cluster Summary dashboard:

  • flow control paused time
  • flow control messages sent
  • writeset outbound traffic
  • average galera replication latency
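(If PMM is not collecting from that node yet, the raw counters behind those panels can be sampled directly with the mysql client; this is a rough equivalent of what the panels show, not the exact queries PMM runs.)

-- flow control: fraction of time paused, time paused in ns, pause events sent
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_paused', 'wsrep_flow_control_paused_ns', 'wsrep_flow_control_sent');
-- outbound write-set traffic replicated from this node, in bytes
SHOW GLOBAL STATUS LIKE 'wsrep_replicated_bytes';
-- group-communication replication latency: min/avg/max/stddev/sample size
SHOW GLOBAL STATUS LIKE 'wsrep_evs_repl_latency';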

No, nothing is different from the other nodes. As the node recovered from SST, wsrep_local_recv_queue_avg started out with high values, then kept decreasing and is still decreasing. Right now:

+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| wsrep_cert_deps_distance | 251.988942 |
+--------------------------+------------+
1 row in set (0.00 sec)

+----------------------------+--------------+
| Variable_name              | Value        |
+----------------------------+--------------+
| wsrep_local_recv_queue_avg | 20697.827244 |
+----------------------------+--------------+
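Side note on interpreting that number: wsrep_local_recv_queue_avg is an average over the whole sampling window (per the Galera manual, since the last FLUSH STATUS), so the huge post-SST backlog keeps inflating it long after the live queue has drained. To get a reading that reflects only recent traffic, something like this should work:

-- instantaneous receive queue length (not the average)
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
-- restart the sampling window for the *_avg counters
-- (note: FLUSH STATUS also resets various other status counters)
FLUSH STATUS;
-- after some normal traffic, re-check the average over the new window
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';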

A question 🙂

I've found these warnings; is this something I should be concerned about?

" 2021-08-25T10:32:08.051639+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 2: 71 (Protocol error)

2021-08-25T10:32:08.055746+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 4: 71 (Protocol error)

2021-08-25T10:32:08.057831+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 5: 71 (Protocol error)

2021-08-25T10:32:08.058535+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 6: 71 (Protocol error)

2021-08-25T10:32:08.351899+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 6: 71 (Protocol error)

2021-08-25T10:32:09.077429+01:00 0 [Warning] WSREP : unserialize error invalid protocol version 3: 71 (Protocol error)"

On the PMM metrics:

  • flow control paused time ← sporadic and low
  • flow control messages sent ← only 2 events, max spike 0.02
  • writeset outbound traffic ← max 3M
  • average galera replication latency ← max spike on one node: 3.13 ms (looking back over the last 30 days)