StatefulSet service times out when nodes are down in Kubernetes cluster

I am evaluating a database solution for my Kubernetes cluster, so I installed percona-xtradb-cluster:5.7.19 with three StatefulSet replicas on bare-metal servers (CentOS) running Kubernetes 1.17.3.

To analyze failure scenarios I shut down one node while my test pod (a MySQL client) was reading data from the DB cluster every 5 seconds. I observed that a couple of client requests timed out, but mostly it works fine (roughly 99% of requests succeed).

If I take down one more node, then 5 or 6 out of 10 requests time out with "ERROR 2003 (HY000): Can't connect to MySQL server on 'pxc' (110)".
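For context, the test pod runs roughly the following loop (the host name pxc is taken from the error above; port and credentials are placeholders):

```sh
# Query the cluster through the "pxc" service every 5 seconds and log failures.
while true; do
  mysql -h pxc -P 3306 -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SELECT 1;" \
    || echo "$(date) -- query failed"
  sleep 5
done
```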

I have tried the following:

  1. Adding a PodDisruptionBudget with minAvailable: 1 (sketched below).
  2. Creating a NodePort service; in fact this also shows intermittent behavior, meaning that if I connect directly to a specific node it sometimes times out.
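For reference, the PodDisruptionBudget I added looks roughly like this (the label selector is a placeholder and has to match the actual PXC pod labels):

```yaml
apiVersion: policy/v1beta1      # policy/v1 is not yet available on Kubernetes 1.17
kind: PodDisruptionBudget
metadata:
  name: pxc-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: pxc                  # placeholder; must match the PXC pod labels
```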

It looks like this is due to the StatefulSet, as the Service seems unable to identify the failed replica. I have tried a regular Service, a headless Service, and a NodePort.
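The headless Service I tried looks roughly like this (the name and selector are placeholders for the actual ones):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pxc
spec:
  clusterIP: None        # headless: DNS resolves directly to the pod IPs
  ports:
    - name: mysql
      port: 3306
  selector:
    app: pxc             # placeholder; must match the PXC pod labels
```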

Please suggest what kind of Service can help in this regard. How is the community using StatefulSets in production?

I searched the net but couldn't find a satisfactory answer.

PS: It is a very rare situation, but it can happen that two out of three nodes are down.


Hello @nethra,

thank you for submitting this. To help you here I would need a bit more details:

  1. Your cr.yaml
  2. Output of kubectl get sts YOUR_STS -o yaml when you see timeouts
  3. Output of kubectl get nodes and kubectl get pods -n PXC_NAMESPACE when you see timeouts

There are many variables here that can cause this behavior, but there is nothing specific about StatefulSets - they use regular primitives: ReplicaSets, Services, Endpoints, etc.
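As a quick check, when you see the timeouts you can also verify whether the Endpoints object behind your Service still lists pods from the node that went down (the Service name pxc and PXC_NAMESPACE are placeholders):

```sh
# Do the Service endpoints still include pods scheduled on the downed node?
kubectl get endpoints pxc -n PXC_NAMESPACE -o yaml
kubectl get pods -n PXC_NAMESPACE -o wide   # -o wide shows the node each pod runs on
```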

Normally the traffic flows like this:

  1. The query is routed through an external load balancer (e.g. ELB)
  2. It gets into k8s through a Service object on any node (kube-proxy does the magic)
  3. The Service resource routes the traffic to the pods of the proxy (HAProxy/ProxySQL)
  4. The proxy routes the traffic to the pods
We need to understand which step is causing the timeouts. I would suspect either the external load balancer or kube-proxy routing the traffic to dead nodes.
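One rough way to check the NodePort path is to probe the MySQL port on every node address and see whether the timeouts line up with the node you shut down (the NodePort 30306 is a placeholder for your actual port):

```sh
# Probe the NodePort on each node's InternalIP; a dead node should time out here.
for node_ip in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo -n "$node_ip: "
  timeout 3 bash -c "echo > /dev/tcp/$node_ip/30306" && echo "open" || echo "timed out"
done
```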
