Hi there,
We are trying to set up a cluster of 3 instances via Percona Operator for MySQL 3.0 on AWS spot instances with gp3 storage attached. Setup and data import work fine, and checksums on all servers are consistent. However, when a spot instance goes down and the pod comes up on another node, the data becomes inconsistent: many tables are plain empty on that replica. Once too many replicas have been hit with a restart, the cluster goes down completely. The setup was done via the Helm chart and Argo CD; apart from changing the passwords in the secrets, no customization was made. Does anyone have any clues?
Best regards!
Hi @tobias.radszuweit,
When the spot instance is terminated and the pod is scheduled to another node, it should clone the data from a healthy replica. There is a log file, /var/lib/mysql/bootstrap.log, that records this process. What do you see in this file for the affected pod?
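If it helps, the log can be read straight from the pod, e.g. (pod and namespace names taken from this thread; the container name `mysql` is an assumption based on the default chart layout):

```shell
# Read the bootstrap log of the affected pod.
# Namespace/pod as in this thread; container name "mysql" is an assumption.
kubectl exec -n percona ps-db-mysql-1 -c mysql -- cat /var/lib/mysql/bootstrap.log
```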
Hi @Ege_Gunes,
thanks very much for your reply. I had to reproduce it again; here is what happened from the start.
[2002][trad@t005: /home/trad]$ k get ps -n percona
NAME REPLICATION ENDPOINT STATE MYSQL ORCHESTRATOR HAPROXY ROUTER AGE
ps-db async ps-db-haproxy.percona ready 3 3 3 28m
[2004][trad@t005: /home/trad]$ k get pods -n percona
NAME READY STATUS RESTARTS AGE
ps-db-haproxy-0 2/2 Running 0 28m
ps-db-haproxy-1 2/2 Running 0 28m
ps-db-haproxy-2 2/2 Running 0 28m
ps-db-mysql-0 3/3 Running 0 30m
ps-db-mysql-1 3/3 Running 1 29m
ps-db-mysql-2 3/3 Running 1 28m
ps-db-orc-0 2/2 Running 0 30m
ps-db-orc-1 2/2 Running 0 29m
ps-db-orc-2 2/2 Running 1 29m
ps-operator-5489b77989-9289s 1/1 Running 0 31m
=> Data import running
- 400 tables in 4 databases, 20 GB data
=> Spot instance going away, new one is brought up
- CrashLoopBackOff was seen, but the pod eventually started
cat /var/lib/mysql/bootstrap.log
2022/11/30 16:05:48 Peers: [172-32-1-226.ps-db-mysql-unready.percona 172-32-1-230.ps-db-mysql-unready.percona]
2022/11/30 16:05:48 Primary: ps-db-mysql-0.ps-db-mysql.percona Replicas: [ps-db-mysql-1.ps-db-mysql.percona]
2022/11/30 16:05:48 FQDN: ps-db-mysql-1.ps-db-mysql.percona
2022/11/30 16:05:48 lookup ps-db-mysql-1 [172.32.1.230]
2022/11/30 16:05:48 PodIP: 172.32.1.230
2022/11/30 16:05:48 lookup ps-db-mysql-0.ps-db-mysql.percona [172.32.1.226]
2022/11/30 16:05:48 PrimaryIP: 172.32.1.226
2022/11/30 16:05:48 Donor: ps-db-mysql-0.ps-db-mysql.percona
2022/11/30 16:05:48 Opening connection to 172.32.1.230
2022/11/30 16:05:48 Clone required: true
2022/11/30 16:05:48 Checking if a clone in progress
2022/11/30 16:05:48 Clone in progress: false
2022/11/30 16:05:48 Cloning from ps-db-mysql-0.ps-db-mysql.percona
2022/11/30 16:05:50 Clone finished. Restarting container...
2022/11/30 16:06:18 Peers: [172-32-1-226.ps-db-mysql-unready.percona 172-32-1-230.ps-db-mysql-unready.percona]
2022/11/30 16:06:18 Primary: ps-db-mysql-0.ps-db-mysql.percona Replicas: [ps-db-mysql-1.ps-db-mysql.percona]
2022/11/30 16:06:18 FQDN: ps-db-mysql-1.ps-db-mysql.percona
2022/11/30 16:06:18 lookup ps-db-mysql-1 [172.32.1.230]
2022/11/30 16:06:18 PodIP: 172.32.1.230
2022/11/30 16:06:18 lookup ps-db-mysql-0.ps-db-mysql.percona [172.32.1.226]
2022/11/30 16:06:18 PrimaryIP: 172.32.1.226
2022/11/30 16:06:18 Donor: ps-db-mysql-0.ps-db-mysql.percona
2022/11/30 16:06:18 Opening connection to 172.32.1.230
2022/11/30 16:06:18 Clone required: false
2022/11/30 16:06:18 configuring replication
2022/11/30 16:06:18 bootstrap finished in 0.076536 seconds
2022/12/01 12:58:09 Peers: [172-32-1-145.ps-db-mysql-unready.percona 172-32-1-226.ps-db-mysql-unready.percona 172-32-1-245.ps-db-mysql-unready.percona]
2022/12/01 12:58:09 Primary: ps-db-mysql-0.ps-db-mysql.percona Replicas: [ps-db-mysql-1.ps-db-mysql.percona ps-db-mysql-2.ps-db-mysql.percona]
2022/12/01 12:58:09 FQDN: ps-db-mysql-1.ps-db-mysql.percona
2022/12/01 12:58:09 lookup ps-db-mysql-1 [172.32.1.245]
2022/12/01 12:58:09 PodIP: 172.32.1.245
2022/12/01 12:58:09 lookup ps-db-mysql-0.ps-db-mysql.percona [172.32.1.226]
2022/12/01 12:58:09 PrimaryIP: 172.32.1.226
2022/12/01 12:58:09 Donor: ps-db-mysql-2.ps-db-mysql.percona
2022/12/01 12:58:09 Opening connection to 172.32.1.245
2022/12/01 12:58:09 Clone required: true
2022/12/01 12:58:09 Checking if a clone in progress
2022/12/01 12:58:09 Clone in progress: false
2022/12/01 12:58:09 Cloning from ps-db-mysql-2.ps-db-mysql.percona
2022/12/01 12:58:30 Clone finished. Restarting container...
2022/12/01 12:58:49 Peers: [172-32-1-145.ps-db-mysql-unready.percona 172-32-1-226.ps-db-mysql-unready.percona 172-32-1-245.ps-db-mysql-unready.percona]
2022/12/01 12:58:49 Primary: ps-db-mysql-0.ps-db-mysql.percona Replicas: [ps-db-mysql-1.ps-db-mysql.percona ps-db-mysql-2.ps-db-mysql.percona]
2022/12/01 12:58:49 FQDN: ps-db-mysql-1.ps-db-mysql.percona
2022/12/01 12:58:49 lookup ps-db-mysql-1 [172.32.1.245]
2022/12/01 12:58:49 PodIP: 172.32.1.245
2022/12/01 12:58:49 lookup ps-db-mysql-0.ps-db-mysql.percona [172.32.1.226]
2022/12/01 12:58:49 PrimaryIP: 172.32.1.226
2022/12/01 12:58:49 Donor: ps-db-mysql-2.ps-db-mysql.percona
2022/12/01 12:58:49 Opening connection to 172.32.1.245
2022/12/01 12:58:49 Clone required: false
2022/12/01 12:58:49 bootstrap finished in 0.037973 seconds
[2003][trad@t005: /home/trad]$ k get ps -n percona
NAME REPLICATION ENDPOINT STATE MYSQL ORCHESTRATOR HAPROXY ROUTER AGE
ps-db async ps-db-haproxy.percona ready 3 3 3 20h
[2004][trad@t005: /home/trad]$ k get pods -n percona
NAME READY STATUS RESTARTS AGE
ps-db-haproxy-0 2/2 Running 0 20h
ps-db-haproxy-1 2/2 Running 0 4m50s
ps-db-haproxy-2 2/2 Running 0 5m26s
ps-db-mysql-0 3/3 Running 0 20h
ps-db-mysql-1 3/3 Running 3 4m50s
ps-db-mysql-2 3/3 Running 1 20h
ps-db-orc-0 2/2 Running 0 4m32s
ps-db-orc-1 2/2 Running 0 20h
ps-db-orc-2 2/2 Running 4 20h
ps-operator-5489b77989-ftzwr 1/1 Running 0 5m3s
Now pt-table-checksum again shows many tables out of sync, and some are partly or even completely empty.
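For reference, the checksum run was along these lines (host and credentials below are placeholders, not the actual values used):

```shell
# pt-table-checksum runs on the primary and verifies the results on each replica.
# h/u/p values are placeholders for this sketch.
pt-table-checksum h=ps-db-mysql-0.ps-db-mysql.percona,u=root,p=secret \
  --replicate=percona.checksums
```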
Best regards,
Tobias
Hi @tobias.radszuweit,
Thank you for sharing. I’ll try to reproduce this in my environment early next week.
Hi @Ege_Gunes,
thank you very much for your input. I might have an idea of what caused this; I’m reproducing it myself at the moment, so you might want to hold off on your attempts. I’ll share once I know more.
Greets!
Hi @tobias.radszuweit,
I haven’t started yet; please keep me in the loop if you have more info.
Cheers,
Problem solved.
Thou shalt not throw MyISAM into the mix. While this was clear to me for PXC, I hoped it would not be the case here, since Percona Server normally (still) supports MyISAM. As far as I understand, the CLONE plugin used to provision a new replica only copies InnoDB data, so any MyISAM tables come up empty after the pod is rescheduled. MyISAM support would have solved many of our current problems; too bad. For stupid people like me, what about a documentation update that states this, or better, disabling unsupported engines?
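For anyone hitting the same symptom, a quick way to find the offending tables (endpoint and credentials below are placeholders):

```shell
# List non-InnoDB user tables; these are the ones a CLONE-based
# re-provisioning will not copy to a new replica.
# Host and credentials are placeholders for this sketch.
mysql -h ps-db-haproxy.percona -u root -p -e "
  SELECT table_schema, table_name, engine
    FROM information_schema.tables
   WHERE engine <> 'InnoDB'
     AND table_schema NOT IN
         ('mysql','sys','information_schema','performance_schema');"
```

Converting each reported table is then a per-table `ALTER TABLE db.tbl ENGINE=InnoDB;`.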