I have installed Patroni on 3 nodes. When I started the 1st node I got the following error: "CRITICAL: system ID mismatch, node belongs to a different cluster".
I ran the following steps to fix the issue:
– Stop the Patroni service on all nodes in the cluster.
sudo systemctl stop patroni
– Remove the old cluster from etcd. You can do this by running patronictl -c /etc/patroni/patroni.yml remove cluster_1 and answering the prompts:
Please confirm the cluster name to remove: cluster_1
You are about to remove all information in DCS for cluster_1, please type: "Yes I am aware": Yes I am aware
When I then started Patroni it worked, and I had 1 node in my Patroni cluster list.
But when I try to add the other 2 nodes I get the same error, and I am unable to resolve it.
I tried the following steps as well, without success:
Stop the Patroni service on the node:
systemctl stop patroni
Remove the data directory on the node. Be careful with this step, as it deletes all data on the node.
rm -rf /mnt/sdb/postgresql/*
Start the Patroni service on the node. This should initialize the PostgreSQL instance again and add it to the cluster as a replica.
systemctl start patroni
Check the cluster status to ensure the node has been added successfully.
patronictl -c /etc/patroni/patroni.yml list
This is a new install.
$ which patroni
/usr/bin/patroni
$ cat /etc/redhat-release
AlmaLinux release 9.4 (Seafoam Ocelot)
$ patroni
Config is empty.
Hello,
Similar symptoms may occur if another postgres instance was already running on your server when you started the Patroni service for the first time.
For example, suppose an instance that was initialized manually is running on port 5432, and you then start Patroni, telling it to initialize a cluster and run it on port 5432. Patroni will initialize the new instance, but it will not be able to start it, because another process is already listening on that port. It will then try to connect to the running postgres instance that was initialized outside of Patroni, and will write that different cluster's system ID to the DCS.
Always make sure there are no other PostgreSQL clusters running on your server, and make sure to choose an available port.
Also, sharing the Patroni log would help to find out what exactly happened. If you do not have a logfile configured, you should be able to see it by running journalctl -u patroni as root. Sometimes the information you are looking for is also in the postgres log of the instance that Patroni initialized.
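For example, a quick check could look like the sketch below (assuming the default port 5432 and the patroni systemd unit used in this thread):
# look for any postgres processes that Patroni did not start
ps -ef | grep [p]ostgres
# check what is listening on the PostgreSQL port from patroni.yml (5432 here)
sudo ss -tlnp | grep 5432
# review recent Patroni service logs when no logfile is configured
sudo journalctl -u patroni --since "1 hour ago"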
3 PostgreSQL nodes with Patroni
With the initial build only Node 3 works and Patroni runs perfectly; the other 2 nodes give this error: "CRITICAL: system ID mismatch, node node1 belongs to a different cluster".
I left Node 3 running and rebuilt Node 1 from scratch, but I hit the same issue when starting Patroni.
sudo systemctl status patroni
× patroni.service - Runners to orchestrate a high-availability PostgreSQL
Loaded: loaded (/etc/systemd/system/patroni.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Wed 2024-07-24 14:34:27 SAST; 11s ago
Duration: 1.675s
Process: 37377 ExecStart=/bin/patroni /etc/patroni/patroni.yml (code=exited, status=1/FAILURE)
Main PID: 37377 (code=exited, status=1/FAILURE)
CPU: 810ms
Jul 24 14:34:25 08-00-27-26-5E-D6 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,390 INFO: Selected new etcd server http://192.168.8.33:2379
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,462 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,573 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.
$ sudo journalctl -fu patroni
Jul 24 14:34:14 08-00-27-26-5E-D6 patroni[37366]: 2024-07-24 14:34:14,153 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:14 08-00-27-26-5E-D6 patroni[37366]: 2024-07-24 14:34:14,355 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:14 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:14 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 24 14:34:25 08-00-27-26-5E-D6 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,390 INFO: Selected new etcd server http://192.168.8.33:2379
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,462 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,573 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.
$ sudo patronictl -c /etc/patroni/patroni.yml list
+ Cluster: cluster_1 (7395135091342194422) +----+-----------+-----------------+------------------------+
| Member | Host         | Role   | State   | TL | Lag in MB | Pending restart | Pending restart reason |
+--------+--------------+--------+---------+----+-----------+-----------------+------------------------+
| node3  | 192.168.8.43 | Leader | running |  4 |           | *               | max_wal_senders: 10->5 |
+--------+--------------+--------+---------+----+-----------+-----------------+------------------------+
$ which patroni
/usr/bin/patroni
$ cat /etc/redhat-release
AlmaLinux release 9.4 (Seafoam Ocelot)
$ etcdctl --write-out=table --endpoints=http://192.168.8.31:2379 member list
+------------------+---------+-------+--------------------------+--------------------------+------------+
| ID               | STATUS  | NAME  | PEER ADDRS               | CLIENT ADDRS             | IS LEARNER |
+------------------+---------+-------+--------------------------+--------------------------+------------+
| 82a8a38ad0f79276 | started | node3 | http://192.168.8.33:2380 | http://192.168.8.33:2379 | false      |
| dfc606379f2d3069 | started | node1 | http://192.168.8.31:2380 | http://192.168.8.31:2379 | false      |
| f79b10d525fb3c49 | started | node2 | http://192.168.8.32:2380 | http://192.168.8.32:2379 | false      |
+------------------+---------+-------+--------------------------+--------------------------+------------+
Copy of Config for Node 1
namespace: percona_lab
scope: cluster_1
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.8.41:8008

etcd3:
  # host: fe80::a00:27ff:fedc:1231%enp0s8:2379
  hosts: 192.168.8.31:2379,192.168.8.32:2379,192.168.8.33:2379

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    slots:
      percona_cluster_1:
        type: physical
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: on
        wal_keep_segments: 10
        max_wal_senders: 5
        max_replication_slots: 10
        wal_log_hints: on
        logging_collector: 'on'

  # some desired options for 'initdb'
  initdb: # Note: It needs to be a list (some options need values, others are switches)
    - encoding: UTF8
    - data-checksums

  pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
    - host replication replicator 127.0.0.1/32 trust
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
    - host all all ::0/0 md5

  # Some additional users which need to be created after initializing new cluster
  users:
    admin:
      password: qaz123
      options:
        - createrole
        - createdb
    percona:
      password: qaz123
      options:
        - createrole
        - createdb

postgresql:
  cluster_name: cluster_1
  listen: 0.0.0.0:5432
  connect_address: 192.168.8.41:5432
  data_dir: /mnt/sdb/postgresql
  bin_dir: /usr/pgsql-15/bin
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replPasswd
    superuser:
      username: postgres
      password: qaz123
  parameters:
    unix_socket_directories: /var/run/postgresql/
  create_replica_methods:
    - basebackup
  basebackup:
    checkpoint: 'fast'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
Copy of Config for Node 3
namespace: percona_lab
scope: cluster_1
name: node3

restapi:
  listen: 0.0.0.0:8008
  connect_address: 192.168.8.43:8008

etcd3:
  # host: fe80::a00:27ff:fe8e:9102%enp0s3:2379
  hosts: 192.168.8.31:2379,192.168.8.32:2379,192.168.8.33:2379

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    slots:
      percona_cluster_1:
        type: physical
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: on
        wal_keep_segments: 10
        max_wal_senders: 5
        max_replication_slots: 10
        wal_log_hints: on
        logging_collector: 'on'

  # some desired options for 'initdb'
  initdb: # Note: It needs to be a list (some options need values, others are switches)
    - encoding: UTF8
    - data-checksums

  pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
    - host replication replicator 127.0.0.1/32 trust
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
    - host all all ::0/0 md5

  # Some additional users which need to be created after initializing new cluster
  users:
    admin:
      password: qaz123
      options:
        - createrole
        - createdb
    percona:
      password: qaz123
      options:
        - createrole
        - createdb

postgresql:
  cluster_name: cluster_1
  listen: 0.0.0.0:5432
  connect_address: 192.168.8.43:5432
  data_dir: /mnt/sdb/postgresql
  bin_dir: /usr/pgsql-15/bin
  pgpass: /tmp/pgpass
  authentication:
    replication:
      username: replicator
      password: replPasswd
    superuser:
      username: postgres
      password: qaz123
  parameters:
    unix_socket_directories: /var/run/postgresql/
  create_replica_methods:
    - basebackup
  basebackup:
    checkpoint: 'fast'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
I have an update, and to be honest I am more confused than ever. I have tried a lot of things, like removing the cluster and renaming the cluster:
namespace: percona_lab2
scope: cluster_2
All of this only ever got 1 of the nodes to work. It started with Node 3 working and Nodes 1 and 2 failing with errors, then it was Node 2, and now it is Node 1. At one stage it looked promising, but after a restart I was back to square one.
Hey,
Just to make sure you are following the proper procedure.
To remove the Patroni configuration from etcd properly and initialize a new cluster, you should (see the command sketch after this list):
Stop Patroni services on all nodes.
Run patronictl -c /etc/patroni/patroni.yml remove
Remove the contents of the "/mnt/sdb/postgresql" directory on all servers, and double-check that it was removed.
Run ps -ef | grep postgres on all servers to make sure there is no other postgres running on them; if there is, verify that it is not listening on port 5432.
Start Patroni on one of the nodes and verify that it bootstrapped properly and started as the leader.
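A minimal command sketch of the above, assuming the cluster name, config path, and data directory used in this thread:
# on all nodes: stop Patroni
sudo systemctl stop patroni

# on one node: remove the cluster state from the DCS (answer the confirmation prompts)
sudo patronictl -c /etc/patroni/patroni.yml remove cluster_1

# on all nodes: wipe the PostgreSQL data directory and confirm it is empty
sudo rm -rf /mnt/sdb/postgresql/*
ls -la /mnt/sdb/postgresql

# on all nodes: check that no other postgres is running or holding port 5432
ps -ef | grep [p]ostgres
sudo ss -tlnp | grep 5432

# on one node only: start Patroni and confirm it bootstraps as the leader
sudo systemctl start patroni
sudo patronictl -c /etc/patroni/patroni.yml list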
Every PostgreSQL instance has its own system ID, which is recorded in its control file.
This can be checked using the pg_controldata utility by passing the data directory as an argument; for example,
-bash-4.2$ pg_controldata -D /var/lib/pgsql/16/data/
pg_control version number: 1300
Catalog version number: 202307071
Database system identifier: 7384000357691318611
Database cluster state: shut down
pg_control last modified: Mon 22 Jul 2024 11:14:38 AM UTC
...
In a Patroni cluster, this system ID is also stored in the DCS (etcd) to ensure that Patroni is bringing up the right PostgreSQL instance. The system ID is the same across all nodes of a replication cluster (primary and standbys).
If etcd is not clean or contains the system ID of a previously used PostgreSQL instance, the system IDs won't match after a backup restoration. This is a protection mechanism to ensure that Patroni is dealing with the right cluster, and it is independent of the restore tool.
So whenever nodes are reused to form a new cluster, it's a good idea to ensure that etcd (the DCS) is clean:
patronictl remove <cluster-name>
If no system ID exists in the DCS, Patroni will create fresh entries on startup.
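To compare the two IDs directly, something like the following sketch can help. The /percona_lab/cluster_1/ key prefix is assumed from the namespace and scope in the configs above, and the exact key layout can vary between Patroni versions:
# system ID recorded in the local data directory
sudo -u postgres /usr/pgsql-15/bin/pg_controldata -D /mnt/sdb/postgresql | grep "system identifier"

# system ID recorded by Patroni in etcd (typically under an "initialize" key)
etcdctl --endpoints=http://192.168.8.31:2379 get --prefix --keys-only /percona_lab/cluster_1/
etcdctl --endpoints=http://192.168.8.31:2379 get /percona_lab/cluster_1/initialize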
Thanks, this was one of the steps I ran when I changed the cluster name to cluster_2 to see if that would work. I am going to go through everything again and let you know if I am able to resolve the issue. I see the command I ran looks a bit different, I will confirm; this was the command I used: patronictl -c /etc/patroni/patroni.yml remove cluster_1
Hi @mateusz.henicz and @lalit.choudhary
I have configured a Patroni cluster with 3 DB nodes and 3 etcd nodes. However, I can only see one node in the etcd member list, and the cluster sometimes goes into an uninitialized state. I would appreciate it if you could provide a solution for this issue.
I have completely removed the cluster and added a new cluster name, but I am still experiencing the same issue.