Patroni Error CRITICAL: system ID mismatch, node belongs to a different cluster

I have installed Patroni on 3 nodes. When I started the 1st node I got the following error: "CRITICAL: system ID mismatch, node belongs to a different cluster".

I was able to fix the issue by running the following steps:

– Stop the Patroni service on all nodes in the cluster.

sudo systemctl stop patroni

– Remove the old cluster from etcd. You can do this by running the following command:

patronictl -c /etc/patroni/patroni.yml remove cluster_1

Please confirm the cluster name to remove: cluster_1
You are about to remove all information in DCS for cluster_1, please type: "Yes I am aware": Yes I am aware
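
As a sanity check after the remove, you can confirm that the old cluster keys are really gone from etcd. This is a minimal sketch, assuming the etcd v3 API and Patroni's usual key layout of /<namespace>/<scope>/ (which would be /percona_lab/cluster_1/ with the config shared later in this thread):

$ ETCDCTL_API=3 etcdctl --endpoints=http://192.168.8.31:2379 get --prefix /percona_lab/cluster_1/ --keys-only
# No output means the old cluster state, including the key that stores the
# system ID, has been removed from the DCS.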

When I then started Patroni it worked, and I had 1 node in my Patroni cluster list.

But when I try to add the other 2 nodes I get the same issue, and I am unable to resolve it.

I tried the following steps as well, without success:

Stop the Patroni service on the node:
systemctl stop patroni

Remove the data directory on the node. Be careful with this step, as it will delete all data on the node.
rm -rf /mnt/sdb/postgresql/*

Start the Patroni service on the node. This should re-create the PostgreSQL data directory from the leader and add the node to the cluster as a replica.

systemctl start patroni

Check the cluster status to ensure the node has been added successfully.

patronictl -c /etc/patroni/patroni.yml list
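
A check that may help narrow this down before wiping anything again: compare the local data directory's system identifier with the two IDs in Patroni's error message. This is a minimal sketch, assuming the PostgreSQL 15 binaries and the data_dir from the config shared further down:

$ sudo -u postgres /usr/pgsql-15/bin/pg_controldata -D /mnt/sdb/postgresql | grep "Database system identifier"
# The CRITICAL message prints both the system ID recorded in the DCS and the one
# Patroni sees locally, so this shows which side is holding the stale value.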

This is a New Install
$ which patroni
/usr/bin/patroni
$ cat /etc/redhat-release
AlmaLinux release 9.4 (Seafoam Ocelot)
$ patroni
Config is empty.


Any ideas would be welcome.

Hello,
Similar symptoms may occur if another postgres instance was already running on your server when you started the Patroni service for the first time.
For example, say you had an instance that was initialized manually and is running on port 5432, and you then started Patroni, telling it to initialize a cluster and start it on port 5432. Patroni will initialize the cluster, but it will not be able to start it, because another process is already listening on the port you chose. It will then try to connect to the running postgres instance that was initialized outside of Patroni, and will write that instance's different system ID to the DCS.

Always make sure there are no other PostgreSQL clusters running on your server, and make sure to choose an available port.

Also, sharing the Patroni log would be helpful to find out what exactly happened. If you do not have a logfile configured, you should be able to find it by running journalctl -u patroni as root. Sometimes the information you are looking for is also in the log of the postgres instance that was initialized by Patroni.
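
In practice, something along these lines should show whether a stray postgres is holding the port and what Patroni logged; this is just a sketch, adjust the port and unit name if yours differ:

$ ps -ef | grep [p]ostgres                        # any postmaster that Patroni did not start?
$ sudo ss -tlnp | grep ':5432'                    # what is listening on the PostgreSQL port?
$ sudo journalctl -u patroni --no-pager -n 100    # recent Patroni log entries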


Thanks for your reply. I will investigate and give feedback.

I did a complete new build. I have:

  • 3 etcd nodes and
  • 3 PostgreSQL nodes with Patroni

With the initial build only Node 3 works and Patroni runs fine; the other 2 nodes give this error: "CRITICAL: system ID mismatch, node node1 belongs to a different cluster".
I left Node 3 running and rebuilt Node 1 from scratch, but got the same issue when starting Patroni.

sudo systemctl status patroni
× patroni.service - Runners to orchestrate a high-availability PostgreSQL
Loaded: loaded (/etc/systemd/system/patroni.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Wed 2024-07-24 14:34:27 SAST; 11s ago
Duration: 1.675s
Process: 37377 ExecStart=/bin/patroni /etc/patroni/patroni.yml (code=exited, status=1/FAILURE)
Main PID: 37377 (code=exited, status=1/FAILURE)
CPU: 810ms

Jul 24 14:34:25 08-00-27-26-5E-D6 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,390 INFO: Selected new etcd server http://192.168.8.33:2379
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,462 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,573 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.
$ sudo journalctl -fu patroni
Jul 24 14:34:14 08-00-27-26-5E-D6 patroni[37366]: 2024-07-24 14:34:14,153 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:14 08-00-27-26-5E-D6 patroni[37366]: 2024-07-24 14:34:14,355 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:14 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:14 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.
Jul 24 14:34:25 08-00-27-26-5E-D6 systemd[1]: Started Runners to orchestrate a high-availability PostgreSQL.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,390 INFO: Selected new etcd server http://192.168.8.33:2379
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,462 INFO: No PostgreSQL configuration items changed, nothing to reload.
Jul 24 14:34:26 08-00-27-26-5E-D6 patroni[37377]: 2024-07-24 14:34:26,573 CRITICAL: system ID mismatch, node node1 belongs to a different cluster: 7395135091342194422 != 7395176397735409202
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 14:34:27 08-00-27-26-5E-D6 systemd[1]: patroni.service: Failed with result 'exit-code'.

$ sudo patronictl -c /etc/patroni/patroni.yml list

+ Cluster: cluster_1 (7395135091342194422) -----+----+-----------+-----------------+------------------------+
| Member | Host         | Role   | State   | TL | Lag in MB | Pending restart | Pending restart reason |
+--------+--------------+--------+---------+----+-----------+-----------------+------------------------+
| node3  | 192.168.8.43 | Leader | running |  4 |           | *               | max_wal_senders: 10->5 |
+--------+--------------+--------+---------+----+-----------+-----------------+------------------------+

$ which patroni
/usr/bin/patroni
$ cat /etc/redhat-release
AlmaLinux release 9.4 (Seafoam Ocelot)
$ etcdctl --write-out=table --endpoints=http://192.168.8.31:2379 member list
+------------------+---------+-------+--------------------------+--------------------------+------------+
|        ID        | STATUS  | NAME  |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+-------+--------------------------+--------------------------+------------+
| 82a8a38ad0f79276 | started | node3 | http://192.168.8.33:2380 | http://192.168.8.33:2379 |      false |
| dfc606379f2d3069 | started | node1 | http://192.168.8.31:2380 | http://192.168.8.31:2379 |      false |
| f79b10d525fb3c49 | started | node2 | http://192.168.8.32:2380 | http://192.168.8.32:2379 |      false |
+------------------+---------+-------+--------------------------+--------------------------+------------+

Copy of Config for Node 1

namespace: percona_lab
scope: cluster_1
name: node1

restapi:
    listen: 0.0.0.0:8008
    connect_address: 192.168.8.41:8008

etcd3:
#    host: fe80::a00:27ff:fedc:1231%enp0s8:2379
     hosts: 192.168.8.31:2379,192.168.8.32:2379,192.168.8.33:2379

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
      ttl: 30
      loop_wait: 10
      retry_timeout: 10
      maximum_lag_on_failover: 1048576
      slots:
          percona_cluster_1:
            type: physical

      postgresql:
          use_pg_rewind: true
          use_slots: true
          parameters:
              wal_level: replica
              hot_standby: on
              wal_keep_segments: 10
              max_wal_senders: 5
              max_replication_slots: 10
              wal_log_hints: on
              logging_collector: 'on'

  # some desired options for 'initdb'
  initdb: # Note: It needs to be a list (some options need values, others are switches)
      - encoding: UTF8
      - data-checksums

  pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
      - host replication replicator 127.0.0.1/32 trust
      - host replication replicator 0.0.0.0/0 md5
      - host all all 0.0.0.0/0 md5
      - host all all ::0/0 md5

  # Some additional users which needs to be created after initializing new cluster
  users:
      admin:
          password: qaz123
          options:
              - createrole
              - createdb
      percona:
          password: qaz123
          options:
              - createrole
              - createdb

postgresql:
    cluster_name: cluster_1
    listen: 0.0.0.0:5432
    connect_address: 192.168.8.41:5432
    data_dir: /mnt/sdb/postgresql
    bin_dir: /usr/pgsql-15/bin
    pgpass: /tmp/pgpass
    authentication:
        replication:
            username: replicator
            password: replPasswd
        superuser:
            username: postgres
            password: qaz123
    parameters:
        unix_socket_directories: /var/run/postgresql/
    create_replica_methods:
        - basebackup
    basebackup:
        checkpoint: 'fast'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

Copy of Config for Node 3

namespace: percona_lab
scope: cluster_1
name: node3

restapi:
    listen: 0.0.0.0:8008
    connect_address: 192.168.8.43:8008

etcd3:
#    host: fe80::a00:27ff:fe8e:9102%enp0s3:2379
    hosts: 192.168.8.31:2379,192.168.8.32:2379,192.168.8.33:2379


bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
      ttl: 30
      loop_wait: 10
      retry_timeout: 10
      maximum_lag_on_failover: 1048576
      slots:
          percona_cluster_1:
            type: physical

      postgresql:
          use_pg_rewind: true
          use_slots: true
          parameters:
              wal_level: replica
              hot_standby: on
              wal_keep_segments: 10
              max_wal_senders: 5
              max_replication_slots: 10
              wal_log_hints: on
              logging_collector: 'on'

  # some desired options for 'initdb'
  initdb: # Note: It needs to be a list (some options need values, others are switches)
      - encoding: UTF8
      - data-checksums

  pg_hba: # Add following lines to pg_hba.conf after running 'initdb'
      - host replication replicator 127.0.0.1/32 trust
      - host replication replicator 0.0.0.0/0 md5
      - host all all 0.0.0.0/0 md5
      - host all all ::0/0 md5

  # Some additional users which needs to be created after initializing new cluster
  users:
      admin:
          password: qaz123
          options:
              - createrole
              - createdb
      percona:
          password: qaz123
          options:
              - createrole
              - createdb

postgresql:
    cluster_name: cluster_1
    listen: 0.0.0.0:5432
    connect_address: 192.168.8.43:5432
    data_dir: /mnt/sdb/postgresql
    bin_dir: /usr/pgsql-15/bin
    pgpass: /tmp/pgpass
    authentication:
        replication:
            username: replicator
            password: replPasswd
        superuser:
            username: postgres
            password: qaz123
    parameters:
        unix_socket_directories: /var/run/postgresql/
    create_replica_methods:
        - basebackup
    basebackup:
        checkpoint: 'fast'

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false
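
As a side note, when comparing the two node configs, the only intended differences should be the per-node fields. A quick sketch for spotting anything else, assuming both files have been copied to one host under hypothetical names:

$ diff node1-patroni.yml node3-patroni.yml
# Expected differences: name, restapi.connect_address, postgresql.connect_address.
# Anything else (namespace, scope, data_dir, etcd3.hosts, bootstrap settings)
# should be identical on all nodes.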

I have an update, and to be honest I am more confused than ever. I have tried a lot of things, including removing the cluster and renaming the cluster:

  • namespace: percona_lab2
  • scope: cluster_2
All of this only ever got 1 of the nodes working. It started with Node 3 working and Node 1 and 2 having errors, then it was Node 2, and now it is Node 1. At one stage it looked promising, but after a restart I was back to square 1.


Hey,
Just to make sure you are following the proper procedure.
To properly remove the Patroni configuration from etcd and initialize a new cluster, you should:

  1. Stop the Patroni services on all nodes.
  2. Run patronictl -c /etc/patroni/patroni.yml remove
  3. Remove the contents of the "/mnt/sdb/postgresql" directory on all servers, and double-check that it was removed.
  4. Run ps -ef | grep postgres on all servers to make sure there is no other postgres running on them, and if there is, verify that it is not listening on port 5432.
  5. Start Patroni on one of the nodes and verify that it bootstrapped properly and started as the leader.
  6. Start the Patroni services on the other nodes.

Could you try this approach and share the result?
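
A consolidated sketch of that procedure, using the paths and cluster name from the configs above (the per-node parts run on every node, the patronictl remove only once while Patroni is stopped everywhere):

# On every node:
sudo systemctl stop patroni

# On one node only, once Patroni is stopped everywhere:
sudo patronictl -c /etc/patroni/patroni.yml remove cluster_1

# On every node: wipe the data directory, verify it is empty, and make sure
# no other postgres is running or listening on port 5432.
sudo rm -rf /mnt/sdb/postgresql/*
sudo ls -la /mnt/sdb/postgresql
ps -ef | grep [p]ostgres
sudo ss -tlnp | grep ':5432'

# Start Patroni on ONE node first and confirm it bootstraps as the leader:
sudo systemctl start patroni
sudo patronictl -c /etc/patroni/patroni.yml list

# Only then start Patroni on the remaining nodes.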


Every PostgreSQL instance has its own system ID, which will be recorded in its control file.

This can be checked using the pg_controldata utility by passing the data directory as an argument; for example,

-bash-4.2$ pg_controldata -D /var/lib/pgsql/16/data/
pg_control version number:            1300
Catalog version number:               202307071
Database system identifier:           7384000357691318611
Database cluster state:               shut down
pg_control last modified:             Mon 22 Jul 2024 11:14:38 AM UTC

...

In a Patroni cluster, this is stored inside the DCS (etcd) to ensure that Patroni is bringing up the right PostgreSQL instance. The system ID will be the same across all nodes of a replication cluster (primary and standbys).

If etcd is not clean, or contains the system ID of a previously used PostgreSQL instance, the system IDs won't match after a rebuild or backup restoration. This is a protection mechanism to ensure that Patroni is dealing with the right cluster, and it is independent of the restoration tool.

So whenever nodes are reused to form a new cluster, it’s a good idea to ensure that the etcd (DCS) is clean.

patronictl remove <cluster-name>

If there is no system ID entry in the DCS, Patroni will make fresh entries on startup.
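
To see that entry directly, you can read the key Patroni uses for the system ID. A minimal sketch, assuming the etcd v3 API and that the value is kept in the initialize key under /<namespace>/<scope>/ (so /percona_lab/cluster_1/initialize for the configs in this thread):

$ ETCDCTL_API=3 etcdctl --endpoints=http://192.168.8.31:2379 get /percona_lab/cluster_1/initialize
# The stored value should match the "Database system identifier" from pg_controldata
# on every member; if the key is absent, Patroni writes a fresh one at bootstrap.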


Thanks, I am fairly sure these are the steps I used, but I will follow them again and see if by any chance I missed something.


Thanks, this was one of the steps I ran when I changed the cluster name to cluster_2 to see if that would work. But I am going to look at everything again and let you know if I am able to resolve the issue. I see the command I ran looks a bit different and will confirm; this was the command I used: patronictl -c /etc/patroni/patroni.yml remove cluster_1


Thanks for all your help @mateusz.henicz and @lalit.choudhary.

I found that when running the remove command, I had to elevate to root, otherwise it did not delete all the content.

My Patroni Cluster Lab is now running.


Again, thanks so much for your help; it's much appreciated.


Hi @mateusz.henicz and @lalit.choudhary
I have configured a Patroni cluster with 3 DB nodes and 3 etcd nodes. However, I can only see one node in the etcd member list, and the cluster sometimes goes into an uninitialized state. I would appreciate it if you could provide a solution for this issue.

I have completely removed the cluster and added a new cluster name, but I am still experiencing the same issue.
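
One thing worth checking first is whether all three etcd members actually joined the same etcd cluster and are healthy, since an uninitialized Patroni cluster often points at DCS trouble. A sketch, assuming the etcd v3 API; substitute your own endpoints:

$ export ETCDCTL_API=3
$ ENDPOINTS=http://192.168.8.31:2379,http://192.168.8.32:2379,http://192.168.8.33:2379   # replace with your etcd nodes
$ etcdctl --endpoints=$ENDPOINTS member list --write-out=table
$ etcdctl --endpoints=$ENDPOINTS endpoint health
$ etcdctl --endpoints=$ENDPOINTS endpoint status --write-out=table
# If member list shows only one member, the other etcd nodes were most likely
# started as independent single-member clusters instead of joining this one.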