Cannot restore from incremental backup. PBM and PSMDB in the same pod

Hi all,

I’m struggling to restore from an incremental backup. I’ve deployed an application to an OpenShift cluster with 2 containers inside the same pod so that they can reach each other. I can take a backup as expected from the PBM container, but I’m unable to restore the database from the incremental backup, as the PBM agent cannot reach mongod because it’s in another container.

sh-4.4$ pbm restore 2024-01-26T07:57:50Z
Starting restore 2024-01-26T08:38:17.422755382Z from '2024-01-26T07:57:50Z'. Error: check mongod binary: run: exec: "mongod": executable file not found in $PATH. stderr:
- Restore on replicaset "rs0" in state: error: check mongod binary: run: exec: "mongod": executable file not found in $PATH. stderr:
2024-01-26T08:38:01.000+0000 I [pitr] got done signal, stopping
2024-01-26T08:38:06.000+0000 I [pitr] created chunk 2024-01-26T08:28:31 - 2024-01-26T08:38:01
2024-01-26T08:38:06.000+0000 I [pitr] pausing/stopping with last_ts 2024-01-26 08:38:01 +0000 UTC
2024-01-26T08:38:17.000+0000 I got command restore [name: 2024-01-26T08:38:17.422755382Z, snapshot: 2024-01-26T07:57:50Z] <ts: 1706258297>
2024-01-26T08:38:17.000+0000 I got epoch {1706258271 2}
2024-01-26T08:38:17.000+0000 I [restore/2024-01-26T08:38:17.422755382Z] backup: 2024-01-26T07:57:50Z
2024-01-26T08:38:17.000+0000 I [restore/2024-01-26T08:38:17.422755382Z] recovery started
2024-01-26T08:38:17.000+0000 D [restore/2024-01-26T08:38:17.422755382Z] port: 27637
2024-01-26T08:38:17.000+0000 E [restore/2024-01-26T08:38:17.422755382Z] restore: check mongod binary: run: exec: "mongod": executable file not found in $PATH. stderr: 
2024-01-26T08:38:17.000+0000 D [restore/2024-01-26T08:38:17.422755382Z] hearbeats stopped
2024-01-26T08:39:16.000+0000 D [pitr] start_catchup [oplog only]

How can I redirect traffic to the PSMDB container to stop the mongo daemon? It’s located in the /usr/bin folder.

I’d like to mention that I cannot use the Operator, as I need MongoDB 7 and incremental backups.

Thanks in advance for your assistance.

Hi @Iliterallyneedhelp ,

Restore on replicaset “rs0” in state: error: check mongod binary: run: exec: “mongod”: executable file not found in $PATH. stderr:

The error means that pbm-agent performed its pre-checks and couldn’t find the mongod binary in its $PATH. The “actual” restore hasn’t started yet.
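You can reproduce that pre-check by hand inside the PBM Agent container. This is just a minimal sketch using POSIX `command -v`, which mirrors the kind of $PATH lookup the agent performs:

```shell
#!/bin/sh
# Check whether a mongod binary is resolvable via $PATH,
# the same condition the pbm-agent pre-check trips over.
if command -v mongod >/dev/null 2>&1; then
    echo "mongod found: $(command -v mongod)"
else
    echo "mongod NOT found in \$PATH - the restore pre-check will fail"
fi
```

If the second branch fires, fixing the image or the $PATH is the first step before retrying the restore.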

During a physical restore, the PBM agents shut down (remotely) the mongod process by sending db.adminCommand("shutdown"), copy the backup files to the dbpath, and exec mongod --dbpath=[...] to perform the “post-restore” actions.
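Conceptually, that sequence corresponds to something like the following. This is a sketch only, not the agent’s actual implementation; the backup path is a placeholder and the port is taken loosely from the log above:

```shell
# 1. Ask the running mongod to stop (the agent does this remotely)
mongo "$PBM_MONGODB_URI" --eval 'db.adminCommand("shutdown")'

# 2. Copy the backup files into the dbpath (both paths are placeholders)
cp -a /backup/2024-01-26T07:57:50Z/. /data/db/

# 3. Start a temporary mongod on an ephemeral port to run
#    post-restore actions (the log above shows port 27637)
mongod --dbpath /data/db --port 27637
```

Step 3 is exactly where the `exec: "mongod": executable file not found` error comes from: the agent must be able to start mongod itself.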

Make sure:

  • The PBM Agent container has the same mongod binary as the PSMDB container, and its $PATH contains mongod (i.e., pbm-agent can exec a mongod process)
  • The PSMDB dbpath is a mounted volume (i.e., it is accessible to other containers)
  • The PBM Agent container has the same volume mounted as the PSMDB container (i.e., it can read/write the PSMDB dbpath content)
  • PBM Agent runs under the same User ID or Group ID as the PSMDB mongod process (i.e., it has the same read/write permissions for the dbpath)
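The checklist above can be verified from a shell inside the PBM Agent container. A sketch, where `/data/db` is an assumed dbpath that you should adjust to your actual mount point:

```shell
#!/bin/sh
# Assumed dbpath; change to your actual mount point.
DBPATH=/data/db

# 1. Same mongod binary, resolvable via $PATH
command -v mongod >/dev/null 2>&1 \
    && mongod --version | head -n 1 \
    || echo "mongod missing from PATH"

# 2/3. dbpath mounted and visible to this container
ls -ld "$DBPATH" 2>/dev/null || echo "dbpath not mounted into this container"

# 4. UID/GID should match what mongod runs under in the PSMDB container
echo "running as uid=$(id -u) gid=$(id -g)"
```

Compare the reported version and UID/GID against the PSMDB container’s values; any mismatch points at the failing checklist item.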

Hi @Dmytro_Zghoba

Thanks for your answer. I finally realized why mongod is required in the PBM Agent container, and the point below was indeed the issue. I used your response from the past as an example:

PBM Agent container has the same mongod binary as the PSMDB container. And its $PATH contains the mongod (i.e., pbm-agent can exec mongod process)

I’m just curious about one thing.

mongod is the process that keeps the container alive, but the container needs to stop it in order to run the restore. Additionally, after stopping mongod, the PBM Agent container is going to shut down within 5 minutes. I have a large database, about 150 GB, so the restore is going to take a while.

Could you please advise what the best approach is to keep these 2 containers alive while the restore is in progress?

Thanks once more Dmytro

For restore purposes, I’d change the restart policy and liveness checks for the containers/pod, and make the pod run with the PBM container only.
At the end of the physical restore, the PBM container also stops, and you will need to start the whole cluster manually.

It may look like:

  1. Delete the Services/Endpoints (i.e., stop client connections)
  2. Restart the cluster with updated pod/container configs
  3. Run the restore
  4. (When the restore is done) run the cluster with the normal pod/container configs
  5. Create the Services/Endpoints again
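The steps above might look like this with the OpenShift CLI. All object names and manifest files here are placeholders for illustration, not taken from this thread:

```shell
# 1. Stop client connections by removing the Service (name is hypothetical)
oc delete service mongodb-server

# 2. Redeploy the pod with the PBM container only, restart policy and
#    liveness probes relaxed (manifest name is hypothetical)
oc apply -f pod-restore-mode.yaml

# 3. Run the restore from inside the PBM container
oc exec -it mongodb-pod -c pbm -- pbm restore 2024-01-26T07:57:50Z

# 4. When the restore finishes, bring back the normal pod definition
oc apply -f pod-normal.yaml

# 5. Recreate the Service so clients can connect again
oc apply -f service-mongodb.yaml
```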

You cannot run a partially started cluster, because data consistency and the internal cluster state would be broken. Do not allow any client connections until the cluster has fully recovered.

I suggest you look at Percona Operator for MongoDB. It is an open-source product and is available for free (Apache 2.0).

Hi again, @Dmytro_Zghoba

I’ve followed your recommendation, and I’m getting the logs below after editing start-agent.sh. I tried to disable the restart policy inside the PBM container.

pbm config --file pbm-config.yaml
Error: connect to mongodb: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongodb-server:27017, Type: Unknown, Last error: dial tcp 172.30.12.91:27017: connect: connection refused }, ] }
+ exec pbm-agent
2024/01/29 08:07:05 Exit: connect to PBM: create mongo connection: mongo ping: server selection error: server selection timeout, current topology: { Type: Single, Servers: [{ Addr: mongodb-server:27017, Type: Unknown, Last error: dial tcp 172.30.12.91:27017: connect: connection refused }, ] }

It’s kind of obvious that the mongodb server cannot be reached, as I followed your recommendation to run the PBM Agent container only. Which file should I edit, and how, to make the pbm commands work?

My current start-agent.sh looks like this:

#!/bin/bash

for argv; do
        if [[ -n "$use_next" ]]; then
                export PBM_MONGODB_URI="${argv}"
                break
        fi
        if [[ "$argv" == '--mongodb-uri' ]]; then
                use_next='true'
                # TODO should we check if last?
                continue
        elif [[ "$argv" == '--mongodb-uri='* ]]; then
                export PBM_MONGODB_URI="${argv#--mongodb-uri=}"
                break
        fi
done

# TODO should we check if all parts are set?
set +o xtrace
[[ -z "$PBM_MONGODB_URI" ]] && export PBM_MONGODB_URI="mongodb://${PBM_AGENT_MONGODB_USERNAME}:${PBM_AGENT_MONGODB_PASSWORD}@localhost:${PBM_MONGODB_PORT}/?replicaSet=${PBM_MONGODB_REPLSET}"
set -o xtrace

if [ "$RESTORE_MODE" != "true" ]; then
        if [ "${1:0:9}" = "pbm-agent" ]; then
                OUT="$(mktemp)"
                OUT_CFG="$(mktemp)"
                timeout=5
                for i in {1..10}; do
                        # ARM image doesn't contain mongo CLI, preliminary check is skipped, PBM will return error in case of connection failure
                        if [ ! -e "/usr/bin/mongo" ]; then
                                break
                        fi
                        if [ "${SHARDED}" ]; then
                                echo "waiting for sharded cluster"

                                # check in case if shard has role 'shardsrv'
                                set +o xtrace
                                mongo "${PBM_MONGODB_URI}" --eval="db.isMaster().\$configServerState.opTime.ts" --quiet | tee "$OUT"
                                set -o xtrace
                                exit_status=$?

                                # check in case if shard has role 'configsrv'
                                set +o xtrace
                                mongo "${PBM_MONGODB_URI}" --eval="db.isMaster().configsvr" --quiet | tail -n 1 | tee "$OUT_CFG"
                                set -o xtrace
                                exit_status_cfg=$?

                                ts=$(grep -E '^Timestamp\([0-9]+, [0-9]+\)$' "$OUT")
                                isCfg=$(grep -E '^2$' "$OUT_CFG")

                                if [[ "${exit_status}" == 0 && "${ts}" ]] || [[ "${exit_status_cfg}" == 0 && "${isCfg}" ]]; then
                                        break
                                else
                                        sleep "$((timeout * i))"
                                fi
                        else
                                set +o xtrace
                                mongo "${PBM_MONGODB_URI}" --eval="(db.isMaster().hosts).length" --quiet | tee "$OUT"
                                set -o xtrace
                                exit_status=$?
                                rs_size=$(grep -E '^([0-9]+)$' "$OUT")
                                if [[ "${exit_status}" == 0 ]] && [[ $rs_size -ge 1 ]]; then
                                        break
                                else
                                        sleep "$((timeout * i))"
                                fi
                        fi
                done

                rm -f "$OUT" "$OUT_CFG"
        fi
else
   echo "Restore mode activated"
fi     
pbm config --file pbm-config.yaml
exec "$@"
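One way to keep the PBM container alive in restore mode, even though mongod is unreachable, is to skip the `pbm config` push (which needs a reachable mongod) and park the process when `RESTORE_MODE` is set. This is a sketch of a possible tail for start-agent.sh, not an official approach:

```shell
#!/bin/bash
# Hypothetical restore-mode guard for start-agent.sh:
# skip the config push and keep the container alive so the
# restore can be driven manually from an exec'd shell.
if [ "$RESTORE_MODE" = "true" ]; then
    echo "Restore mode activated - keeping container alive, skipping pbm config"
    exec sleep infinity   # park the process instead of exiting
fi
```

With this in place, start the pod with `RESTORE_MODE=true`, exec into the PBM container, and drive the restore by hand; the container no longer dies when mongod stops.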