Skip to content

Database Backup and Restore

Wilson Wong edited this page Nov 20, 2024 · 13 revisions

Table of Contents

Resources

Below are some of the resources that were used in order to achieve our backup strategy:

Backup

The main goal is to be able to generate full database snapshot dumps at a regular schedule and store these files in a different location. In order to achieve this, we need an off-cluster space to store exported files, and a scheduled job that will perform the database snapshots. Currently two versions of backups are created, one version stored in a Persistent Volume Claim (PVC), another to an S3 bucket.

Storage Provision

Backing up with netapp PVC

On OCP4, the storage class we will normally be using for backup is called netapp-file-backup. More background details can be found here.

If your target namespace does not have a backup PVC yet, run the following template:

export NAMESPACE=YOURNAMESPACE

oc process -n $NAMESPACE -f https://raw.githubusercontent.com/wiki/bcgov/nr-permitconnect-navigator-service/assets/templates/backup.pvc.yaml -p NAMESPACE=$NAMESPACE -o yaml | oc -n $NAMESPACE apply -f -

You should expect to see a new PVC with the following pattern in your namespace: backup-YOURNAMESPACE.

Backing up with s3 bucket

An NRM service desk ticket is required to provision an s3 bucket for storage if they do not exist. Request should be under "Object Storage and GeoDrive Access" Note that when requesting a bucket for backup purposes, you may need to provide a retention policy.

When provisioned, bucket credentials will be provided for you.

Backup Management

For managing the backup operations, we leverage BCDevOps/backup-container. The backup-container generates database dumps as well as database dump cycling based on a specified retention schedule. It supports two modes of operation: a legacy mode where there is a long-standing deployment which sleeps between job executions, and a CronJob mode which is managed by OpenShift. After evaluating both approaches, we have elected to use the CronJob mode as it is less resource intensive on the OCP platform, needing to have a running pod instantiated only when the backup task needs to run.

Note: At the time of writing, we are currently using version 2.9.0 of the backup-container.

Network Policies

On OCP4, you will need to ensure that your backup-postgres container is allowed to connect with the Patroni database. To do that, run the following template:

export NAMESPACE=YOURNAMESPACE
export INSTANCE=YOURAPPDBINSTANCE

oc process -n $NAMESPACE -f https://raw.githubusercontent.com/wiki/bcgov/nr-permitconnect-navigator-service/assets/templates/backup.np.yaml -p INSTANCE=$INSTANCE -o yaml | oc -n $NAMESPACE apply -f -

Note: The above template assumes you are connecting to a Patroni cluster. If you are connecting to a Postgres database instead, make sure you manually update the podSelector labels accordingly depending on your deployment configuration!

If you haven't already done so, make sure you have set up dockerhub auth credentials in your namespace for the builder. More details can be found here.

Deploy

Once your NFS PVC and s3 buckets have been provisioned and your Network Policies have been applied, you can then deploy the OpenShift CronJob which will manage your database backups. You can figure out what your NFS PVC is called with the following command:

export NAMESPACE=YOURNAMESPACE

oc get -n $NAMESPACE pvc

On OCP4. you will be looking for a PVC with StorageClass of netapp-file-backup with a name following the pattern backup-YOURNAMESPACE.

With s3 buckets, you will need to setup the credentials obtained after provisioning and place them in your secrets in order for the cronjob below to access them. In the appropriate environment in OCP4, under secrets, create a category named "backup-s3" with the following keys and appropriate values:

  • S3_BUCKET
  • S3_ENDPOINT
  • S3_USER
  • S3_PASSWORD

If provided, the cronjob below will place the backups into the buckets. If the credentials are not provided, no s3 backup will be made.

A copy of the template is here: backup-cronjob.yaml Run the following, with the template file either stored locally or via a URL, updating the export variables as necessary:

export NAMESPACE=YOURNAMESPACE
export GIT_REF=2.9.0
export PVC=backup-$NAMESPACE
export DBNAME=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-name}' | base64 -d)

oc process \
-n $NAMESPACE \
-f "./backup-cronjob.yaml" \
-p JOB_PERSISTENT_STORAGE_NAME=$PVC \
-p IMAGE_REGISTRY=docker.io \
-p IMAGE_NAMESPACE=bcgovimages \
-p SOURCE_IMAGE_NAME=backup-container \
-p TAG_NAME=$GIT_REF \
-p DATABASE_SERVICE_NAME=patroni-master \
-p DATABASE_NAME=$DBNAME \
-p DATABASE_DEPLOYMENT_NAME=patroni-master \
-p DATABASE_USER_KEY_NAME=app-db-username \
-p DATABASE_PASSWORD_KEY_NAME=app-db-password \
-p SCHEDULE="0 8 * * *" \
-p DAILY_BACKUPS=7 \
-p WEEKLY_BACKUPS=8 \
-p MONTHLY_BACKUPS=3 \
-o yaml | oc -n $NAMESPACE apply -f -

Note that for the purposes of our backup retention policy, we have elected to have our CronJob run daily at 1 AM, and we will be keeping 7 daily backups, 8 weekly backups, and 3 monthly backups. With this schedule, we will be retaining at minimum the state of the database 90 days ago, and at most 119 days ago (4 months, or 120 days minus 1). We will also have more frequent resolution for recent database backups coming from the daily and weekly backups.

Updating the Backup Container Version

Change the version to the latest release in backup-postgres yaml and save:
https://console.apps.silver.devops.gov.bc.ca/k8s/ns/d9d78e-<env>/cronjobs/backup-postgres/yaml

Updating the CronJob

In the event the CronJob parameters need to be updated, you will not be able to just oc apply and edit the existing CronJob object due to an int32 parsing error. Instead, you will need to know which part of the cronjob manifest you are intending on patching over. For example, if you want to pause/suspend an existing cronjob, you could run the following:

export NAMESPACE=YOURNAMESPACE

oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"suspend":true}}'

If you want to resume a paused cronjob, you could run the following:

export NAMESPACE=YOURNAMESPACE

oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"suspend":false}}'

If you want to change the cronjob's schedule, you could run the following:

export NAMESPACE=YOURNAMESPACE

oc patch -n $NAMESPACE cronjob backup-postgres -p '{"spec":{"schedule":"0 8 * * *"}}'

If you wanted to change backup cronjobs across all of the namespaces you manage, you could run the following:

oc projects -q | xargs -t -n1 oc patch cronjob backup-postgres -p '{"spec":{"schedule":"0 8 * * *"}}' -n

Deleting the CronJob

Since there is an int32 parsing error when running oc apply on an existing cronjob, you will unfortunately need to delete the existing cronjob first, before redeploying it. After the old cronjob is cleared, re-run the steps in the previous section.

To delete the CronJob object, you can run the following:

export NAMESPACE=YOURNAMESPACE

oc delete -n $NAMESPACE cronjob backup-postgres

Manual Run

Using the schedule "0 8 * * *", the cronjob will run with a new pod at 1 AM daily. However, manual runs can also be done if you're looking to get an immediate snapshot of the state of the database (potentially for auditing or restore/migration operations). You can manually run the cronjob controller with the following:

export NAMESPACE=<YOURNAMESPACE>

oc create -n $NAMESPACE job --from=cronjob/backup-postgres "backup-postgres-manual-$(date +%s)"

If the backup runs successfully, it will show as 'Completed', and you can inspect the logs.

Make sure to clean up your temporary pod when you are done with the following:

export NAMESPACE=<YOURNAMESPACE>

# Note: When there are no jobs to delete, you will get an error for oc delete.
oc get job -n $NAMESPACE -o name | grep -F -e '-manual-' | xargs oc delete -n $NAMESPACE

Recovery

While there exists documentation on how to leverage the backup-container to perform a restore here, because we are running the backup in cron mode, there will not be any long-running pod containers that you would be able to oc rsh into. Because of this, we will need to manually connect to our database and perform the restore procedure ourselves.

Acquire Backup

If the database in question is being regularly backed up with the backup-container, we can leverage the dump generated by that cronjob to perform the restore. You can utilize either the PVC or S3 backup.

Netapp PVC backup

In order to get the backup files, we will need to debug the pod representing the latest run of the cronjob. The following command will figure out which is the most recent pod, and start up a debug instance of it:

export NAMESPACE=<YOURNAMESPACE>

oc debug -n $NAMESPACE $(oc get -n $NAMESPACE pods -o name | grep "backup-postgres" | tail -n1)

Once your debug pod is running, you can run the command ./backup.sh -l to list all of the backups available on the mounted backup PVC. Make a note of which directory you want to get - normally you will be looking for the latest one under the /backups/daily/<DATE> directory, where DATE represents today in YYYY-MM-DD format.

To copy the latest backup folder to your local machine, run the following:

export NAMESPACE=<YOURNAMESPACE>

mkdir backups
oc rsync -n $NAMESPACE $(oc get -n $NAMESPACE pods -o name | grep "debug$" | tail -n1 | sed "s|pod/||g"):/backups/daily/$(date +%Y-%m-%d)/ ./backups

If the operation was successful, you should see at least one file in the local backups folder with an extension of .sql.gz. Make a note of this filename as you will need it later.

S3 bucket backup

You can use a file browser to download the file.

Utilize the bucket credentials to access the bucket, then download the correct backup file with an extension of .sql.gz. Make a note of this filename as you will need it later.

Connect to Remote DB

To wipe and restore the DB, you will need a working connection to your target database by running oc port-forward. If you are connecting to a Patroni DB on OpenShift, you can run the following command:

export NAMESPACE=<YOURNAMESPACE>
export DBPORT=<YOURDBPORT>

oc -n $NAMESPACE port-forward $(oc -n $NAMESPACE get pods -o name --selector role=master,cluster-name=master) $DBPORT:5432

If successful, your local machine will be able to forward any network traffic to localhost:15432 to the remote Patroni cluster named master. The subcommand will auto-select the correct Patroni replica which is serving as the master.

Wipe Existing DB

Before wiping the database, you must terminate ALL open network connections to the database. You should temporarily drop any dependent client application replica counts to 0. You may need to kill the master Patroni node if there are still residual connections preventing a database drop.

Assuming you have a working connection to your target database, run the following commands to do a hard wipe of the database. This is a destructive operation: make sure you know what you are doing before proceeding!

export DBHOST=<YOURDBHOST>
export DBPORT=<YOURDBPORT>

export DBNAME=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-name}' | base64 -d)
export DBSUPERUSER=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.superuser-username}' | base64 -d)
export DBUSER=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-username}' | base64 -d)
export PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.superuser-password}' | base64 -d)

# terminate current connections and drop database:
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "SELECT pg_terminate_backend(pg_stat_activity.pid) FROM pg_stat_activity WHERE pg_stat_activity.datname = '$DBNAME' AND pid <> pg_backend_pid()" -ac "DROP DATABASE $DBNAME;"
# create an empty database
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "CREATE DATABASE $DBNAME;"
psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "GRANT ALL ON DATABASE $DBNAME TO $DBUSER;"

You can check to see if the database has been re-instantiated correctly with the following command:

psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -ac "\l"

Restore from Backup

Once you have verified you have a fresh and empty database from the previous section, you can leverage your backup file to repopulate the schema and restore your data. Unzip the backup use that as the source to restore from in the following command.

Note 1: The backup generated by backup-container will automatically assign each table to be owned by an app user. For most of our cases, the user will just be called app. If you get any errors related to the role app not being found, you can either hard redeploy your Patroni cluster, or manually jump into the DB and create the missing user.

_Note 2: roles app, postgres and replication may already exist if patroni is running. You can edit your sql file to remove those CREATE role statements.

Note 3: If you end up getting an error like stdin is not a tty on Windows MINGW64, replace psql with psql.exe instead (SO Reference).

Note 4: DBPORT can be any unused port on your system. However, using 15432 is usually a safe bet.

export BACKUPSOURCE=<YOURSOURCEFILE>
export NAMESPACE=<YOURNAMESPACE>
export DBHOST=<YOURDBHOST>
export DBPORT=<YOURDBPORT>
export DBNAME=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-name}' | base64 -d)
export DBSUPERUSER=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.superuser-username}' | base64 -d)
export DBUSER=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-username}' | base64 -d)
export PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.superuser-password}' | base64 -d)

psql -h $DBHOST -p $DBPORT -U $DBSUPERUSER -d $DBNAME -f $BACKUPSOURCE

You can check to see if the database tables have been restored correctly with the following command:

PGPASSWORD=$(oc -n $NAMESPACE get secret patroni-master -o jsonpath='{.data.app-db-password}' | base64 -d) psql -h $DBHOST -p $DBPORT -U $DBUSER -d $DBNAME -c "\d"

If you temporarily set your client application replica count to 0, you should revert it back to the original count. Your application should now be functional assuming the database restore was successful.