This document walks through considerations for using Velero to back up TAP. As of TAP 1.6, using Velero to back up and recover TAP is NOT tested or supported.
- You are using a multi-cluster reference architecture.
- You are using GitOps RI to install and manage your TAP clusters.
- The Pod Convention Services need to be restarted to pick up new certificates.
- Builds for existing Workloads are in a stuck status.
- New workloads applied to the cluster work successfully after fixing certificate issues.
- TMC will no longer manage TAP on the new cluster (if TMC installed TAP).
- Unidentified issues with TBS and old build runs (needs research).
- Velero won't restore cluster-level resources if you recover only a namespace.
- How will the IRSA/Service Account setup work on a recovered cluster?
- What are the impacts to the GitOps RI on the new cluster?
- If you've automated both your TAP installation and the management of your Workloads, Deliverables, and configurations via GitOps, it's best to let GitOps reconcile your cluster and workloads to a healthy state.
- Consider Velero for backup and recovery of resources such as PVCs or specific applications/namespaces, but not for full cluster recovery.
When determining the backup strategy for your cluster, consider a layered, ordered approach to recovery. This might involve using TMC/Terraform to stand up the cluster itself, using GitOps to install the base components of TAP, and then running Velero restores for individual namespaces or applications in a defined sequence.
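The ordered Velero step of that layered approach can be sketched as a small script that assembles one `velero restore create` command per namespace, in the order they should be restored. This is only an illustration: the backup name and the namespace list are hypothetical placeholders, and the script prints the commands rather than running them.

```python
import shlex

# Hypothetical values for illustration; replace with your own backup name
# and the namespace order your applications actually require.
BACKUP_NAME = "run-cluster-nightly"
RESTORE_ORDER = ["shared-services", "app-config", "apps"]


def restore_command(backup: str, namespace: str) -> list[str]:
    """Build a `velero restore create` invocation for a single namespace."""
    return [
        "velero", "restore", "create", f"{backup}-{namespace}",
        "--from-backup", backup,
        "--include-namespaces", namespace,
        "--wait",  # block until this restore completes before the next one starts
    ]


def restore_plan(backup: str, namespaces: list[str]) -> list[list[str]]:
    """Return the restore commands in the order the namespaces were given."""
    return [restore_command(backup, ns) for ns in namespaces]


if __name__ == "__main__":
    # Print the plan for review; run each command only after the cluster and
    # the base TAP components have been re-created by the earlier layers.
    for cmd in restore_plan(BACKUP_NAME, RESTORE_ORDER):
        print(shlex.join(cmd))
```

Printing the plan first makes the sequence reviewable before anything touches the cluster; the `--wait` flag is what keeps the restores from overlapping when the commands are run one after another.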
Each cluster type has its own backup/recovery needs and RPO/RTO targets.
Considerations for the View Cluster include the installed databases, workshops, and accelerators:
- Are the databases for metadata store and TAP GUI externalized? If they are, Velero isn't needed for them.
- What are their backup/snapshot schedules?
- How are you installing/managing Accelerators? GitOps?
- How are you installing/managing Workshops? GitOps?
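If the View Cluster databases are not externalized, a recurring Velero schedule for their namespaces is one option. The sketch below assembles a `velero schedule create` command; the schedule name, cron expression, retention period, and the `metadata-store` namespace are assumptions for illustration and should be checked against your own installation.

```python
import shlex


def schedule_command(name: str, cron: str, namespaces: list[str], ttl: str) -> list[str]:
    """Build a `velero schedule create` invocation for the given namespaces."""
    return [
        "velero", "schedule", "create", name,
        "--schedule", cron,                            # standard cron expression
        "--include-namespaces", ",".join(namespaces),  # comma-separated list
        "--ttl", ttl,                                  # how long each backup is kept
    ]


if __name__ == "__main__":
    # Assumed values: nightly at 02:00, backups kept for 7 days.
    cmd = schedule_command(
        name="view-db-nightly",
        cron="0 2 * * *",
        namespaces=["metadata-store"],
        ttl="168h0m0s",
    )
    print(shlex.join(cmd))
```

The cron expression and retention period should follow from the RPO/RTO targets for the View Cluster rather than from these placeholder values.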
Considerations for the Build Cluster include the following:
- Do you need to maintain build history?
- How are you managing the creation of developer namespaces? Can you recover them via GitOps?
- Can the sizing/throughput of your Build Cluster manage rebuilding all workloads?
Considerations for the Run Cluster include the following:
- Do you have an Active/Active, Active/Passive, etc. setup?
- Are your Kubernetes resources (Deliverables) managed via GitOps? Can you use GitOps to recover workloads?
- How are you managing and installing configurations/secrets? GitOps? Can you use GitOps to recover configurations?
- What are the backup/recovery needs for Shared Services installed on the cluster? Do these have their own backup mechanisms?
- If applications use PersistentVolumes, consider recovering them via Velero.
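For the PersistentVolume case, a namespace-scoped backup that also snapshots volumes might look like the sketch below. The backup name, namespace, and retention period are hypothetical, and volume snapshots only work if a suitable volume snapshot provider is configured for Velero in your environment.

```python
import shlex


def backup_command(name: str, namespaces: list[str], ttl: str = "720h0m0s") -> list[str]:
    """Build a `velero backup create` invocation that snapshots volumes."""
    return [
        "velero", "backup", "create", name,
        "--include-namespaces", ",".join(namespaces),
        "--snapshot-volumes=true",  # snapshot PersistentVolumes used in these namespaces
        "--ttl", ttl,               # retention period for this backup
    ]


if __name__ == "__main__":
    # Hypothetical application namespace on the Run Cluster.
    print(shlex.join(backup_command("orders-app-backup", ["orders"])))
```

Scoping the backup to the namespaces that actually own stateful data keeps Velero in the targeted role described above, with GitOps handling the rest of the cluster.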
Considerations for the Iterate Cluster include the following:
- Are the databases for metadata store externalized? If they are, Velero isn't needed for them.
- What are their backup/snapshot schedules?
- Do you need to maintain build history?
- How are you managing the creation of developer namespaces? Can you recover them via GitOps?
- Can the sizing/throughput of your Iterate Cluster manage rebuilding all workloads?
Overall, the Iterate Cluster is typically ephemeral in nature. Consider whether workload recovery is needed.
We recommend excluding certain cert-manager resources when backing up and recovering. You can find guidance here: