Deploy filestore backups to all GCP clusters #4683

Closed
3 tasks done
Tracked by #4391
yuvipanda opened this issue Aug 26, 2024 · 17 comments

@yuvipanda
Member

yuvipanda commented Aug 26, 2024

Enable GCP filestore backups (the gcp-filestore-backups project) on all GCP clusters, retaining backups for a maximum of 2 days to minimise costs. Documentation for enabling this feature is here: https://infrastructure.2i2c.org/howto/filesystem-management/filesystem-backups/enable-backups/#gcp

Definition of Done

  • gcpFilestoreBackups is enabled on all GCP clusters (see list in comment below)
  • It has been verified that backups are created via the GCP console
  • It has been verified that backups are deleted after 2 days via the GCP console
@yuvipanda
Member Author

@sgibson91 and I have decided to defer this to October based on current engineering capacity.

@sgibson91
Member

sgibson91 commented Jan 30, 2025

sgibson91 self-assigned this Feb 6, 2025
@sgibson91
Member

Deploying this to the 2i2c-uk cluster produced a lot of odd errors from Helm:

Error: UPGRADE FAILED: Unable to continue with update: ServiceAccount "default" in namespace "support" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "support"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "support"
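The error lists the ownership metadata Helm checks before it will adopt an existing object into a release. Kubernetes creates a ServiceAccount named "default" in every namespace, and it carries none of this metadata, so if the chart presumably tried to manage a ServiceAccount called "default" in the support namespace, Helm would refuse. The Python below is only a rough model of that check (Helm itself is written in Go); the helper name ownership_errors is hypothetical, and the label and annotation keys are copied from the error message.

```python
"""Rough model of the ownership check described in Helm's error message.

Hypothetical helper for illustration only; this is not Helm's code.
"""


def ownership_errors(obj: dict, release_name: str, release_namespace: str) -> list[str]:
    """Return the ownership problems that would block Helm from adopting `obj`."""
    metadata = obj.get("metadata", {})
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}
    errors = []
    # Label marking the object as managed by Helm.
    if labels.get("app.kubernetes.io/managed-by") != "Helm":
        errors.append('label "app.kubernetes.io/managed-by" must be set to "Helm"')
    # Annotations tying the object to one specific release.
    expected = {
        "meta.helm.sh/release-name": release_name,
        "meta.helm.sh/release-namespace": release_namespace,
    }
    for key, value in expected.items():
        if annotations.get(key) != value:
            errors.append(f'annotation "{key}" must be set to "{value}"')
    return errors


# The namespace's built-in "default" ServiceAccount has no Helm metadata.
default_sa = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": "default", "namespace": "support"},
}
for err in ownership_errors(default_sa, "support", "support"):
    print(err)
```

Giving the chart's ServiceAccount its own name, as happens later in this thread with name: gcp-filestore-backups-sa, avoids touching the built-in "default" account at all.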

@sgibson91
Member

OK, this error is also cropping up for awi-ciroh, so I suspect it will affect all of the clusters.

@sgibson91
Member

Aha! This line is present in the 2i2c cluster's support config, but not in the docs:

name: gcp-filestore-backups-sa

Adding it lets the Helm rollout succeed, but the pod is now erroring.

I will fix the docs and investigate the crashing pod.

@sgibson91
Member

sgibson91 commented Feb 6, 2025

Cause of the pod erroring:

ERROR: (gcloud.filestore.backups.create) NOT_FOUND: The resource "projects/two-eye-two-see-uk/locations/europe-west2/instances/two-eye-two-see-uk-homedirs" was not found. This command is authenticated as two-eye-two-filestore-backup@two-eye-two-see-uk.iam.gserviceaccount.com which is the active account specified by the [core/account] property.

It is using the correct service account, but perhaps that account lacks the right permissions. It should have them if the Terraform is correct, though.

@sgibson91
Member

Fixed! I had defined the zone incorrectly in the config.
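The NOT_FOUND message shows the fully qualified name gcloud looked up, in the form projects/<project>/locations/<location>/instances/<instance>. In the failing run the location was the region europe-west2, while the container arguments later in this thread pass the zone europe-west2-b, which suggests the config held a region where a zone was expected (assuming the homedirs instance is a zonal one). The sketch below builds that name and applies a simple heuristic to tell zones from regions; both helpers are hypothetical and the heuristic only covers the usual <area>-<regionN>-<letter> naming pattern.

```python
def filestore_instance_path(project: str, location: str, instance: str) -> str:
    """Build a Filestore instance resource name (hypothetical helper)."""
    return f"projects/{project}/locations/{location}/instances/{instance}"


def looks_like_zone(location: str) -> bool:
    """Heuristic: a zone is a region name plus a '-<single letter>' suffix."""
    parts = location.split("-")
    return len(parts) == 3 and len(parts[2]) == 1 and parts[2].isalpha()


project = "two-eye-two-see-uk"
instance = "two-eye-two-see-uk-homedirs"
for location in ("europe-west2", "europe-west2-b"):
    # Only the zonal path points at where a zonal instance actually lives.
    print(filestore_instance_path(project, location, instance), looks_like_zone(location))
```

A check like looks_like_zone could be run on the config value before calling gcloud, so a region slipped into the zone field fails early with a clear message instead of a NOT_FOUND.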

@sgibson91
Member

All GCP clusters now have gcpFilestoreBackups enabled with a retention period of 2 days. I will monitor them over the next couple of days to check that (i) backups are being created and (ii) the correct number of backups is being retained.
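With a 2-day retention period and one backup per day (the later logs show --backup-freq-days=1), each filestore should settle at about two backups. A minimal sketch of the pruning rule, assuming any backup created before now minus the retention period is deleted; this is an illustration, not the logic in gcp-filestore-backups.py, and backups_to_delete plus the backup names are made up.

```python
from datetime import datetime, timedelta, timezone


def backups_to_delete(
    create_times: dict[str, datetime], retention_days: int, now: datetime
) -> list[str]:
    """Return backups created before the retention cutoff, oldest first (sketch)."""
    cutoff = now - timedelta(days=retention_days)
    expired = [(created, name) for name, created in create_times.items() if created < cutoff]
    return [name for _, name in sorted(expired)]


now = datetime(2025, 2, 10, 12, 0, tzinfo=timezone.utc)
# Four daily backups (hypothetical names), like the four seen on 2i2c-uk.
backups = {f"backup-{i}": now - timedelta(days=i, hours=12) for i in range(4)}
to_delete = backups_to_delete(backups, retention_days=2, now=now)
print(to_delete)
```

Here the two backups aged 2.5 and 3.5 days fall past the cutoff, leaving two, which is the count the Definition of Done expects.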

@sgibson91
Member

I have verified via the GCP console that all clusters have had a backup created. I will check again on Monday that we are only retaining 2 days' worth of backups.

@sgibson91
Member

The backups are not being cleaned out. On 2i2c-uk, we have 4 backups instead of 2. Investigating.

@sgibson91
Member

I opened 2i2c-org/gcp-filestore-backups#26, which I believe will help track down the root cause, and will pick it up next as unplanned work.

@sgibson91
Member

Running kubectl describe on the deployment, I can see that the retention period variable is not being passed to the script via the Helm config, despite being set:

Containers:
  gcpfilestorebackups:
    Image:      quay.io/2i2c/gcp-filestore-backups:0.0.1-0.dev.git.53.h8b9558e
    Port:       <none>
    Host Port:  <none>
    Command:
      python
      gcp-filestore-backups.py
    Args:
      two-eye-two-see-uk-homedirs
      two-eye-two-see-uk
      europe-west2-b
    Environment:   <none>
    Mounts:        <none>

I'm going to tweak a few things, make a new release of the chart, and see whether that resolves the issue.

@sgibson91
Member

sgibson91 commented Feb 10, 2025

What's happening here is that the Helm templating is mangling all of my input arguments to the Python script, causing errors because values are not being paired with the right parameters. I'll keep battling.
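Positional arguments, like the three in the describe output above, are matched purely by order, so if templating drops or reorders one value, every later value lands on the wrong parameter. Rendering each value as an explicit --flag=value pair, the form the later logs show, removes that dependence on order. A small sketch, assuming the values are a flat mapping whose keys match the script's flag names; render_args and this dict layout are hypothetical, not the chart's actual values schema.

```python
def render_args(config: dict[str, object]) -> list[str]:
    """Turn {"retention-days": 2} into ["--retention-days=2"] (sketch)."""
    # Each value carries its own flag name, so order no longer matters.
    return [f"--{key}={value}" for key, value in config.items()]


config = {
    "filestore-name": "two-eye-two-see-uk-homedirs",
    "project": "two-eye-two-see-uk",
    "zone": "europe-west2-b",
    "retention-days": 2,
    "backup-freq-days": 1,
}
args = render_args(config)
print(args)
```

The same idea applies inside a Helm template: emitting one named flag per value means a missing value shows up as a missing flag rather than silently shifting its neighbours.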

@sgibson91
Member

k logs support-gcpfilestorebackups-7d965dcb84-jrp84
2025-02-10 18:06:20.383 | INFO     | __main__:<module>:223 - command that was run: ['gcp-filestore-backups.py', '--filestore-name=two-eye-two-see-uk-homedirs', '--project=two-eye-two-see-uk', '--zone=europe-west2-b', '--retention-days-=2', '--backup-freq-days=1']
usage: gcp-filestore-backups.py [-h] [--filestore-name FILESTORE_NAME]
                                [--project PROJECT] [--zone ZONE]
                                [--filestore-share-name FILESTORE_SHARE_NAME]
                                [--retention-days RETENTION_DAYS]
                                [--backup-freq-days BACKUP_FREQ_DAYS]
gcp-filestore-backups.py: error: unrecognized arguments: --retention-days-=2

I've narrowed it down to this error, but I know I've already fixed that typo (--retention-days- instead of --retention-days). For some reason, the latest Helm chart release hasn't picked up the fix yet. Tomorrow, I'll force another release.
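The usage and error lines look like Python's argparse output, and the failure is easy to reproduce: argparse accepts an exact option name or an unambiguous prefix of one, and --retention-days- is neither, so the token is left unmatched. The parser below is rebuilt from the usage text in the log; the int types are assumptions and this is not the script's actual source. parse_known_args is used so the leftover token can be inspected instead of exiting.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Options reconstructed from the usage text in the log; types are assumed.
    parser = argparse.ArgumentParser(prog="gcp-filestore-backups.py")
    parser.add_argument("--filestore-name")
    parser.add_argument("--project")
    parser.add_argument("--zone")
    parser.add_argument("--filestore-share-name")
    parser.add_argument("--retention-days", type=int)
    parser.add_argument("--backup-freq-days", type=int)
    return parser


parser = build_parser()
common = [
    "--filestore-name=two-eye-two-see-uk-homedirs",
    "--project=two-eye-two-see-uk",
    "--zone=europe-west2-b",
    "--backup-freq-days=1",
]

# The typo'd flag matches no option, so it is returned as a leftover;
# parse_args() would report it as "unrecognized arguments" and exit.
args, extras = parser.parse_known_args(common + ["--retention-days-=2"])
print(extras)  # ['--retention-days-=2']
print(args.retention_days)  # None

# With the typo fixed, the value parses as expected.
args, extras = parser.parse_known_args(common + ["--retention-days=2"])
print(extras, args.retention_days)  # [] 2
```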

@sgibson91
Member

To get here, I had to drop support for backing up multiple filestores and drastically simplify both the Helm templating and how arguments are passed to the Python script. Since I expect we will want to migrate everyone to a per-user storage solution once that is ready, I don't think losing this feature is critical.

@sgibson91
Member

Forcing a new release of the helm chart worked!

@sgibson91
Member

All the projects now only have two backups 🎉

2 participants