
feat: allow migration jobs to be cleaned up after finishing #96

Closed
wants to merge 2 commits

Conversation

michael-lavacca-rw

@michael-lavacca-rw michael-lavacca-rw commented Jan 5, 2024

Description

The datastore migration job does not clean itself up by default. This PR adds an optional field .datastore.migrateJobTtlSecondsAfterFinished that lets the job be cleaned up by the Kubernetes TTL controller after the specified number of seconds.

For our team's purposes, this also lets us have repeatable Helm deployments in CI, so that new jobs do not conflict with old ones by name. For reference, the error was:
Error: UPGRADE FAILED: cannot patch "appName-openfga-migrate" with kind Job: Job.batch "appName-openfga-migrate" is invalid:...
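A sketch (not necessarily the chart's actual template) of how the new value could be wired into the migration Job manifest; the `openfga.fullname` helper and container details here are assumed for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "openfga.fullname" . }}-migrate
spec:
  # Only rendered when the new optional value is set; the k8s TTL
  # controller then deletes the Job this many seconds after it finishes.
  {{- if .Values.datastore.migrateJobTtlSecondsAfterFinished }}
  ttlSecondsAfterFinished: {{ .Values.datastore.migrateJobTtlSecondsAfterFinished }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: openfga/openfga
          args: ["migrate"]
```

The value would then be opted into per deployment, e.g. `--set datastore.migrateJobTtlSecondsAfterFinished=60`.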

References

From k8s docs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically

Review Checklist

  • I have clicked on "allow edits by maintainers".
  • I have added documentation for new/changed functionality in this PR or in a PR to openfga.dev [Provide a link to any relevant PRs in the references section above]
  • The correct base branch is being used, if not main
  • I have added tests to validate that the change in functionality is working as expected

(I could not find documentation besides schema descriptors nor tests for these fields - please let me know if they exist and I will update accordingly)

@michael-lavacca-rw michael-lavacca-rw requested a review from a team as a code owner January 5, 2024 19:39

linux-foundation-easycla bot commented Jan 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@rhamzeh rhamzeh changed the title Allow migration jobs to be cleaned up after finishing feat: allow migration jobs to be cleaned up after finishing Jan 8, 2024
@michael-lavacca-rw
Author

Note that this is another solution to the issue seen in #69

@jliedy
Contributor

jliedy commented Jan 11, 2024

Just as a heads-up, this solution is incompatible with v2.0 of k8s-wait-for: v2.0 only returns success if the job exists and was successful. If pods get restarted, or the pod count scales up, new pods will fail to start once the job no longer exists. You would have to pin the version to v1.6 for this to work reliably.
In addition, you risk the application starting successfully even though the migration job failed, because ttlSecondsAfterFinished also cleans up jobs that failed.
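For context, this is roughly the pattern in question: application pods gate their startup on the migration Job via a k8s-wait-for init container. The job name below is illustrative; a TTL-deleted Job would make this check fail under v2.0 semantics:

```yaml
initContainers:
  - name: wait-for-migrate
    # v2.0 only succeeds if the Job exists AND completed successfully,
    # so a Job deleted by the TTL controller blocks restarted or
    # scaled-up pods; v1.6 is the version suggested as reliable here.
    image: groundnuty/k8s-wait-for:v1.6
    args: ["job", "example-openfga-migrate"]
```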

@michael-lavacca-rw
Author

michael-lavacca-rw commented Jan 11, 2024

Hi @jliedy,

Thank you for the heads-up. I was not aware of the difference in the new version of k8s-wait-for that is now being used.

When trying the new version openfga-0.1.33 that includes your change, we are now unable to complete a Helm deploy, receiving only this error: Error: UPGRADE FAILED: timed out waiting for the condition. We reverted to openfga-0.1.32 and still have to delete the old jobs manually before each new deploy. We're trying to debug from the Kubernetes side, but it seems like an issue others might hit as well.

For reference, we are updating Helm charts through the Bitbucket pipeline yvogl/aws-eks-helm-deploy:1.1.0, which was working with openfga-0.1.32. Unfortunately, trying both WAIT="true" and WAIT="false" makes no difference.

@jliedy
Contributor

jliedy commented Jan 12, 2024

A super handy Helm plugin can be found at https://github.com/databus23/helm-diff

It shows you all the changes an upgrade will make to a deployment before you actually apply it. It's great for validating things.

As far as troubleshooting: if you're not enabling the wait flag, I'd start by assuming that Helm is unable to run, or is having difficulty running, the hook that deletes the job. I'd check the version of Helm your tool is using, and also verify that the migration job actually gets deleted during the deployment (or delete the job manually).

You could also try running the helm upgrade command with the --no-hooks flag, which should ignore the annotation. Alternatively, you could set your values file to zero out the new default annotations for the migration job.
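The suggestions above can be sketched as commands; the release and chart names are placeholders, not the actual deployment:

```shell
# Install the helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# Preview what an upgrade would change before actually applying it
helm diff upgrade my-release openfga/openfga -f values.yaml

# Apply the upgrade while skipping chart hooks entirely
helm upgrade my-release openfga/openfga -f values.yaml --no-hooks
```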

@jon-whit
Collaborator

@michael-lavacca-rw Given that we're currently pinned to v2.0 of k8s-wait-for (see this line), I think it would be wise to avoid the side-effects that @jliedy has pointed out.

The original migration job was inspired by many other Helm charts I have seen, particularly in the bitnami repository. It is a common pattern, and the standard in the Helm ecosystem seems to be to use chart hooks to tear down resources like this after deployment. For that reason, I'm leaning toward proceeding with the approach already implemented in #70.

Can we close in favor of the work already done in #70?

@michael-lavacca-rw
Author

Hi guys,

Yes I'll close this PR.

I have been able to get the latest 0.1.33 working, with the Helm hooks taking care of the teardown. The Bitbucket deploy method I was using has a WAIT= parameter we needed previously; omitting it altogether was the trick now (it did not work set to either true or false).

I'm actually trying to adopt this method now for our own custom job that does the authorization model "migration".

Thanks for the notes and a better, working solution to my issue!
