
feat: allow migration jobs to be cleaned up after finishing #96

Closed
wants to merge 2 commits

Conversation

michael-lavacca-rw

@michael-lavacca-rw michael-lavacca-rw commented Jan 5, 2024

Description

The datastore migration job does not clean itself up by default. This PR adds an optional field .datastore.migrateJobTtlSecondsAfterFinished that lets the job be cleaned up by the Kubernetes TTL controller after the specified number of seconds.

For our team's purposes, this also lets us have repeatable Helm deployments in CI, so that new jobs do not conflict with old ones by name. For reference, the error was:
Error: UPGRADE FAILED: cannot patch "appName-openfga-migrate" with kind Job: Job.batch "appName-openfga-migrate" is invalid:...
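A sketch (not necessarily the chart's actual template) of how the new value could be wired into the migration Job manifest; the `openfga.fullname` helper and container details here are assumed for illustration:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "openfga.fullname" . }}-migrate
spec:
  # Only rendered when the new optional value is set; the k8s TTL
  # controller then deletes the Job this many seconds after it finishes.
  {{- if .Values.datastore.migrateJobTtlSecondsAfterFinished }}
  ttlSecondsAfterFinished: {{ .Values.datastore.migrateJobTtlSecondsAfterFinished }}
  {{- end }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: openfga/openfga
          args: ["migrate"]
```

The value would then be opted into per deployment, e.g. `--set datastore.migrateJobTtlSecondsAfterFinished=60`.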

References

From k8s docs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically

Review Checklist

  • I have clicked on "allow edits by maintainers".
  • I have added documentation for new/changed functionality in this PR or in a PR to openfga.dev [Provide a link to any relevant PRs in the references section above]
  • The correct base branch is being used, if not main
  • I have added tests to validate that the change in functionality is working as expected

(I could not find documentation besides schema descriptors nor tests for these fields - please let me know if they exist and I will update accordingly)

@michael-lavacca-rw michael-lavacca-rw requested a review from a team as a code owner January 5, 2024 19:39

linux-foundation-easycla bot commented Jan 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@rhamzeh rhamzeh changed the title Allow migration jobs to be cleaned up after finishing feat: allow migration jobs to be cleaned up after finishing Jan 8, 2024
@michael-lavacca-rw
Author

Note that this is another solution to the issue seen in #69

@jliedy
Contributor

jliedy commented Jan 11, 2024

Just as a heads-up, this solution is incompatible with v2.0 of k8s-wait-for: v2.0 only returns success if the job exists and was successful. If pods get restarted, or the pod count scales up, new pods will fail to start once the job no longer exists. You would have to pin the version to v1.6 for this to work reliably.
In addition, you risk the application starting successfully even though the migration job failed, because ttlSecondsAfterFinished also cleans up jobs that failed.
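For context, this is roughly the pattern in question: application pods gate their startup on the migration Job via a k8s-wait-for init container. The job name below is illustrative; a TTL-deleted Job would make this check fail under v2.0 semantics:

```yaml
initContainers:
  - name: wait-for-migrate
    # v2.0 only succeeds if the Job exists AND completed successfully,
    # so a Job deleted by the TTL controller blocks restarted or
    # scaled-up pods; v1.6 is the version suggested as reliable here.
    image: groundnuty/k8s-wait-for:v1.6
    args: ["job", "example-openfga-migrate"]
```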

@michael-lavacca-rw
Author

michael-lavacca-rw commented Jan 11, 2024

Hi @jliedy,

Thank you for the heads-up. I was not aware of the difference in the new version of k8s-wait-for that is now being used.

When trying the new version openfga-0.1.33 that includes your change, we are now unable to complete a Helm deploy, receiving only this error: Error: UPGRADE FAILED: timed out waiting for the condition. We reverted to openfga-0.1.32 and still have to delete the old jobs manually before each new deploy. We're trying to debug from the Kubernetes side, but it seems like an issue others might hit as well.

For reference, we are updating Helm charts through the Bitbucket pipeline yvogl/aws-eks-helm-deploy:1.1.0, which was working with openfga-0.1.32. Unfortunately, trying both WAIT="true" and WAIT="false" makes no difference.

@jliedy
Contributor

jliedy commented Jan 12, 2024

A super handy Helm plugin can be found at https://github.com/databus23/helm-diff

It shows you all the changes an upgrade will make to a deployment before you actually apply it. It's great for validating things.

As far as troubleshooting: if you're not enabling the wait flag, I'd start by assuming that Helm is unable to run, or is having difficulty running, the hook that deletes the job. I'd check the version of Helm your tool is using, and also verify that the migration job actually gets deleted during the deployment (or delete the job manually).

You could also try running the helm upgrade command with the --no-hooks flag, which should ignore the annotation. Alternatively, you could set your values file to zero out the new default annotations for the migration job.
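The suggestions above can be sketched as commands; the release and chart names are placeholders, not the actual deployment:

```shell
# Install the helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# Preview what an upgrade would change before actually applying it
helm diff upgrade my-release openfga/openfga -f values.yaml

# Apply the upgrade while skipping chart hooks entirely
helm upgrade my-release openfga/openfga -f values.yaml --no-hooks
```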

@jon-whit
Collaborator

@michael-lavacca-rw Given that we're currently pinned to v2.0 of k8s-wait-for (see this line), I think it would be wise to avoid the side-effects that @jliedy has pointed out.

The original migration job was inspired by many other Helm charts I have seen, particularly in the bitnami repository. It is a common pattern, and the standard in the Helm ecosystem seems to be to use chart hooks to tear down resources like this after deployment. For that reason, I'm leaning toward proceeding with the approach already implemented in #70.

Can we close in favor of the work already done in #70?

@michael-lavacca-rw
Author

Hi guys,

Yes I'll close this PR.

I have been able to get the latest 0.1.33 working, with the Helm hooks taking care of the teardown. The Bitbucket deploy method I was using has a WAIT= parameter we needed previously; omitting it altogether was the trick now (it did not work set to either true or false).

I'm actually trying to adopt this method now for our own custom job that does the authorization model "migration".

Thanks for the notes and a better, working solution to my issue!
