Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Argo Workflows to latest version #1639

Merged
merged 25 commits into from
Apr 26, 2023

Conversation

Adam-D-Lewis
Copy link
Member

@Adam-D-Lewis Adam-D-Lewis commented Feb 1, 2023

Reference Issues or PRs

Related to #1495, but I wouldn't say this closes it necessarily.

What does this implement/fix?

Enables basic functionality of previously broken argo integration.

The Argo Workflows integration did not allow end users to submit workflows. I looked into this a bit, and found at the moment, you can attempt to submit workflows in 2 ways:
1) From the UI
2) using the argo CLI

Neither way was working for the example argo workflow. This was because the workflow needed some permissions which were not granted by default. This comment (argoproj/argo-workflows#2522 (comment)) gives a decent review of the 3 relevant service accounts at play. We haven't adequately configured the 3rd one listed, the container service account used by the actual pods spun up as part of the workflow. By default, the pods use the default service account for the namespace where the workflow is deployed. That default service account has no permissions by default.

I solved this issue by setting the helm chart configuration to create a service account that workflows can use. I then set the workflow controller config to use that service account by default for all workflows (the service account used can still be overridden). I also upgraded the helm chart, and added some additional user guide documentation on how to submit a workflow from the Argo CLI.

Put a x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally? I tested by running a local kind cluster and submitting workflows as explained in the docs I added.
  • Did you add new tests?

Comment on lines +20 to +21
You can submit or manage workflows via the Argo CLI. The Argo CLI can be downloaded from the
[Argo Releases](https://github.com/argoproj/argo-workflows/releases) page. After downloading the CLI, you can get your
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done outside of Nebari. I'm not sure if we want to provide some better solution to install the CLI tool inside Nebari. It's not available via conda that I could find.

Copy link
Member Author

@Adam-D-Lewis Adam-D-Lewis Feb 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Users can download the binary from the releases page and use it inside of Nebari as well

Copy link
Member

@costrouc costrouc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested out this PR but I'll like @iameskild to review this PR so that we can have multiple experts with this tool. I think that Argo is critical to Nebari and we need to ensure the user experience is great with this component. kbatch looks like it is not actively being managed and this will likely be our long term workflow solution.

@costrouc costrouc requested a review from iameskild February 2, 2023 15:26
@costrouc
Copy link
Member

costrouc commented Feb 2, 2023

@iameskild can you review/work with @Adam-D-Lewis on this PR?

Also how do we schedule to run the Kubernetes tests? I would like if possible for this PR to aim to have an integration test with using argo so we know that it works.

Comment on lines +30 to +31
To turn off the cluster monitoring on Nebari deployments, simply turn off the feature flag within your
`nebari-config.yaml` file. For example:
Copy link
Member Author

@Adam-D-Lewis Adam-D-Lewis Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should say "To disable Argo Workflows"

@kcpevey
Copy link
Contributor

kcpevey commented Feb 2, 2023

I moved the documentation updates to here

@iameskild
Copy link
Member

/bot run tests

@viniciusdc
Copy link
Contributor

@iameskild do you have an idea why the Kubernetes tests are not running for this?

Copy link
Contributor

@viniciusdc viniciusdc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks excellent, tested on Local/AWS instance. We need to move the docs to the new one and also we need a bigger task example to run.

@trallard trallard added type: enhancement 💅🏼 New feature or request needs: review 👀 This PR is complete and ready for reviewing area: integration/Argo needs: tests ✅ This contribution is missing tests and removed needs: review 👀 This PR is complete and ready for reviewing labels Feb 7, 2023
@kcpevey
Copy link
Contributor

kcpevey commented Feb 7, 2023

@ericdatakelly may have a better example to run

@ericdatakelly
Copy link
Contributor

I think we need several examples to cover different use cases. This looks like a good start, so how about if we make some new issues for other use cases? I'm working on the case in which a user wants to run an ML pipeline with multiple tasks that depend on output/artifacts from previous steps. I was running it from the ARGO UI. Now that Hera Workflows are working, I can move to that implementation (in python code), which is a better fit for the use case.

Copy link
Member

@iameskild iameskild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making these update @Adam-D-Lewis!

Although it looks like the Kubernetes test didn't run, I can assume folks have this deployed and working on a fresh install. I instead tried deploying these changes to an existing cluster and unfortunately ran into some problems when trying to update the Argo-Workflow CRDs.

[terraform]: │ Error: rendered manifests contain a resource that already exists. Unable to continue with update: CustomResourceDefinition "clusterworkflowtemplates.argoproj.io" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "argo-workflows"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "dev"
[terraform]: │ 
[terraform]: │   with module.argo-workflows[0].helm_release.argo-workflows,
[terraform]: │   on modules/kubernetes/services/argo-workflows/main.tf line 6, in resource "helm_release" "argo-workflows":
[terraform]: │    6: resource "helm_release" "argo-workflows" {
[terraform]: │ 
[terraform]: ╵

At first I thought adding force_update = true would do the trick but upon further investigation, this looks like it is an upstream issue with Helm and updating CRDs in general.

At this point, if we need to get around this we might want to "render the CRDs via the helm_template data-source and apply them using kubernetes_manifest..." (quote source).

@iameskild iameskild added this to the Release 2023.2.1 milestone Feb 9, 2023
@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Feb 10, 2023

Thanks for reviewing @iameskild. So what would you say are next steps needed on this PR?

@iameskild
Copy link
Member

Hey @Adam-D-Lewis I see two potential workarounds:

  1. Add the Argo-Workflow CRDs as kubernetes_manfifest (like we do for traefik)
  • this is likely quite a tedious process
  1. Try adding another TF module that handles creating/managing the CRDs

Perhaps @costrouc has some other ideas for how to get around this.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Feb 11, 2023

I see that helm uses an annotation keep: true to persist resources in the event of an uninstall. I wonder if setting that to false by default would fix the problem since I see it's set to true in the crds in the argo workflows helm chart by default.

Is there a good way to test that quickly, Eskild? I saw you tried updating an existing deployment to test this PR. Is the best way to do that to deploy from develop branch then deploy them update from this PR branch or is there a more automated way to test updates?

Actually nevermind, I read helm's docs and the idea above doesn't seem likely to work.

@pavithraes pavithraes added the status: in progress 🏗 This task is currently being worked on label Feb 14, 2023
@Adam-D-Lewis
Copy link
Member Author

@iameskild I'll work on adding the Argo-Workflow CRDs as kubernetes_manifest since it seems like the most established solution.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Apr 5, 2023

By the way, the upgrade step would look like this for users (except for version 2023.5.1 instead of the version it says here).
image

@Adam-D-Lewis
Copy link
Member Author

I tested the latest version locally. I think users will need to disable argo_workflows, deploy, then enable, and redeploy before this will work correctly.

@pavithraes pavithraes added the impact: high 🟥 This issue affects most of the nebari users or is a critical issue label Apr 24, 2023
@iameskild
Copy link
Member

@Adam-D-Lewis is ready for a final review? I think this PR is also needed before #1622.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Apr 24, 2023

@eskild No, it's not quite ready yet. I want to test if users will need to disable argo_workflows, deploy, then enable, and redeploy before this will work correctly. If so, I might want to look at alternatives. I can do this this week.

I also want to get some feedback on the update prompt from the other devs before I'd say it's ready for review.

There are other things that still need to be done and can be done for #1622, right? This isn't blocking currently, right?

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Apr 24, 2023

@iameskild Yeah, I think this is ready for review now. I tested it again today and it seemed to work fine.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Apr 24, 2023

From @dharhas: "You need a link to docs that explain the whole process including how to set up kubectl."
@viniciusdc is working on PR to explain how to set up kubectl and afterwards I can a link in the text that is displayed during the update step before me merge this PR.

@Adam-D-Lewis
Copy link
Member Author

Adam-D-Lewis commented Apr 25, 2023

I made Dharhas's suggested change. Here's the prompt now:

image

I'm assuming Vini's PR (nebari-dev/nebari-docs#315) gets merged before this.

Copy link
Member

@iameskild iameskild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great @Adam-D-Lewis!

I just tested this myself and things seem to be working as expected. I just left two minor comments but otherwise, I feel like this is ready to be merged.

nebari/upgrade.py Outdated Show resolved Hide resolved
nebari/upgrade.py Outdated Show resolved Hide resolved
Copy link
Member

@iameskild iameskild left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me! As I mentioned yesterday, I tested this on an existing test cluster and everything seemed to work as expected. Thanks @Adam-D-Lewis 🎉

@Adam-D-Lewis
Copy link
Member Author

Added a commit to fix some failing test_upgrade tests

@Adam-D-Lewis Adam-D-Lewis merged commit c2bc1fd into nebari-dev:develop Apr 26, 2023
@pavithraes pavithraes added status: approved 💪🏾 This PR has been reviewed and approved for merge and removed status: in progress 🏗 This task is currently being worked on labels Apr 27, 2023
@pmeier pmeier mentioned this pull request May 2, 2023
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: integration/Argo impact: high 🟥 This issue affects most of the nebari users or is a critical issue needs: tests ✅ This contribution is missing tests status: approved 💪🏾 This PR has been reviewed and approved for merge type: enhancement 💅🏼 New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

8 participants