[Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon #4

Open
hiro-o918 opened this issue Mar 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

hiro-o918 commented Mar 15, 2023

Feature Request

I would like to be able to seamlessly run jobs on a k8s cluster from my local machine using Kanon.
To achieve this, we need to avoid creating images every time to reflect changes in first-party tasks.

This would require many changes to the Kanon library, so I would like to discuss it.

Motivation

Machine learning jobs require a lot of data and computation, which makes it impractical to run them on a local machine.
However, it is easy to run them on a k8s cluster because of its flexible resources.

As you may know, running jobs on a k8s cluster requires a Docker image.
The current implementation of Kanon, as far as I know, requires an image containing all the tasks to run.
This is inconvenient because it requires rebuilding the image every time ML engineers try out their logic.

Proposal

To achieve this, there are several issues that need to be addressed.

Send first-party packages to the job container when applying the job.
Add CLI options to specify a task as the entry point, like gokart.
Allow for manual rerunning of a task even if it has already succeeded.

1. Send first-party packages to the job container when applying the job

As previously mentioned, we need to avoid building images every time a change is made to the code.
Since Python is an interpreted language, we don't need to compile code when building a Docker image.
This means that the image only requires the Python runtime and third-party dependencies.
Therefore, we can send the first-party packages to the job container when applying the job.

2. Add CLI options to specify a task as the entry point, like gokart

Currently, Kanon requires specifying a root task in Python code to resolve the order of tasks to run.
While this is sufficient for production or integration environments, it's not enough for local development.
When ML engineers are testing their logic, they may want to run a single task.
In this case, we need to specify a task as the entry point.

3. Manually rerun a task even if it succeeded

When ML engineers are testing their logic, they may want to rerun a task even if it has already succeeded, in order to check the results of their modifications.
As you may know, gokart caches the output of a task as a pickle file, and won't rerun a task if the pickle file exists and no parameters have changed.
Therefore, we need an option that allows for forcing the rerunning of a task, such as the --rerun option in gokart.
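
The caching behavior described above can be illustrated with a small, self-contained sketch. It does not use gokart itself: `run_task`, the cache layout, and the `rerun` argument are hypothetical names chosen for illustration, and real gokart also takes task parameters into account when deciding whether a cached output is valid.

```python
import pickle
from pathlib import Path


def run_task(name, compute, cache_dir, rerun=False):
    """Return the cached result of `name`, or compute and cache it.

    Hypothetical helper: `rerun=True` ignores an existing cache file,
    which is the behavior a forced-rerun option would need.
    """
    cache_file = Path(cache_dir) / f"{name}.pkl"
    if cache_file.exists() and not rerun:
        # Cache hit: skip the computation entirely.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Without `rerun=True`, a second call with modified task logic would still return the stale pickle, which is exactly the problem during local development.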

Draft Design

Here is a draft of the design I came up with. I'm not sure if it is the best design, but I'd like to discuss it.

Send first-party packages to the job container by tarball (for Proposal 1.)

We can send first-party packages to the job container by tarball.
I think the implementation of skaffold to run kaniko is a good reference.
Skaffold is a tool to create a pipeline to deploy applications to a k8s cluster.
It has the option to use kaniko to build a docker image on a k8s cluster instead of locally.

Since the kaniko image does not contain dependencies (build context) to build applications, Skaffold needs to send them to a kaniko pod.

The following code is used to send the build context to a kaniko pod by tarball.
https://github.com/GoogleContainerTools/skaffold/blob/908c36a893faa3729d121e273855a3749f2335b5/pkg/skaffold/build/cluster/kaniko.go#L146-L179

We can use the same approach to send first-party packages to the job container by tarball.
Of course, we must add some options to Kanon to specify the location of the first-party folders, and extract them into a directory on PYTHONPATH when the job container starts.
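
As a rough sketch of the packing and unpacking steps only, using Python's standard `tarfile` module: the function names `pack_packages` and `unpack_packages` are hypothetical, and how the bytes actually reach the job container (for example, streaming them to the pod as Skaffold does for kaniko) is left out.

```python
import io
import sys
import tarfile
from pathlib import Path


def pack_packages(package_dirs):
    """Pack first-party package directories into one gzipped tarball (bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for pkg in map(Path, package_dirs):
            # Store each package under its own name so that `import <name>`
            # works once the extraction directory is on sys.path.
            tar.add(str(pkg), arcname=pkg.name)
    return buf.getvalue()


def unpack_packages(payload, dest):
    """Extract the tarball into `dest` and make its contents importable."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(str(dest), **kwargs)
    # Equivalent to having `dest` on PYTHONPATH for this process.
    if str(dest) not in sys.path:
        sys.path.insert(0, str(dest))
```

On the local machine one would call `pack_packages` on the configured first-party folders; inside the job container, `unpack_packages` would run before the task modules are imported.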

Expose CLI options of gokart (for Proposal 2., 3.)

I'm not sure if it can be implemented, so this is an idea rather than a design.
Since gokart has a lot of useful options for development purposes, like --rerun and --modification-time-check, it makes sense to rely on it.

Please consider introducing CLI like the following:

# Options before `--` are for kanon; `--` separates kanon and gokart options.
# Options after `--` are passed to gokart.
# --task-a-param is a parameter of TaskA.
kanon run \
    --namespace default --task TaskA \
    -- \
    --rerun --modification-time-check \
    --task-a-param 1
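
One way the `--` separator could be handled, sketched with the standard `argparse` module. This is only an illustration under assumptions: the `--namespace` and `--task` options come from the example above, while `split_args` and `parse` are hypothetical names and not part of any existing CLI.

```python
import argparse


def split_args(argv):
    """Split argv at the first `--` into (own_args, passthrough_args)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return list(argv), []


def parse(argv):
    """Parse the wrapper's own options and return the rest untouched."""
    own_argv, passthrough = split_args(argv)
    parser = argparse.ArgumentParser(prog="kanon run")
    parser.add_argument("--namespace", default="default")
    parser.add_argument("--task", required=True)
    # Everything after `--` is handed to the underlying tool unparsed.
    return parser.parse_args(own_argv), passthrough
```

Keeping the passthrough arguments unparsed means the wrapper would not need to know about every gokart option or task parameter in advance.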

Thanks for reading this long proposal and great library! 🙏
I'm looking forward to your feedback.

@yokomotod
Collaborator

Thank you for your detailed suggestion. I really like this idea.
I, too, am very interested in improving the experience during local development, not only in production operations.

@maronuu
Collaborator

maronuu commented Mar 16, 2023

Thank you for your detailed proposal! That sounds fascinating to me. 👍

Summary

Features 2 and 3 can come after Feature 1.
Feature 1 seems feasible and simple to implement.
It is not yet known whether Features 2 and 3 are feasible. A more detailed discussion might be needed.

Comments

Here are some comments for some parts!

This is inconvenient because it requires rebuilding the image every time ML engineers try out their logic.

I completely agree. We also recognize the importance of developer experience in the local env.

Send first-party packages to the job container by tarball (for Proposal 1.)

The existing code from Skaffold is very helpful, thanks!
It seems feasible to separate the Python runtime from the code that depends on the user-defined tasks in that way.

Expose CLI options of gokart (for Proposal 2., 3.)

This feature also looks attractive. Since --rerun, --modification-time-check, and other useful options are already supported in gokart, it would be very valuable for kannon to support them.

The --rerun option seems relatively difficult for the following reasons.

  1. Propagating the "should rerun or not" flag through the DAG.
    kannon resolves task dependencies with its own, newly added implementation that does not exist in gokart. The "should rerun or not" boolean flag therefore has to be propagated through the task DAG. This seems doable, but it is not a quick task.

  2. The master job needs to tell the child jobs "Tasks A, B, and C are to be rerun; the others are cached."
    The logic between the master and child jobs is static, i.e., implemented in the library. In contrast, which tasks are to be rerun is determined dynamically. This is a difficult point to solve.
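
To make the first point concrete, here is a minimal sketch of one possible reading of "propagating the flag": a task must rerun if it is forced, or if any task it requires must rerun. That propagation direction is an assumption, as are the names `tasks_to_rerun` and `dependencies`; this is not kannon's or gokart's actual logic, and the graph is assumed to be acyclic.

```python
def tasks_to_rerun(dependencies, forced):
    """Return every task that must rerun when the tasks in `forced` rerun.

    `dependencies` maps a task name to the names of the tasks it requires.
    Hypothetical rule: a task reruns if it is forced or if any upstream
    task it requires reruns.
    """
    memo = {}

    def must_rerun(task):
        if task not in memo:
            # Evaluate all upstream tasks first, then combine the results.
            upstream = [must_rerun(dep) for dep in dependencies.get(task, ())]
            memo[task] = task in forced or any(upstream)
        return memo[task]

    all_tasks = set(dependencies)
    for deps in dependencies.values():
        all_tasks.update(deps)
    return {task for task in all_tasks if must_rerun(task)}
```

The resulting set is the kind of information the master job would then have to pass to the child jobs, which is the second difficulty listed above.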

--modification-time-check seems to be in a similar situation to --rerun. Both seem feasible, but I have no detailed idea of how to implement them yet.

Thank you for your issue again!

@maronuu maronuu added the enhancement New feature or request label Mar 16, 2023
@hiro-o918
Author

hiro-o918 commented Mar 17, 2023

Challenges posed by Features 2 and 3

One reason Features 2 and 3 are challenging is that kanon works as a wrapper around gokart (luigi). This makes it difficult to use native task features, such as the task DAG that luigi can resolve on its own.

Potential Solution

One possible approach to address this challenge is to implement a patch function that replaces Luigi's worker logic with a k8s job runner. However, this approach has the drawback of tightly coupling kanon to luigi, which could make maintenance difficult in the future.

Still, it may be better to avoid re-implementing task features and instead rely on them to implement flags such as --rerun and --modification-time-check.

Controversial Point and Further Discussion

This is a controversial point, and it may be beneficial to discuss it further before proceeding with implementation.

@maronuu
Collaborator

maronuu commented Mar 19, 2023

One possible approach to address this challenge is to implement a patch function that replaces Luigi's worker logic with a k8s job runner.

Yes, it will solve the difficulty.

IMO, making a tight coupling between kannon and luigi has pros and cons:

Pros

  • Features 2 and 3 can be implemented in the way you proposed.
  • Since luigi has a larger community than gokart, kannon can also have an impact on the Luigi community beyond the gokart community.

Cons

  • Maintainability (as you point out)
  • I am not familiar with the luigi implementation at this stage, so I do not know how hard it would be to replace worker logic.

@yokomotod @hirosassa @kitagry @Hi-king What do you think?

@maronuu maronuu changed the title Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon [Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon Mar 19, 2023