[Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon #4
Comments
Thank you for your detailed suggestion, I really like this idea.
Thank you for your detailed proposal! That sounds fascinating to me. 👍

Summary

Features 2 and 3 can come after Feature 1.

Comments

Here are some comments on some parts!
I completely agree. We also recognize the importance of developer experience in the local env.
The existing code from
This feature also looks attractive. Considering
Thank you for your issue again!
Challenges posed by Features 2 and 3

One of the reasons Features 2 and 3 are challenging is that kanon works as a wrapper around gokart (luigi), which makes it difficult to use native task features, such as the DAG of tasks that luigi can resolve on its own.

Potential Solution

One possible approach to address this challenge is to implement a patch function that replaces luigi's worker logic with a k8s job runner. However, this approach has the drawback of creating tight coupling between kanon and luigi, which can cause maintenance difficulties in the future. Still, it may be better to avoid re-implementing task features and instead rely on them to implement flags such as `--rerun`.

Controversial Point and Further Discussion

This is a controversial point, and it may be beneficial to discuss it further before proceeding with implementation.
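The patching idea above can be sketched roughly as follows. This is a schematic illustration only: `LocalWorker` is a stand-in for `luigi.worker.Worker`, and `k8s_run_task` / `patch_worker` are hypothetical names, not part of kannon's or luigi's actual API.

```python
# Schematic sketch of "replace luigi's worker logic with a k8s job runner".
# LocalWorker stands in for luigi.worker.Worker; a real patch would target
# that class instead. All names here are hypothetical.

class LocalWorker:
    """Stand-in for luigi.worker.Worker: runs a task in-process."""

    def _run_task(self, task):
        return f"ran {task} locally"


def k8s_run_task(self, task):
    """Replacement logic: submit the task as a k8s Job instead of running it.

    A real implementation would build a Job manifest and call the
    Kubernetes API here.
    """
    return f"submitted {task} as a k8s job"


def patch_worker(worker_cls):
    """Monkeypatch the worker's run method with the k8s job runner."""
    worker_cls._run_task = k8s_run_task


patch_worker(LocalWorker)
print(LocalWorker()._run_task("TaskA"))  # prints: submitted TaskA as a k8s job
```

The upside of this pattern is that all of luigi's DAG resolution stays untouched; the downside, as noted above, is that it couples kannon to a private detail of luigi's worker.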
Yes, it will solve the difficulty. IMO, making a tight coupling between kannon and luigi has pros and cons:

Pros
Cons
@yokomotod @hirosassa @kitagry @Hi-king What do you think?
Feature Request
I would like to be able to seamlessly run jobs on a k8s cluster from my local machine using Kanon.
To achieve this, we need to avoid creating images every time to reflect changes in first-party tasks.
This would require many changes to the Kanon library, so I would like to discuss it.
Motivation
Machine learning jobs require a lot of data and computation, which makes it impractical to run them on a local machine.
However, it is easy to run them on a k8s cluster because of their flexible resources.
As you may know, running jobs on a k8s cluster requires a Docker image.
The current implementation of Kanon, as far as I know, requires an image containing all the tasks to run.
This is inconvenient because it requires building the image repeatedly whenever ML engineers try out their logic.
Proposal
To achieve this, there are several issues that need to be addressed.
1. Send first-party packages to the job container when applying the job.
2. Add CLI options to specify a task as the entry point, like Gokart.
3. Allow for manual rerunning of a task even if it has already succeeded.
1. Send first-party packages to the job container when applying the job
As previously mentioned, we need to avoid building images every time a change is made to the code.
Since Python is an interpreted language, we don't need to compile code when building a Docker image.
This means that the image only requires the Python runtime and third-party dependencies.
Therefore, we can send the first-party packages to the job container when applying the job.
2. Add CLI options to specify a task as the entry point
Currently, Kanon requires specifying a root task in Python code to resolve the order of tasks to run.
While this is sufficient for production or integration environments, it's not enough for local development.
When ML engineers are testing their logic, they may want to run a single task.
In this case, we need to specify a task as the entry point.
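One way to specify a task as the entry point is to resolve a dotted path from the command line to a task class, similar to how luigi looks up task names. The helper below is a hypothetical sketch, not kannon's actual API; it is demonstrated with a stdlib class rather than a real gokart task.

```python
import importlib


def resolve_task(dotted_path: str):
    """Resolve a CLI string like 'my_project.tasks.TrainModel' to a class.

    Hypothetical helper: kannon could use something like this to let
    engineers name a single entry-point task on the command line.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstration with a stdlib class instead of a real gokart task:
cls = resolve_task("collections.OrderedDict")
print(cls.__name__)  # prints: OrderedDict
```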
3. Allow for manual rerunning of a task even if it has already succeeded
When ML engineers are testing their logic, they may want to rerun a task even if it has already succeeded, in order to check the results of their modifications.
As you may know, gokart caches the output of a task as a pickle file, and won't rerun a task if the pickle file exists and no parameters have changed.
Therefore, we need an option that allows forcing the rerunning of a task, such as the `--rerun` option in gokart.

Draft Design
Here is a draft of the design I came up with. I'm not sure if it is the best design, but I'd like to discuss it.
Send first-party packages to the job container by tarball (for Proposal 1.)
We can send first-party packages to the job container by tarball.
I think the implementation of skaffold to run kaniko is a good reference.
Skaffold is a tool to create a pipeline to deploy applications to a k8s cluster.
It has the option to use kaniko to build a docker image on a k8s cluster instead of locally.
Since the kaniko image does not contain dependencies (build context) to build applications, Skaffold needs to send them to a kaniko pod.
The following code is used to send the build context to a kaniko pod by tarball.
https://github.com/GoogleContainerTools/skaffold/blob/908c36a893faa3729d121e273855a3749f2335b5/pkg/skaffold/build/cluster/kaniko.go#L146-L179
We can use the same way to send first-party packages to the job container by tarball.
Of course, we must add some options to Kanon to specify the location of the first-party folders, and extract them into PYTHONPATH when the job container starts.
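The tarball step could be built with the standard library alone; the sketch below mirrors skaffold's kaniko build-context upload. `make_context_tarball` is a hypothetical name, and the `kubectl exec` streaming command in the comment is an untested placeholder, not kannon's actual mechanism.

```python
import io
import tarfile
from pathlib import Path


def make_context_tarball(package_dirs) -> bytes:
    """Pack first-party package directories into an in-memory .tar.gz.

    Mirrors skaffold's kaniko build-context upload: the archive is later
    streamed into the job container and extracted onto PYTHONPATH.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for d in package_dirs:
            d = Path(d)
            tar.add(d, arcname=d.name)  # keep the top-level package name
    return buf.getvalue()


# Streaming into a running pod could then reuse the same `tar x` trick
# skaffold uses, e.g. (pod name and target dir are placeholders):
#   kubectl exec -i <job-pod> -- tar xzf - -C /opt/first_party
```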
Expose CLI options of gokart (for Proposals 2 and 3)
I'm not sure if it can be implemented, so this is an idea rather than a design.
Since gokart has a lot of useful options for development purposes, like `--rerun` and `--modification_time_check`, it's good to depend on them. Please consider introducing a CLI like the following:
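As one possible shape for such a CLI (hypothetical, not kannon's actual interface), the entry point could take a task path plus gokart-style development flags and forward anything unrecognized to gokart/luigi unchanged:

```python
import argparse


def parse_cli(argv):
    """Hypothetical kannon CLI: pick an entry-point task and pass
    development flags (mirroring gokart's --rerun etc.) through.

    Example invocation this sketch assumes:
        kannon my_project.tasks.TrainModel --rerun
    """
    parser = argparse.ArgumentParser(prog="kannon")
    parser.add_argument("task", help="dotted path of the entry-point task")
    parser.add_argument("--rerun", action="store_true",
                        help="force a rerun even if the cached output exists")
    parser.add_argument("--modification-time-check", action="store_true",
                        help="rerun when inputs are newer than outputs")
    # Unrecognized flags are collected and could be forwarded to gokart.
    args, passthrough = parser.parse_known_args(argv)
    return args, passthrough


args, rest = parse_cli(["my_project.tasks.TrainModel", "--rerun"])
print(args.task, args.rerun)  # prints: my_project.tasks.TrainModel True
```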
Thanks for reading this long proposal and great library! 🙏
I'm looking forward to your feedback.