[Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon #4

hiro-o918 · 2023-03-15T16:28:40Z

Feature Request

I would like to be able to seamlessly run jobs on a k8s cluster from my local machine using Kanon.
To achieve this, we need to avoid creating images every time to reflect changes in first-party tasks.

This would require many changes to the Kanon library, so I would like to discuss it.

Motivation

Machine learning jobs require a lot of data and computation, which makes it impractical to run them on a local machine.
However, it is easy to run them on a k8s cluster because of its flexible resources.

As you may know, running jobs on a k8s cluster requires a Docker image.
The current implementation of Kanon, as far as I know, requires an image containing all the tasks to run.
This is inconvenient because it requires rebuilding the image repeatedly while ML engineers try out their logic.

Proposal

To achieve this, there are several issues that need to be addressed.

Send first-party packages to the job container when applying the job.
Add CLI options to specify a task as the entry point, like Gokart.
Allow for manual rerunning of a task even if it has already succeeded.

1. Send first-party packages to the job container when applying the job

As previously mentioned, we need to avoid building images every time a change is made to the code.
Since Python is an interpreted language, we don't need to compile code when building a Docker image.
This means that the image only requires the Python runtime and third-party dependencies.
Therefore, we can send the first-party packages to the job container when applying the job.

2. Add CLI options to specify a task as the entry point

Currently, Kanon requires specifying a root task in Python code to resolve the order of tasks to run.
While this is sufficient for production or integration environments, it's not enough for local development.
When ML engineers are testing their logic, they may want to run a single task.
In this case, we need to specify a task as the entry point.

3. Manually rerun a task even if it succeeded

When ML engineers are testing their logic, they may want to rerun a task even if it has already succeeded, in order to check the results of their modifications.
As you may know, gokart caches the output of a task as a pickle file, and won't rerun a task if the pickle file exists and no parameters have changed.
Therefore, we need an option that allows for forcing the rerunning of a task, such as the --rerun option in gokart.
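
To make the caching behavior concrete, here is a minimal sketch of "skip if the pickle exists, unless a rerun is forced". It is a simplified model, not gokart's actual implementation: the function name `run_with_cache` and its `rerun` argument are hypothetical, and parameter-change detection is left out.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def run_with_cache(compute: Callable[[], Any], cache_path: Path, rerun: bool = False) -> Any:
    """Return the cached result unless it is missing or a rerun is forced."""
    if cache_path.exists() and not rerun:
        # Cache hit: load the pickled output instead of recomputing.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss or forced rerun: compute and overwrite the cache.
    result = compute()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With `rerun=False` a second call returns the pickled value without calling `compute` again; with `rerun=True` the task body runs and the cache is replaced, which is the behavior wanted while iterating on a task.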

Draft Design

Here is a draft of the design I came up with. I'm not sure if it is the best design, but I'd like to discuss it.

Send first-party packages to the job container by tarball (for Proposal 1.)

We can send first-party packages to the job container by tarball.
I think the implementation of skaffold to run kaniko is a good reference.
Skaffold is a tool to create a pipeline to deploy applications to a k8s cluster.
It has the option to use kaniko to build a docker image on a k8s cluster instead of locally.

Since the kaniko image does not contain dependencies (build context) to build applications, Skaffold needs to send them to a kaniko pod.

The following code is used to send the build context to a kaniko pod by tarball.
https://github.com/GoogleContainerTools/skaffold/blob/908c36a893faa3729d121e273855a3749f2335b5/pkg/skaffold/build/cluster/kaniko.go#L146-L179

We can use the same approach to send first-party packages to the job container by tarball.
Of course, we must add some options to Kanon to specify the location of the first-party folders, and extract them into a directory on PYTHONPATH when the job container starts.
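
As a rough illustration of the pack and extract steps, here is a sketch using only the Python standard library. The function names `pack_packages` and `unpack_packages` are hypothetical, and how the bytes reach the container (for example over the stdin of an exec session, as Skaffold does for kaniko) is deliberately left out.

```python
import io
import sys
import tarfile
from pathlib import Path


def pack_packages(package_dirs: list[str]) -> bytes:
    """Pack local first-party package directories into an in-memory gzipped tarball."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for package_dir in package_dirs:
            path = Path(package_dir)
            # Store each package under its own top-level name, e.g. "my_tasks/".
            tar.add(path, arcname=path.name)
    return buffer.getvalue()


def unpack_packages(payload: bytes, target_dir: str) -> None:
    """Extract the tarball inside the job container and make it importable."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as tar:
        tar.extractall(target_dir)
    # Same effect as putting target_dir on PYTHONPATH for this process.
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
```

The local side would call `pack_packages` on the configured first-party folders, and the container entry point would call `unpack_packages` before importing any task module.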

Expose CLI options of gokart (for Proposal 2., 3.)

I'm not sure if it can be implemented, so this is an idea rather than a design.
Since gokart has a lot of useful options for development purposes, like --rerun and --modification-time-check, it would be good to rely on it.

Please consider introducing CLI like the following:

# Options before `--` are for kanon.
# `--` separates kanon options from gokart options.
# Options after `--` are for gokart; --task-a-param is a parameter of TaskA.
kanon run \
    --namespace default --task TaskA \
    -- \
    --rerun --modification-time-check \
    --task-a-param 1
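
For reference, splitting the arguments at `--` is straightforward with argparse. This is only a sketch of the idea: `parse_cli` is a hypothetical name, and the `--namespace` and `--task` options are taken from the example command above rather than from any existing Kanon CLI.

```python
import argparse


def parse_cli(argv: list[str]) -> tuple[argparse.Namespace, list[str]]:
    """Split argv at the first `--`; parse the left side, pass the right side through."""
    if "--" in argv:
        index = argv.index("--")
        own_args, passthrough = argv[:index], argv[index + 1:]
    else:
        own_args, passthrough = argv, []

    # Hypothetical kanon-side options, mirroring the proposed command line.
    parser = argparse.ArgumentParser(prog="kanon run")
    parser.add_argument("--namespace", default="default")
    parser.add_argument("--task", required=True)
    return parser.parse_args(own_args), passthrough
```

The second return value would be handed to gokart unchanged, so new gokart options would not need matching changes on the Kanon side.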

Thanks for reading this long proposal and great library! 🙏
I'm looking forward to your feedback.

yokomotod · 2023-03-16T03:05:54Z

Thank you for the detailed suggestion. I really like this idea.
I am also very interested in improving the local development experience, not only production operations.

maronuu · 2023-03-16T16:06:59Z

Thank you for your detailed proposal! That sounds fascinating to me. 👍

Summary

Features 2 and 3 can come after Feature 1.
Feature 1 seems feasible and simple to implement.
Features 2 and 3 are not yet known to be feasible. A more detailed discussion might be needed.

Comments

Here are comments on some parts!

> This is inconvenient because it requires rebuilding the image repeatedly while ML engineers try out their logic.

I completely agree. We also recognize the importance of developer experience in the local env.

> Send first-party packages to the job container by tarball (for Proposal 1.)

The existing code from Skaffold is very helpful, thanks!
Using that approach, it seems feasible to separate the Python runtime from the code that depends on user-defined tasks.

> Expose CLI options of gokart (for Proposal 2., 3.)

This feature also looks attractive. Since --rerun, --modification-time-check, and other useful options are already supported in gokart, it would be valuable for kannon to support them.

The --rerun option seems relatively difficult for the following reasons.

1. Propagating the "should rerun or not" flag on the DAG.
   kannon resolves task dependencies using a newly added implementation that does not exist in gokart. The "should rerun or not" boolean flag must be propagated through the task DAG. It seems doable, but it is not a quick task.
2. The master job needs to tell the child jobs "Tasks A, B, and C are to be rerun, the others are cached."
   The logic between the master and child jobs is static, i.e., implemented in the library. In contrast, which tasks should be rerun is determined dynamically. This is a difficult point to solve.
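
To illustrate the propagation part, here is a small sketch over a plain dependency dictionary. It is not kannon's resolver: `tasks_to_rerun` is a hypothetical name, the DAG is a dict mapping each task to the tasks it requires, and it assumes that rerunning a task also invalidates every task downstream of it.

```python
from collections import deque


def tasks_to_rerun(requires: dict[str, list[str]], forced: set[str]) -> set[str]:
    """Return the forced tasks plus every task that depends on them, directly or not."""
    # Invert "task -> upstream requirements" into "task -> downstream dependents".
    dependents: dict[str, list[str]] = {task: [] for task in requires}
    for task, upstream_tasks in requires.items():
        for upstream in upstream_tasks:
            dependents.setdefault(upstream, []).append(task)

    # Breadth-first walk from the forced tasks toward the root of the DAG.
    rerun = set(forced)
    queue = deque(forced)
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in rerun:
                rerun.add(downstream)
                queue.append(downstream)
    return rerun
```

The master job could compute this set once and pass it to the child jobs, for example as an extra argument, which would keep the master/child logic static while the set itself stays dynamic.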

--modification-time-check seems to be in a similar situation to --rerun. Both seem feasible, but I do not have a detailed implementation idea yet.

Thank you again for your issue!

hiro-o918 · 2023-03-17T01:56:31Z

Challenges posed by Features 2 and 3

One reason Features 2 and 3 are challenging is that kanon works as a wrapper around gokart (luigi). This makes it difficult to use native task features, such as the task DAG that luigi can resolve on its own.

Potential Solution

One possible approach to address this challenge is to implement a patch function that replaces Luigi's worker logic with a k8s job runner. However, this approach has the drawback of creating tight coupling between kanon and luigi, which can result in maintenance difficulties in the future.

Still, it may be better to avoid re-implementing these task features and instead rely on the existing ones to implement flags such as --rerun and --modification-time-check.

Controversial Point and Further Discussion

This is a controversial point, and it may be beneficial to discuss it further before proceeding with implementation.

maronuu · 2023-03-19T06:43:50Z

> One possible approach to address this challenge is to implement a patch function that replaces Luigi's worker logic with a k8s job runner.

Yes, it will solve the difficulty.

IMO, making a tight coupling between kannon and luigi has pros and cons:

Pros

Features 2 and 3 can be implemented in the way you proposed.
Since luigi has a larger community than gokart, kannon can also have an impact on the Luigi community beyond the gokart community.

Cons

Maintainability (as you point out)
I am not familiar with the luigi implementation at this stage, so I do not know how hard it would be to replace worker logic.

@yokomotod @hirosassa @kitagry @Hi-king What do you think?

maronuu added the enhancement New feature or request label Mar 16, 2023

maronuu changed the title ~~Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon~~ [Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon Mar 19, 2023
