
feat(lang): Support data and code integration in envd-server runner #530

Open
Tracked by #179
aseaday opened this issue Jun 30, 2022 · 23 comments

@aseaday (Member) commented Jun 30, 2022

Description

I studied the design document and #303. In a team environment, models, experiment logs, and datasets are without doubt a necessary part of setting up a working environment. There are some tools and projects aimed at solving this, such as DVC.

This issue is for discussion about:

  • Should we support storage for data sharing?
  • What is the best way to achieve it?

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@gaocegege (Member)

Thanks for raising the discussion! There are some options for us:

  • Integrate with DVC. But DVC is not a de-facto standard now, and it also has some problems.
  • Design primitives in build.envd to declare the expected locations of datasets/models/logs:

def build():
    config.model(location="xx", from="https://xx.savedmodel")

@gaocegege changed the title from "discuss: storage sharing support in envd" to "feat(lang): Support data and code integration in envd-server runner" on Sep 30, 2022
@gaocegege (Member)

Possible design:

runtime.docker.mount and other docker-specific funcs will be ignored in a Kubernetes context. runtime.kubernetes.mount() should create/use the corresponding PV/PVC in the backend pod.

runtime.kubernetes.mount("")
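
For concreteness, a minimal sketch of what this could look like in a build.envd; the pvc= and dst= parameters are assumptions for illustration, not a finalized signature:

def build():
    base(os="ubuntu20.04", language="python3")
    # Hypothetical signature: reuse an existing PVC named "imagenet" and
    # mount it at /data in the backend pod; the docker runner would ignore it.
    runtime.kubernetes.mount(pvc="imagenet", dst="/data")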

@VoVAllen (Member) commented Oct 7, 2022

Write a proposal this week

@VoVAllen self-assigned this on Oct 9, 2022
@VoVAllen (Member) commented Oct 9, 2022

Mount Code Proposal

There will be two ways to mount the code: directly cloning a git repo, or using Syncthing to sync a folder inside the container with the user's local directory.

Reference: Syncthing in Okteto

Possible syntax design:

# Can declare globally
info.git_repo = "https://github.com/tensorchord/envd"


def build():
    ...
    

def build_k8s():
    # Default entrypoint for envd-server + k8s; if not found, the build function is used instead

    # Will create a PVC and clone the git repo into it
    runtime.kubernetes.mount_mode("git_repo")
    # Or:
    # Will use Syncthing to sync the folder inside the container with the user's local folder
    runtime.kubernetes.mount_mode("Syncthing")
    ...

envd up under the envd-server context will use build_k8s as the default function; if it is not found, it will fall back to the build function.

Problem:

  • Is envd-server strongly bound to k8s? How much k8s detail do we want to expose here? Do we need an extra abstraction at the envd-server level?

@gaocegege (Member) commented Oct 9, 2022

Thanks for the proposal.

Then do users need to write two build funcs, one for docker and one for k8s?

And how do we deal with the data part?

@VoVAllen
Copy link
Member

VoVAllen commented Oct 9, 2022

The original build func should work in most cases, but the pod on k8s may not be provisioned exactly as the user wants. Since build funcs won't contain any k8s-specific logic, we can only have one default setting on k8s; users may need to tune the details.

@gaocegege (Member) commented Oct 10, 2022

How about using an if statement in this case:

if platform == "k8s":
    runtime.kubernetes.code(src="", dst="")
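
A slightly fuller sketch of that idea; the platform global and the docker-side fallback call are assumptions for illustration:

def build():
    base(os="ubuntu20.04", language="python3")
    if platform == "k8s":
        # k8s context: let the kubernetes runtime provision the code mount
        runtime.kubernetes.code(src="", dst="")
    else:
        # hypothetical docker fallback: bind-mount a host directory
        runtime.docker.mount(src="~/code", dst="/code")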

@VoVAllen (Member)

It's viable. I think the core difference here is that the build_k8s design implies the build function is the primary way of using envd in a local docker environment. Therefore any change needed to run on k8s won't touch anything that already exists; users just write a new function.

@VoVAllen (Member)

For data, ideally, the user first needs to register the dataset details with the envd-server, and then declare io.mount(src=dataset("imagenet"), dst="/data") in the build.envd. However, this is a bit complex; an easier way is to just expose the PVC details, such as io.mount(src=k8s.pvc("imagenet"), dst="/data").
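
In build.envd, the two forms side by side (both calls are proposals from this comment, not an existing API):

def build():
    base(os="ubuntu20.04", language="python3")
    # Registered-dataset form: envd-server resolves "imagenet" to actual storage.
    io.mount(src=dataset("imagenet"), dst="/data")
    # Simpler form: reference the PVC directly.
    io.mount(src=k8s.pvc("imagenet"), dst="/data")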

@gaocegege (Member)

@kemingy @aseaday @zwpaper Do you have any questions about it?

@VoVAllen (Member)

For the initial version we can start with info.git_repo = "https://github.com/tensorchord/envd", and clone it into the k8s pod. But Syncthing will also be implemented later.

@kemingy (Member) commented Oct 12, 2022

I'm not sure why we need to declare a global info.git_repo object and then use a string like "git_repo" to refer to it.

For code sync, should we choose git clone or HTTP?

How about using an if statement in this case:

if platform == "k8s":
    runtime.kubernetes.code(src="", dst="")

This can save some boilerplate code. I'm wondering if we can hide the context keywords like docker or k8s. We can detect the context and choose the right implementation. If it's not implemented for such a context, just panic.

The difference between contexts can be handled by some configurations. For example:

if envd.context.type == "kubernetes":
    envd.mount_host = pvc("user-data")

So the user can run the code in different contexts.

If we introduce build_k8s, that means we will need more default build functions for different types of contexts in the future.

@aseaday (Member, Author) commented Oct 12, 2022

I am not very sure a conditional statement is a good choice. We should still insist on declaration over implementation. Or at least we'd better use an override class or something like polymorphism to solve the problem.

@zwpaper (Member) commented Oct 12, 2022

I don't think separating the docker and k8s builds is a good idea.

Let me try to explain from a user perspective.

Assuming I am a researcher, I do not want to copy and edit my build() into a build_k8s() only because I am switching my env from docker to k8s.

As we have context, we know which target env users are building for, and PVC is the de-facto standard storage solution in k8s (even for host path, we should use something like a local provisioner to implement the PV), so maybe we can set it by default and offer an option to change it.

The philosophy behind envd is that we encapsulate the complexity and make things easier.

So personally, I would vote for a solution that simplifies things, like:

  1. we keep all build scripts only in build()
  2. we detect and switch between docker and k8s automatically
  3. we use some path, ~/.env/data//code for example, for docker, and PVC for k8s by default
  4. we offer options to change the docker mount point or the storage type, storage parameters, etc.
  5. the default case should satisfy most users; researchers do not, and should not have to, care about mount points or PVC/PV

for example:

def build():
    runtime.code(src="", dst="")

# or, with explicit options:
def build():
    runtime.code(src="git@github.com:envd.git", dst="/code",
                 k8s_volume="host_path",
                 docker_volume="cache")

@gaocegege (Member)


Makes sense. It is promising.

@gaocegege (Member)

Any opinions about it? @VoVAllen

@aseaday (Member, Author) commented Oct 13, 2022


We'd better provide a mount config base class to provide polymorphic configuration, such as:

# mount_config.py
class MyMount:
    def kubernetes(self):
        self.k8s.runtime.use("pvc")
        code_src = self.code.src
        ...

    def host(self):
        ...

We could give default logic for how to mount the code and the data, but it'd be better as a mount configuration class for the different cases; too many options in the function would make it difficult to maintain.

runtime.code(src="git@github.com:envd.git", dst="/code", configuration=MyMount)

@VoVAllen (Member)

Thanks @aseaday and @zwpaper. My core consideration here is also to make the build function unified for both k8s and docker. The key design goal here is to support ad hoc configuration on k8s. I think @aseaday's suggestion of using separate configurations to declare the implementation is a good design. Starlark doesn't support classes, but we can use a dict here directly for simplicity.

How about

runtime.code(src="git@github.com:envd.git", dst="/code", configuration={"k8s": {"pvc": "code"}})
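
A minimal sketch (ordinary Python, hypothetical shapes) of how a runner could read such a dict; none of these helper names exist yet:

# Hypothetical helper: pick the mount implementation for the current context.
def resolve_code_mount(configuration, context):
    # configuration, e.g. {"k8s": {"pvc": "code"}}; context, e.g. "k8s" or "docker"
    ctx_conf = configuration.get(context, {})
    if context == "k8s" and "pvc" in ctx_conf:
        return {"kind": "pvc", "name": ctx_conf["pvc"]}
    # Fall back to the default behaviour (e.g. git clone or bind mount).
    return {"kind": "default"}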

@VoVAllen (Member) commented Oct 20, 2022

After team discussion:

  • We think it's better to use the same build function and provide a way for users to identify the current context
  • Use a function such as config.info(repo="https://github.com/tensorchord/envd") to declare the git repo information. Users can declare runtime.mount(src=k8s.pvc('code'), dst=project_root()) for advanced settings; see the sketch below
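
Putting the agreed pieces together, a minimal build.envd sketch (the function names follow this comment and may still change):

config.info(repo="https://github.com/tensorchord/envd")

def build():
    base(os="ubuntu20.04", language="python3")
    # Advanced setting: pin the code mount to a specific PVC on k8s.
    runtime.mount(src=k8s.pvc("code"), dst=project_root())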

Related issue: #1054

@VoVAllen (Member) commented Nov 9, 2022

Generally, I think PVC is a good abstraction that fits all data sources (NFS, OSS, Lustre, and so on) in enterprise-level usage. Therefore what we need to do here is just decide which PVC gets mounted into the pod on k8s.

  • PVC as dataset
    - Users will specify runtime.mount(src=data.dataset('daily_sales'), dst="/data"). This will be stored in an image label and won't affect the image build process. When launching the environment, the backend can decide how to mount the dataset.
    - When launching the environment on envd-server, it will send a request to /k8s/data with {"user": "allen", "name": "daily_sales", "project_name": "sales_prediction"}, get a response with the PVC information, and insert it into the pod YAML (see the sketch after this list).
    - We can also provide mechanisms to sync datasets between different backends if needed.
  • HTTP dataset
    - Users can also specify data.dataset("http://xxx/mnist.tar.gz"), and we can download it when launching the pod.
  • SQL access (advanced usage)
    - If users want to query information from a database, we can also inject the related credential environment variables if needed.
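
A minimal Python sketch of the dataset-resolution call described above. The /k8s/data path, the request fields, and the response shape are assumptions taken from this comment, not a finalized envd-server API:

import json
import urllib.request

# Hypothetical client-side call: ask envd-server which PVC backs a dataset.
def resolve_dataset_pvc(server, user, name, project_name):
    payload = json.dumps({
        "user": user,
        "name": name,
        "project_name": project_name,
    }).encode("utf-8")
    req = urllib.request.Request(
        server + "/k8s/data",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Assumed response shape, e.g. {"pvc": "daily-sales", "mount_path": "/data"}
        return json.load(resp)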

@kemingy (Member) commented Nov 9, 2022

What's the difference between data.envd and data.dataset? Does data.dataset work without envd-server?

@VoVAllen (Member) commented Nov 9, 2022

@kemingy It's the same; I just randomly picked a name. data.dataset will only add a label to the image, and the runner will decide how to handle it. Therefore the local_docker runner can also implement this, but there's no guarantee that the datasets at different backends are identical.

@VoVAllen (Member)

Previous discussion: #650
