feat: add proposal for data support #650
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,82 @@ | ||
# Data function support | ||
|
||
## Summary | ||
|
||
Provide data mounting support for envd. | ||
|
||
|
||
## Goals | ||
|
||
Design a *unified, declarative* interface and underlying architecture that provides datasets to the development environment in a *scalable way*. | ||
|
||
|
||
Non-goals: | ||
- Support Git-like version control for data | ||
|
||
## Common Scenarios | ||
|
||
### Possible sources | ||
- Local files | ||
- Object storage (AWS S3) | ||
- NFS-like system (AWS EFS, AWS FSx for OpenZFS) | ||
- Block storage (Ceph) | ||
- HDFS | ||
- Lustre | ||
- API endpoint (HTTP path) | ||
- SQL results | ||
- Other distributed file systems (Alluxio, JuiceFS) | ||
- Python SDK | ||
|
||
### Possible forms | ||
- Images | ||
- Text | ||
- Embedding binaries | ||
- CSV | ||
|
||
### Access Pattern | ||
|
||
Most datasets are written once and then read many times, often concurrently. Therefore | ||
Review comment: Missing some additional context after "therefore"? |
||
|
||
### Possible versions/tags | ||
- Version by number: V1, V2, V3 | ||
Review comment: How about semantic versioning? |
||
- Version by scale: sample dataset vs. full dataset | ||
- Version by time: a query range of user activity (7d, 30d) from a feature store | ||
|
||
We could define a new standard for versioning data, similar to semver. | ||
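As a rough illustration of what a semver-like scheme for datasets could look like, the sketch below parses strings such as `"0.0.1-sample"` (the form used in the manifest example in this proposal) into a sortable value. `DataVersion` and `parse_version` are hypothetical names, not part of envd, and the rule that a tagged version sorts before the untagged one is an assumption borrowed from semver pre-releases.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: MAJOR.MINOR.PATCH with an optional "-tag" suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.]+))?$")


@dataclass(frozen=True)
class DataVersion:
    major: int
    minor: int
    patch: int
    tag: str = ""  # e.g. "sample" for a down-scaled dataset

    def sort_key(self):
        # Assumption: compare the numeric parts first; a tagged version
        # sorts before the untagged release with the same numbers.
        return (self.major, self.minor, self.patch, self.tag == "", self.tag)


def parse_version(text: str) -> DataVersion:
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a semver-like data version: {text!r}")
    major, minor, patch, tag = match.groups()
    return DataVersion(int(major), int(minor), int(patch), tag or "")
```

Versions by scale map naturally onto the tag (`-sample`), while versions by time range would need a separate convention that this sketch does not cover.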
|
||
## Proposal | ||
|
||
Each version of a dataset is immutable. Because the data is assumed to be immutable, we can cache it and replicate it easily, which increases read throughput in multiple ways. | ||
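One way immutability could simplify caching is sketched below: since a given (name, version) pair never changes, a local copy can be reused without any staleness check. The `cached_path` function, the cache layout, and the `fetch` callback are all hypothetical and only illustrate the idea; they are not an envd API.

```python
import shutil
import tempfile
from pathlib import Path


def cached_path(cache_root: Path, name: str, version: str, fetch) -> Path:
    """Return a local copy of dataset `name` at `version`, fetching at most once.

    `fetch(dest)` is a hypothetical callback that writes the dataset
    into the directory `dest`.
    """
    final = cache_root / name / version
    if final.exists():
        # Immutable version: a cached copy can never be stale.
        return final
    final.parent.mkdir(parents=True, exist_ok=True)
    # Fetch into a staging directory, then rename, so that readers
    # never observe a partially written dataset.
    staging = Path(tempfile.mkdtemp(dir=final.parent))
    try:
        fetch(staging)
        staging.rename(final)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final
```

The same property would let replicas be created lazily: any node holding a complete copy of a version can serve reads for it.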
|
||
|
||
### Usage | ||
|
||
Users need to create the dataset beforehand, then declare the mount in the build.envd file. | ||
|
||
``` | ||
envd data add -f mnist.yaml | ||
``` | ||
Review comment on lines +56 to +58: I wonder if we can make it as an envd target so we can get rid of yaml? |
||
|
||
Users can create multiple datasets with the same name, but each one must have a different version. | ||
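The uniqueness rule above can be sketched as a registry keyed by the (name, version) pair. `DatasetRegistry` is a hypothetical in-memory stand-in for whatever store `envd data add` would use; rejecting a duplicate rather than overwriting it is an assumption that follows from versions being immutable.

```python
class DatasetRegistry:
    """Hypothetical in-memory registry keyed by (name, version)."""

    def __init__(self):
        self._entries = {}

    def add(self, name: str, version: str, manifest: dict) -> None:
        key = (name, version)
        if key in self._entries:
            # An existing version is never overwritten.
            raise ValueError(f"dataset {name!r} version {version!r} already exists")
        self._entries[key] = manifest

    def versions(self, name: str) -> list:
        # Plain string ordering; a real implementation would likely
        # order by a parsed version instead.
        return sorted(v for (n, v) in self._entries if n == name)
```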
|
||
mnist.yaml | ||
```yaml= | ||
ApiVersion: V1alpha | ||
Review comment: What is this version for? How is this different from the version below? |
||
name: mnist | ||
version: "0.0.1-sample" | ||
sources: | ||
Review comment on lines +66 to +67: What if there are multiple versions for different sources? |
||
  - type: local  # The first source is treated as the primary source; the others are replicas of it | ||
    path: ~/.torch/mnist | ||
  - type: s3 | ||
    path: xxx | ||
validation: | ||
  checksum: | ||
    - name: MD5 | ||
      value: xxxx | ||
``` | ||
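To make the `validation.checksum` entries concrete, here is a minimal sketch of how a single fetched file might be checked against them. The helper names `file_digest` and `verify` are hypothetical, and how a multi-file dataset directory would be hashed is left open, since the proposal does not specify it.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "md5") -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, checksums: list) -> bool:
    """`checksums` mirrors the manifest's list of {name, value} entries."""
    return all(
        file_digest(path, entry["name"].lower()) == entry["value"].lower()
        for entry in checksums
    )
```

Running the same check against every source would also confirm that the replicas match the primary source.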
|
||
build.envd | ||
``` | ||
def data(): | ||
    return [d.mount("mnist", target="./data")]  # Users can mount multiple datasets | ||
Review comment on lines +79 to +81: What are the different purposes of the YAML above and this envd syntax? |
||
``` |
Review comment: Is there any implementation plan for this? Are you aware of any existing solutions that could support multiple sources? This might be helpful as a reference for the range of sources: https://kubernetes.io/docs/concepts/storage/volumes/#volume-types