
[SDK] Train API #1962

Merged · 33 commits · Jan 10, 2024

Commits (33)
61d6d2d
adding constant for init container name
deepanker13 Nov 30, 2023
be49dfd
github workflow fixes
deepanker13 Dec 1, 2023
dc37467
removing constants file changes from this pr
deepanker13 Dec 1, 2023
bca36bd
code review changes
deepanker13 Dec 5, 2023
1882c81
initial skeleton of train api
deepanker13 Dec 7, 2023
3b4a1e8
train api updated
deepanker13 Dec 11, 2023
38b0a6a
fixes
deepanker13 Dec 15, 2023
177b499
code review changes
deepanker13 Dec 15, 2023
9ada882
code review changes
deepanker13 Dec 21, 2023
3ea7017
code review changes
deepanker13 Dec 21, 2023
db4d92d
code review changes
deepanker13 Jan 4, 2024
98b5c40
fixing python library requirements
deepanker13 Jan 4, 2024
fb26591
adding hugging face dataset download class
deepanker13 Dec 21, 2023
5c2a61a
code review changes
deepanker13 Jan 4, 2024
d979aac
fixing github workflow
deepanker13 Jan 4, 2024
d7619b2
code review comments
deepanker13 Jan 4, 2024
da9f7b3
import fixes
deepanker13 Jan 4, 2024
2687385
integration test fix for python3.7
deepanker13 Jan 5, 2024
c281de5
torch version fix for python3.7
deepanker13 Jan 5, 2024
722dc4b
removing unused variable
deepanker13 Jan 5, 2024
e4e9c79
fixing library versions for python3.7
deepanker13 Jan 5, 2024
9739d53
removing alpine distribution
deepanker13 Jan 5, 2024
ec461a1
removing torch ad dependency
deepanker13 Jan 5, 2024
c27159b
removing literal usage as python 3.7 doesn't support it
deepanker13 Jan 5, 2024
cc916f1
adding types.py
deepanker13 Jan 9, 2024
76fb00b
ci fix
deepanker13 Jan 9, 2024
2b72670
storage init container changes, fixing imports
deepanker13 Jan 10, 2024
a449ece
adding extra requires in setup.py, fixing ci
deepanker13 Jan 10, 2024
0de6aa4
adding commit to retrigger go test
deepanker13 Jan 10, 2024
4488d02
renaming folder to storage initializer
deepanker13 Jan 10, 2024
b2dbcd0
bug fix
deepanker13 Jan 10, 2024
4dd21c7
removing extra gpu check as discussed with johnu
deepanker13 Jan 10, 2024
ac3639e
retriggering ci
deepanker13 Jan 10, 2024
Files changed
3 changes: 3 additions & 0 deletions .github/workflows/publish-core-images.yaml
@@ -24,3 +24,6 @@ jobs:
         dockerfile: build/images/training-operator/Dockerfile
       - component-name: kubectl-delivery
         dockerfile: build/images/kubectl-delivery/Dockerfile
+      - component-name: storage-initializer
+        dockerfile: sdk/python/kubeflow/storage_initializer/Dockerfile
+        context: sdk/python/kubeflow/storage_initializer
6 changes: 4 additions & 2 deletions .github/workflows/test-python.yaml
@@ -23,7 +23,9 @@ jobs:
           src: sdk/

       - name: Install dependencies
-        run: pip install pytest python-dateutil urllib3 kubernetes
-
+        run: |
+          pip install pytest python-dateutil urllib3 kubernetes
+          pip install -U './sdk/python[huggingface]'
+
       - name: Run unit test for training sdk
         run: pytest ./sdk/python/kubeflow/training/api/training_client_test.py
10 changes: 10 additions & 0 deletions manifests/overlays/kubeflow/kubeflow-training-roles.yaml
@@ -47,6 +47,16 @@ rules:
       - paddlejobs/status
     verbs:
       - get
+  - apiGroups:
+      - ""
+    resources:
+      - persistentvolumeclaims
+    verbs:
+      - create
+      - delete
+      - get
+      - list
+      - watch

 ---
 apiVersion: rbac.authorization.k8s.io/v1
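The new persistentvolumeclaims verbs are added so the Train API flow can create and clean up the PVC that holds the downloaded model and dataset. As a hedged sketch of the equivalent call through the official Kubernetes Python client (the claim name, namespace, and size are placeholders, not values from this PR):

from kubernetes import client, config

# Hypothetical sketch: provisioning the shared PVC that the storage
# initializer fills and the training pods mount. All names are placeholders.
config.load_kube_config()
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="train-job-storage"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)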
59 changes: 0 additions & 59 deletions sdk/python/kubeflow/storage_init_container/hugging_face.py

This file was deleted.

6 changes: 0 additions & 6 deletions sdk/python/kubeflow/storage_init_container/requirements.txt

This file was deleted.

42 changes: 0 additions & 42 deletions sdk/python/kubeflow/storage_init_container/storage.py

This file was deleted.

sdk/python/kubeflow/storage_initializer/Dockerfile
@@ -5,7 +5,7 @@ FROM python:3.11
 WORKDIR /app

 # Copy the Python package and its source code into the container
-COPY . /app/storage
+COPY . /app/storage_initializer

 # Copy the requirements.txt file into the container
 COPY requirements.txt /app/requirements.txt
@@ -14,4 +14,4 @@ COPY requirements.txt /app/requirements.txt
 RUN pip install --no-cache-dir -r requirements.txt

 # Run storage.py when the container launches
-ENTRYPOINT ["python", "storage/storage.py"]
+ENTRYPOINT ["python", "-m", "storage_initializer.storage"]
sdk/python/kubeflow/storage_initializer/abstract_model_provider.py
@@ -7,5 +7,5 @@ def load_config(self):
         pass

     @abstractmethod
-    def download_model(self):
+    def download_model_and_tokenizer(self):
         pass
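The rename makes the provider contract explicit: an implementation must fetch both the model weights and the tokenizer. A minimal hypothetical implementation, shown only to illustrate the interface (the import path assumes the installed SDK package layout):

from kubeflow.storage_initializer.abstract_model_provider import modelProvider


class LocalDebugProvider(modelProvider):
    def load_config(self, serialised_args):
        # Keep the serialized JSON string; a real provider would parse it.
        self.args = serialised_args

    def download_model_and_tokenizer(self):
        # Nothing to fetch; assumes the model is baked into the image.
        print("skipping download")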
3 changes: 3 additions & 0 deletions sdk/python/kubeflow/storage_initializer/constants.py
@@ -0,0 +1,3 @@
INIT_CONTAINER_MOUNT_PATH = "/workspace"
VOLUME_PATH_DATASET = INIT_CONTAINER_MOUNT_PATH + "/dataset"
VOLUME_PATH_MODEL = INIT_CONTAINER_MOUNT_PATH + "/model"
87 changes: 87 additions & 0 deletions sdk/python/kubeflow/storage_initializer/hugging_face.py
@@ -0,0 +1,87 @@
from dataclasses import dataclass, field
import transformers
from peft import LoraConfig
from urllib.parse import urlparse
import json, os
from typing import Union
from .constants import VOLUME_PATH_DATASET, VOLUME_PATH_MODEL
from .abstract_model_provider import modelProvider
from .abstract_dataset_provider import datasetProvider


TRANSFORMER_TYPES = Union[
    transformers.AutoModelForSequenceClassification,
    transformers.AutoModelForTokenClassification,
    transformers.AutoModelForQuestionAnswering,
    transformers.AutoModelForCausalLM,
    transformers.AutoModelForMaskedLM,
    transformers.AutoModelForImageClassification,
]


@dataclass
class HuggingFaceModelParams:
    model_uri: str
    transformer_type: TRANSFORMER_TYPES
    access_token: str = None

    def __post_init__(self):
        # Custom checks or validations can be added here
        if self.model_uri == "" or self.model_uri is None:
            raise ValueError("model_uri cannot be empty.")


@dataclass
class HuggingFaceTrainParams:
    training_parameters: transformers.TrainingArguments = field(
        default_factory=transformers.TrainingArguments
    )
    lora_config: LoraConfig = field(default_factory=LoraConfig)


class HuggingFace(modelProvider):
    def load_config(self, serialised_args):
        # implementation for loading the config
        self.config = HuggingFaceModelParams(**json.loads(serialised_args))

    def download_model_and_tokenizer(self):
        # implementation for downloading the model
        print("downloading model")
        transformer_type_class = getattr(transformers, self.config.transformer_type)
        parsed_uri = urlparse(self.config.model_uri)
        self.model = parsed_uri.netloc + parsed_uri.path
        transformer_type_class.from_pretrained(
            self.model,
            token=self.config.access_token,
            cache_dir=VOLUME_PATH_MODEL,
            trust_remote_code=True,
        )
        transformers.AutoTokenizer.from_pretrained(
            self.model, cache_dir=VOLUME_PATH_MODEL
        )


@dataclass
class HfDatasetParams:
    repo_id: str
    access_token: str = None

    def __post_init__(self):
        # Custom checks or validations can be added here
        if self.repo_id == "" or self.repo_id is None:
            raise ValueError("repo_id is None")


class HuggingFaceDataset(datasetProvider):
    def load_config(self, serialised_args):
        self.config = HfDatasetParams(**json.loads(serialised_args))

    def download_dataset(self):
        print("downloading dataset")
        import huggingface_hub
        from datasets import load_dataset

        if self.config.access_token:
            huggingface_hub.login(self.config.access_token)

        load_dataset(self.config.repo_id, cache_dir=VOLUME_PATH_DATASET)
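A hedged sketch of driving the provider directly: load_config takes a JSON string, and transformer_type is resolved by name via getattr(transformers, ...), so it is serialized as a class-name string. The hf:// URI and repo id below are illustrative; urlparse strips the scheme, so "hf://gpt2" resolves to the repo id "gpt2".

import json

from kubeflow.storage_initializer.hugging_face import HuggingFace

# Placeholder model URI; access_token is optional for public models.
params = json.dumps(
    {
        "model_uri": "hf://gpt2",
        "transformer_type": "AutoModelForCausalLM",
        "access_token": None,
    }
)

provider = HuggingFace()
provider.load_config(params)
provider.download_model_and_tokenizer()  # caches under /workspace/model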
8 changes: 8 additions & 0 deletions sdk/python/kubeflow/storage_initializer/requirements.txt
@@ -0,0 +1,8 @@
einops>=0.6.1
transformers_stream_generator==0.0.4
boto3==1.33.9
transformers>=4.20.0
peft>=0.3.0
huggingface_hub==0.16.4
datasets>=2.13.2

sdk/python/kubeflow/storage_initializer/s3.py
@@ -1,19 +1,19 @@
-from abstract_dataset_provider import datasetProvider
 from dataclasses import dataclass, field
-import json
+import json, os
 import boto3
 from urllib.parse import urlparse
+from .abstract_dataset_provider import datasetProvider
+from .constants import VOLUME_PATH_DATASET


 @dataclass
 class S3DatasetParams:
-    access_key: str
-    secret_key: str
     endpoint_url: str
     bucket_name: str
     file_key: str
-    region_name: str
-    download_dir: str = field(default="/workspace/datasets")
+    region_name: str = None
+    access_key: str = None
+    secret_key: str = None

     def is_valid_url(self, url):
         try:
@@ -50,6 +50,8 @@ def download_dataset(self):

         # Download the file
         s3_client.download_file(
-            self.config.bucket_name, self.config.file_key, self.config.download_dir
+            self.config.bucket_name,
+            self.config.file_key,
+            os.path.join(VOLUME_PATH_DATASET, self.config.file_key),
         )
-        print(f"File downloaded to: {self.config.download_dir}")
+        print(f"File downloaded to: {VOLUME_PATH_DATASET}")
50 changes: 50 additions & 0 deletions sdk/python/kubeflow/storage_initializer/storage.py
@@ -0,0 +1,50 @@
import argparse
from .hugging_face import HuggingFace, HuggingFaceDataset
from .s3 import S3


def model_factory(model_provider, model_provider_parameters):
    match model_provider:
        case "hf":
            hf = HuggingFace()
            hf.load_config(model_provider_parameters)
            hf.download_model_and_tokenizer()
        case _:
            return "This is the default case"


def dataset_factory(dataset_provider, dataset_provider_parameters):
    match dataset_provider:
        case "s3":
            s3 = S3()
            s3.load_config(dataset_provider_parameters)
            s3.download_dataset()
        case "hf":
            hf = HuggingFaceDataset()
            hf.load_config(dataset_provider_parameters)
            hf.download_dataset()
        case _:
            return "This is the default case"


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="script for downloading model and datasets to PVC."
    )
    parser.add_argument("--model_provider", type=str, help="name of model provider")
    parser.add_argument(
        "--model_provider_parameters",
        type=str,
        help="model provider serialised arguments",
    )

    parser.add_argument("--dataset_provider", type=str, help="name of dataset provider")
    parser.add_argument(
        "--dataset_provider_parameters",
        type=str,
        help="dataset provider serialised arguments",
    )
    args = parser.parse_args()

    model_factory(args.model_provider, args.model_provider_parameters)
    dataset_factory(args.dataset_provider, args.dataset_provider_parameters)
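Inside the init container this module runs through the Dockerfile ENTRYPOINT, i.e. python -m storage_initializer.storage --model_provider hf --model_provider_parameters '<json>' --dataset_provider hf --dataset_provider_parameters '<json>'. The same flow can be exercised in-process; a sketch assuming the SDK is installed with the huggingface extra, with placeholder repo ids:

import json

from kubeflow.storage_initializer.storage import dataset_factory, model_factory

# Both calls download into the /workspace volume paths from constants.py.
model_factory(
    "hf",
    json.dumps({"model_uri": "hf://gpt2", "transformer_type": "AutoModelForCausalLM"}),
)
dataset_factory("hf", json.dumps({"repo_id": "imdb"}))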