-
Discussion between @jieguangzhou and @blythed on a call. Let us discuss the idea of providing a "deferred" pipeline:

```python
from superduperdb.ext.transformers import lazy_pipeline

db.add(
    lazy_pipeline('llm', 'llama-2-70b')
)
```
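As a rough illustration of what "deferred" could mean here, a minimal sketch in which `lazy_pipeline` postpones the heavy `transformers.pipeline(...)` call until the first `predict`; the `LazyPipeline` class and its method names are hypothetical, not an existing SuperDuperDB API:

```python
import transformers


class LazyPipeline:
    """Hypothetical deferred wrapper: nothing is downloaded or loaded at db.add time."""

    def __init__(self, identifier, model_id, task="text-generation", **pipeline_kwargs):
        self.identifier = identifier
        self.model_id = model_id
        self.task = task
        self.pipeline_kwargs = pipeline_kwargs
        self._pipeline = None  # loaded lazily

    def _ensure_loaded(self):
        if self._pipeline is None:
            # The expensive initialization only happens on first use.
            self._pipeline = transformers.pipeline(
                self.task, model=self.model_id, **self.pipeline_kwargs
            )

    def predict(self, text):
        self._ensure_loaded()
        return self._pipeline(text)


def lazy_pipeline(identifier, model_id, **kwargs):
    return LazyPipeline(identifier, model_id, **kwargs)
```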
-
## Enhanced support for serving HUGE models

There are four ways to do this.

### API connection

Use open-source deployment projects on the deployment side.

**Deployment framework**

```bash
openllm start facebook/opt-1.3b
```

```bash
python -m vllm.entrypoints.api_server \
    --model facebook/opt-13b \
    --tensor-parallel-size 4
```
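For reference, querying the demo vLLM API server started above could look roughly like the following; the `/generate` route and JSON fields follow the vLLM demo server and should be treated as assumptions to verify against the deployed version:

```python
# Rough example of calling the demo vLLM API server started above.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Explain the difference between 'further' and 'farther'.",
        "max_tokens": 64,
    },
)
print(response.json())
```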
**Model in SuperDuperDB**

OpenLLM:

```python
import openllm


class OpenLLM(Model):
    def __init__(self, uri, *args, **kwargs):
        self.client = openllm.client.HTTPClient(uri)

    def predict(self, text):
        return self.client.query(text)
```

vLLM:

```python
import requests


class VLLMClient(Model):
    def __init__(self, uri, *args, **kwargs):
        self.uri = uri

    def predict(self, text):
        inputs = {'prompt': text}
        return requests.get(self.uri, params=inputs)
```

**RayLLM**
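Either wrapper above can then be registered like any other model; a hedged usage sketch, where the URI is a placeholder for wherever `openllm start` or the vLLM API server is listening:

```python
# Hypothetical usage: the heavy model stays behind the deployed server,
# and SuperDuperDB only stores a thin HTTP client.
model = OpenLLM(uri="http://localhost:3000")
print(model.predict("Explain the difference between 'further' and 'farther'."))

db.add(model)
```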
### The local model

Load the model directly on the main thread.

**transformers**

```python
import torch
import transformers
from transformers import AutoTokenizer

from superduperdb.ext.transformers import Pipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

model = Pipeline(
    identifier='my-sentiment-analysis',
    task='text-generation',
    preprocess=tokenizer,
    object=pipeline,
    torch_dtype=torch.float16,
    device_map="auto",
)

db.add(model)
```

**vLLM**

Reference: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
```python
from vllm import LLM

from superduperdb.ext.vllm import VLLM

llm = LLM("meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=4)

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    model=llm,
    **kwargs
)

db.add(model)
```

### The remote model

Distribute the model execution to a remote service.

**Connect to Ray**

If we need to run the model remotely, we should not actually initialize the model locally. We have to write down the parameters as class data and then send them to the remote server/cluster. We can connect to Ray directly to run the model in a Ray cluster:
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
def object_func():
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
torch_dtype=torch.float16,
device_map="auto",
)
return pipeline
model = Pipeline(
identifier='my-sentiment-analysis',
task='text-generation',
preprocess=tokenizer,
object=object_func,
torch_dtype=torch.float16,
device_map="auto",
remote='ray'
)
db.add(model) How to do this?we can add a new magic function of from ray import serve
```python
from typing import Dict

import requests
from ray import serve
from starlette.requests import Request


class RayModel:
    def __init__(self, model_func):
        # Wrap the model factory in a Ray Serve deployment and start serving it.
        self.deployment = RayDeployment.bind(model_func)
        serve.run(self.deployment, route_prefix="/")

    def _forward(self, inputs):
        return requests.get(
            "http://localhost:8000/", params={"text": inputs}
        )

    # Not sure the deployment handle can be called directly; if yes, use this:
    # def _forward(self, inputs):
    #     return self.deployment(inputs)


# 1: Wrap the pretrained model in a Serve deployment.
@serve.deployment
class RayDeployment:
    def __init__(self, model_func):
        self.model = model_func()

    def __call__(self, request: Request) -> Dict:
        return self.model(request.query_params["text"])[0]


class Model:
    def __new__(cls, *args, **kwargs):
        remote_mode = kwargs.get('remote')
        object = kwargs.get('object')
        if remote_mode == 'ray':
            # make sure the user has installed ray
            kwargs['object'] = RayModel(object)
        elif remote_mode == 'superduperdb':
            kwargs['object'] = SuperDuperRemoteModel(object)
        return super().__new__(cls)
```

Reference: https://docs.ray.io/en/latest/serve/index.html

**vLLM + Ray**

Reference: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html

Not sure we can use an LLM instance directly to connect to Ray; this needs testing.
```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

If not, we can implement the feature like this:
```python
from vllm import LLM


class VLLM(Model):
    def __init__(self, *args, **kwargs):
        self.is_connect_to_ray = kwargs.pop('remote')
        if self.is_connect_to_ray:
            self.init_remote_model(*args, **kwargs)
        else:
            llm_params = filter_LLM_init_params(LLM)(kwargs)
            self.object = LLM(kwargs.get('object'), **llm_params)

    def init_remote_model(self, *args, **kwargs):
        # Launch the remote vLLM server / connect to the Ray cluster, e.g.
        command = "python -m vllm........."

    def _forward(self, inputs):
        if self.is_connect_to_ray:
            return self._remote_forward(inputs)
        else:
            return self.object.generate(inputs)
```
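The `filter_LLM_init_params` helper above is undefined in the snippet; a possible minimal version, assuming it simply keeps the kwargs accepted by `LLM.__init__`:

```python
import inspect


def filter_LLM_init_params(cls):
    """Return a filter that keeps only kwargs accepted by cls.__init__."""
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}

    def _filter(kwargs):
        return {k: v for k, v in kwargs.items() if k in accepted}

    return _filter
```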
The user-side code stays the same:

```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

### Model service

Distribute model execution to a unified SuperDuperDB model-management service.

If we want to create a model cluster that can provide a multi-model service, we can start a SuperDuperDB server instance connected to a Ray cluster, or deploy it on a GPU server (we can use a model-deployment tool such as BentoML to manage the models).

For the client (Jupyter notebook or script):
```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='llama-2-7b',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

The model will send a request to the SuperDuperDB server.
After the server accepts the request, the client will check whether the model is already running; all model-call functions will then request the API from the server.
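A very rough sketch of what the server side of this flow could look like; everything here (FastAPI, the route names, the in-memory registry, the `load_model_somehow` stub) is an assumption for illustration, not an existing SuperDuperDB API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_models = {}  # identifier -> loaded model or deployment handle


class AddModelRequest(BaseModel):
    identifier: str
    object: str      # e.g. a Hugging Face model id
    remote: bool = True


class PredictRequest(BaseModel):
    identifier: str
    prompt: str


def load_model_somehow(model_id: str):
    # Placeholder: in a real server this would load the model locally,
    # on a Ray cluster, or hand it to a deployment tool such as BentoML.
    raise NotImplementedError


@app.post("/models/add")
def add_model(req: AddModelRequest):
    # The client's db.add(model) would end up here.
    already_running = req.identifier in _models
    if not already_running:
        _models[req.identifier] = load_model_somehow(req.object)
    return {"status": "ok", "already_running": already_running}


@app.post("/models/predict")
def predict(req: PredictRequest):
    # Every model call on the client becomes an API request to the server.
    model = _models[req.identifier]
    return {"output": model.generate(req.prompt)}
```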
### The advantages and disadvantages

- API connection
- The local model
- The remote model
- Model service

### The plan and suggestions

Plan
-
I need to try Ray Serve to deploy the LLM model, because it didn't go well when I used it before.
-
I've had a look at the
-
Documentation on setting up a
-
One way to start on
-
## LLM Application Scenarios

Technology stack:

- Inference
- Training
-
## Inference Scenario Solutions

### vLLM + Ray Serve Deployment

vLLM's GitHub already has an adapter PR (vllm-project/vllm#1804), which can support loading multiple LoRA weights.

### Integration with SuperDuperDB

**Using a SuperDuperModel method to support Ray compute for inference**

Create a class derived from SuperDuperModel to facilitate inference computations via Ray:
from ray import serve
class SuperDuperModel:
def __new__(cls, *args, **kwargs):
instance = super(SuperDuperModel, cls).__new__(cls)
instance._init_args = args
instance._init_kwargs = kwargs
return instance
def pre_create(self):
self.__real_init()
def __real_init(self):
if run_on_ray(cfg):
self._init_on_ray()
else:
self._init_locally()
def _init_on_ray(self):
deployment_class = self._decorate_for_ray(self.__class__)
self.__class__ = deployment_class
self.__init__(*self._init_args, **self._init_kwargs)
def _decorate_for_ray(self, cls):
ray_config = self._init_kwargs.get('ray_config')
# @serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.deployment(**ray_config)
class RayDeployment(cls):
pass
return RayDeployment
def _init_locally(self):
self.__init__(*self._init_args, **self._init_kwargs)
#-----------------------------------------------
## The user side
class LLMModel(SuperDuperModel):
def __init__(self, model_name):
self.llm = LLM(model=model_name)
def predict(self, prompt):
return self.llm.generate(prompt)
model = LLMModel("model_name")
db.add(model) Direct Creation of vLLM Model in the superduperdb.ext.llm Module Create a vLLM model directly within the import ray
```python
import ray
from ray import serve
from vllm import LLM


class LLMModel(Model):
    def __init__(self, model_name):
        if check_ray_config(CFG):
            cls_ = serve.deployment(
                ray_actor_options={"num_gpus": 1}
            )(BaseVLLMCore)
        else:
            cls_ = BaseVLLMCore
        self.core = cls_(model_name)

    def predict(self, prompt):
        return self.core.predict(prompt)


class BaseVLLMCore:
    def __init__(self, model_name):
        self.llm = LLM(model=model_name)

    def predict(self, prompt):
        return self.llm.generate(prompt)


# -----------------------------------------------
## The user side
model = LLMModel("model_name")
db.add(model)
```

**Advantages and disadvantages**
The first method offers greater flexibility for different models, while the second is simpler for specifically integrating vLLM models.
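Both sketches rely on an undefined configuration check (`run_on_ray(cfg)` / `check_ray_config(CFG)`); a possible minimal version, assuming the Ray settings live on the global config object, might look like this:

```python
def check_ray_config(cfg) -> bool:
    """Hypothetical helper: decide whether to run the model on Ray.

    Assumes the config exposes an optional `ray` section, e.g.
    cfg.ray = {"address": "ray://head-node:10001", "num_gpus": 1}.
    """
    ray_cfg = getattr(cfg, "ray", None)
    return bool(ray_cfg) and bool(ray_cfg.get("address"))
```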
-
## Training Scenario Solution

I think the focus of integrating LLMs into SuperDuperDB is not about which framework or method to use, but rather about clearly understanding how users will train an LLM model on SuperDuperDB.

The first scenario is more applicable. The second scenario, however, seems less relevant at the moment, as users tend to store SFT training datasets in local files rather than in the database. For the second scenario, we prefer to guide users toward the numerous open-source frameworks and scripts available online. After they train and obtain a LoRA model, they can use SuperDuperDB for inference integration with the dataset. Therefore, we should focus on the first scenario. Based on this, I think we need to implement a scenario where we provide a link to the database, access a large number of documents, and perform continuous LLM pre-training for knowledge learning. A rough sketch of the trainer interface:
```python
class SuperDuperLLMTrainer:
    def __init__(self, train_config):
        self.train_config = train_config
        self._inference_model = None

    def train(self, db, select):
        func = route_to_func(self.train_config)
        ...

    def pretrain(self, db, select):
        dataset = create_pretrain_dataset(db, select)
        model = create_pretrain_model(self.train_config)
        # train step

    def pretrain_on_ray(self, db, select):
        ...

    def finetune(self, db, select):
        # Maybe not needed right now
        ...

    # def predict(self, *args, **kwargs):
    #     # need to discuss whether we should merge the train and predict functions into one class; maybe not
    #     # use the inference model
    #     if self._inference_model is None:
    #         self._inference_model = create_inference_model()
    #     return self._inference_model.predict(*args, **kwargs)


train_config = TrainConfig(**kwargs)
llm_trainer = SuperDuperLLMTrainer(train_config)
llm_trainer.train(db)
```
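The `create_pretrain_dataset(db, select)` step is the part that ties training to the database; a hedged sketch of what it could do, assuming documents are streamed via `db.execute(select)` and expose a `'text'` field (both assumptions for illustration):

```python
from datasets import Dataset


def create_pretrain_dataset(db, select, text_field="text"):
    """Hypothetical: stream selected documents into a Hugging Face Dataset
    suitable for continued pre-training."""

    def gen():
        for doc in db.execute(select):
            yield {"text": doc[text_field]}

    return Dataset.from_generator(gen)
```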
-
I have a question: if we use vLLM for inference, which models are supported? Are these compatible with DeepSpeed training, for instance?
-
Plans and ideas to create "easy" functionality for serving large open-source models.