-
Discussion between @jieguangzhou and @blythed on a call. Let us discuss the idea of providing a "deferred" pipeline:

```python
from superduperdb.ext.transformers import lazy_pipeline

db.add(
    lazy_pipeline('llm', 'llama-2-70b')
)
```
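As a rough illustration of what "deferred" could mean here, a minimal sketch in which `lazy_pipeline` postpones the heavy `transformers.pipeline(...)` call until the first `predict`; the `LazyPipeline` class and its method names are hypothetical, not an existing SuperDuperDB API:

```python
import transformers


class LazyPipeline:
    """Hypothetical deferred wrapper: nothing is downloaded or loaded at db.add time."""

    def __init__(self, identifier, model_id, task="text-generation", **pipeline_kwargs):
        self.identifier = identifier
        self.model_id = model_id
        self.task = task
        self.pipeline_kwargs = pipeline_kwargs
        self._pipeline = None  # loaded lazily

    def _ensure_loaded(self):
        if self._pipeline is None:
            # The expensive initialization only happens on first use.
            self._pipeline = transformers.pipeline(
                self.task, model=self.model_id, **self.pipeline_kwargs
            )

    def predict(self, text):
        self._ensure_loaded()
        return self._pipeline(text)


def lazy_pipeline(identifier, model_id, **kwargs):
    return LazyPipeline(identifier, model_id, **kwargs)
```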
-
## Enhanced support for serving HUGE models

There are four ways to do this.

### API connection

Use open-source deployment projects on the deployment side.

**Deployment framework**

```bash
openllm start facebook/opt-1.3b
```

```bash
python -m vllm.entrypoints.api_server \
    --model facebook/opt-13b \
    --tensor-parallel-size 4
```
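For reference, querying the demo vLLM API server started above could look roughly like the following; the `/generate` route and JSON fields follow the vLLM demo server and should be treated as assumptions to verify against the deployed version:

```python
# Rough example of calling the demo vLLM API server started above.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Explain the difference between 'further' and 'farther'.",
        "max_tokens": 64,
    },
)
print(response.json())
```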
**Model in SuperDuperDB**

OpenLLM:

```python
import openllm


class OpenLLM(Model):
    def __init__(self, uri, *args, **kwargs):
        self.client = openllm.client.HTTPClient(uri)

    def predict(self, text):
        return self.client.query(text)
```

vLLM:

```python
import requests


class VLLMClient(Model):
    def __init__(self, uri, *args, **kwargs):
        self.uri = uri

    def predict(self, text):
        inputs = {'prompt': text}
        return requests.get(self.uri, params=inputs)
```

**RayLLM**
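Either wrapper above can then be registered like any other model; a hedged usage sketch, where the URI is a placeholder for wherever `openllm start` or the vLLM API server is listening:

```python
# Hypothetical usage: the heavy model stays behind the deployed server,
# and SuperDuperDB only stores a thin HTTP client.
model = OpenLLM(uri="http://localhost:3000")
print(model.predict("Explain the difference between 'further' and 'farther'."))

db.add(model)
```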
### The local model

Load the model directly on the main thread.

**transformers**

```python
import torch
import transformers
from transformers import AutoTokenizer

from superduperdb.ext.transformers import Pipeline

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

model = Pipeline(
    identifier='my-sentiment-analysis',
    task='text-generation',
    preprocess=tokenizer,
    object=pipeline,
    torch_dtype=torch.float16,
    device_map="auto",
)

db.add(model)
```

**vLLM**

Reference: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html
```python
from vllm import LLM

from superduperdb.ext.vllm import VLLM

llm = LLM("meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=4)

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    model=llm,
    **kwargs
)

db.add(model)
```

### The remote model

Distribute the model execution to a remote service.

**Connect to Ray**

If we need to run the model remotely, we should not actually initialize the model locally. We have to write down the parameters as class data and then send them to the remote server/cluster. We can connect to Ray directly to run the model in a Ray cluster:
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
def object_func():
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
torch_dtype=torch.float16,
device_map="auto",
)
return pipeline
model = Pipeline(
identifier='my-sentiment-analysis',
task='text-generation',
preprocess=tokenizer,
object=object_func,
torch_dtype=torch.float16,
device_map="auto",
remote='ray'
)
db.add(model) How to do this?we can add a new magic function of from ray import serve
```python
from typing import Dict

import requests
from ray import serve
from starlette.requests import Request


class RayModel:
    def __init__(self, model_func):
        # Wrap the model factory in a Ray Serve deployment and start serving it.
        self.deployment = RayDeployment.bind(model_func)
        serve.run(self.deployment, route_prefix="/")

    def _forward(self, inputs):
        return requests.get(
            "http://localhost:8000/", params={"text": inputs}
        )

    # Not sure the deployment handle can be called directly; if yes, use this:
    # def _forward(self, inputs):
    #     return self.deployment(inputs)


# 1: Wrap the pretrained model in a Serve deployment.
@serve.deployment
class RayDeployment:
    def __init__(self, model_func):
        self.model = model_func()

    def __call__(self, request: Request) -> Dict:
        return self.model(request.query_params["text"])[0]


class Model:
    def __new__(cls, *args, **kwargs):
        remote_mode = kwargs.get('remote')
        object = kwargs.get('object')
        if remote_mode == 'ray':
            # make sure the user has installed ray
            kwargs['object'] = RayModel(object)
        elif remote_mode == 'superduperdb':
            kwargs['object'] = SuperDuperRemoteModel(object)
        return super().__new__(cls)
```

Reference: https://docs.ray.io/en/latest/serve/index.html

**vLLM + Ray**

Reference: https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html

Not sure we can use an LLM instance directly to connect to Ray; this needs testing.
```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

If not, we can implement the feature like this:
```python
from vllm import LLM


class VLLM(Model):
    def __init__(self, *args, **kwargs):
        self.is_connect_to_ray = kwargs.pop('remote')
        if self.is_connect_to_ray:
            self.init_remote_model(*args, **kwargs)
        else:
            llm_params = filter_LLM_init_params(LLM)(kwargs)
            self.object = LLM(kwargs.get('object'), **llm_params)

    def init_remote_model(self, *args, **kwargs):
        # Launch the remote vLLM server / connect to the Ray cluster, e.g.
        command = "python -m vllm........."

    def _forward(self, inputs):
        if self.is_connect_to_ray:
            return self._remote_forward(inputs)
        else:
            return self.object.generate(inputs)
```
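The `filter_LLM_init_params` helper above is undefined in the snippet; a possible minimal version, assuming it simply keeps the kwargs accepted by `LLM.__init__`:

```python
import inspect


def filter_LLM_init_params(cls):
    """Return a filter that keeps only kwargs accepted by cls.__init__."""
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}

    def _filter(kwargs):
        return {k: v for k, v in kwargs.items() if k in accepted}

    return _filter
```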
The user-side code stays the same:

```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='meta-llama/Llama-2-7b-chat-hf',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

### Model service

Distribute model execution to a unified SuperDuperDB model-management service.

If we want to create a model cluster that can provide a multi-model service, we can start a SuperDuperDB server instance connected to a Ray cluster, or deploy it on a GPU server (we can use a model-deployment tool such as BentoML to manage the models).

For the client (Jupyter notebook or script):
```python
from superduperdb.ext.vllm import VLLM

model = VLLM(
    identifier='llama-2-7b',
    object="meta-llama/Llama-2-7b-chat-hf",
    remote=True,
    **kwargs
)

db.add(model)
```

The model will send a request to the SuperDuperDB server.
After the server accepts the request, the client will check whether the model is already running; all model-call functions will then request the API from the server.
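A very rough sketch of what the server side of this flow could look like; everything here (FastAPI, the route names, the in-memory registry, the `load_model_somehow` stub) is an assumption for illustration, not an existing SuperDuperDB API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_models = {}  # identifier -> loaded model or deployment handle


class AddModelRequest(BaseModel):
    identifier: str
    object: str      # e.g. a Hugging Face model id
    remote: bool = True


class PredictRequest(BaseModel):
    identifier: str
    prompt: str


def load_model_somehow(model_id: str):
    # Placeholder: in a real server this would load the model locally,
    # on a Ray cluster, or hand it to a deployment tool such as BentoML.
    raise NotImplementedError


@app.post("/models/add")
def add_model(req: AddModelRequest):
    # The client's db.add(model) would end up here.
    already_running = req.identifier in _models
    if not already_running:
        _models[req.identifier] = load_model_somehow(req.object)
    return {"status": "ok", "already_running": already_running}


@app.post("/models/predict")
def predict(req: PredictRequest):
    # Every model call on the client becomes an API request to the server.
    model = _models[req.identifier]
    return {"output": model.generate(req.prompt)}
```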
### The advantages and disadvantages

- API connection
- The local model
- The remote model
- Model service

### The plan and suggestions

Plan
-
I need to try Ray Serve to deploy the LLM model, because it didn't go well when I used it before.
-
I've had a look at the
-
Documentation on setting up a
-
One way to start on
-
## LLM Application Scenarios

Technology stack:

- Inference
- Training
-
## Inference Scenario Solutions

### vLLM + Ray Serve Deployment

vLLM's GitHub already has an adapter PR (vllm-project/vllm#1804), which can support loading multiple LoRA weights.

### Integration with SuperDuperDB

**Using a SuperDuperModel method to support Ray compute for inference**

Create a class derived from SuperDuperModel to facilitate inference computations via Ray:
from ray import serve
class SuperDuperModel:
def __new__(cls, *args, **kwargs):
instance = super(SuperDuperModel, cls).__new__(cls)
instance._init_args = args
instance._init_kwargs = kwargs
return instance
def pre_create(self):
self.__real_init()
def __real_init(self):
if run_on_ray(cfg):
self._init_on_ray()
else:
self._init_locally()
def _init_on_ray(self):
deployment_class = self._decorate_for_ray(self.__class__)
self.__class__ = deployment_class
self.__init__(*self._init_args, **self._init_kwargs)
def _decorate_for_ray(self, cls):
ray_config = self._init_kwargs.get('ray_config')
# @serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.deployment(**ray_config)
class RayDeployment(cls):
pass
return RayDeployment
def _init_locally(self):
self.__init__(*self._init_args, **self._init_kwargs)
#-----------------------------------------------
## The user side
class LLMModel(SuperDuperModel):
def __init__(self, model_name):
self.llm = LLM(model=model_name)
def predict(self, prompt):
return self.llm.generate(prompt)
model = LLMModel("model_name")
db.add(model) Direct Creation of vLLM Model in the superduperdb.ext.llm Module Create a vLLM model directly within the import ray
```python
import ray
from ray import serve
from vllm import LLM


class LLMModel(Model):
    def __init__(self, model_name):
        if check_ray_config(CFG):
            cls_ = serve.deployment(
                ray_actor_options={"num_gpus": 1}
            )(BaseVLLMCore)
        else:
            cls_ = BaseVLLMCore
        self.core = cls_(model_name)

    def predict(self, prompt):
        return self.core.predict(prompt)


class BaseVLLMCore:
    def __init__(self, model_name):
        self.llm = LLM(model=model_name)

    def predict(self, prompt):
        return self.llm.generate(prompt)


# -----------------------------------------------
## The user side
model = LLMModel("model_name")
db.add(model)
```

**Advantages and disadvantages**
The first method offers greater flexibility for different models, while the second is simpler for specifically integrating vLLM models.
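Both sketches rely on an undefined configuration check (`run_on_ray(cfg)` / `check_ray_config(CFG)`); a possible minimal version, assuming the Ray settings live on the global config object, might look like this:

```python
def check_ray_config(cfg) -> bool:
    """Hypothetical helper: decide whether to run the model on Ray.

    Assumes the config exposes an optional `ray` section, e.g.
    cfg.ray = {"address": "ray://head-node:10001", "num_gpus": 1}.
    """
    ray_cfg = getattr(cfg, "ray", None)
    return bool(ray_cfg) and bool(ray_cfg.get("address"))
```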
-
## Training Scenario Solution

I think the focus of integrating LLMs into SuperDuperDB is not about which framework or method to use, but rather about clearly understanding how users will train an LLM model on SuperDuperDB.

The first scenario is more applicable. The second scenario, however, seems less relevant at the moment, as users tend to store SFT training datasets in local files rather than in the database. For the second scenario, we prefer to guide users toward the numerous open-source frameworks and scripts available online. After they train and obtain a LoRA model, they can use SuperDuperDB for inference integration with the dataset. Therefore, we should focus on the first scenario. Based on this, I think we need to implement a scenario where we provide a link to the database, access a large number of documents, and perform continuous LLM pre-training for knowledge learning. A rough sketch of the trainer interface:
```python
class SuperDuperLLMTrainer:
    def __init__(self, train_config):
        self.train_config = train_config
        self._inference_model = None

    def train(self, db, select):
        func = route_to_func(self.train_config)
        ...

    def pretrain(self, db, select):
        dataset = create_pretrain_dataset(db, select)
        model = create_pretrain_model(self.train_config)
        # train step

    def pretrain_on_ray(self, db, select):
        ...

    def finetune(self, db, select):
        # Maybe not needed right now
        ...

    # def predict(self, *args, **kwargs):
    #     # need to discuss whether we should merge the train and predict functions into one class; maybe not
    #     # use the inference model
    #     if self._inference_model is None:
    #         self._inference_model = create_inference_model()
    #     return self._inference_model.predict(*args, **kwargs)


train_config = TrainConfig(**kwargs)
llm_trainer = SuperDuperLLMTrainer(train_config)
llm_trainer.train(db)
```
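The `create_pretrain_dataset(db, select)` step is the part that ties training to the database; a hedged sketch of what it could do, assuming documents are streamed via `db.execute(select)` and expose a `'text'` field (both assumptions for illustration):

```python
from datasets import Dataset


def create_pretrain_dataset(db, select, text_field="text"):
    """Hypothetical: stream selected documents into a Hugging Face Dataset
    suitable for continued pre-training."""

    def gen():
        for doc in db.execute(select):
            yield {"text": doc[text_field]}

    return Dataset.from_generator(gen)
```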
-
I have a question: if we use vLLM for inference, which models are supported? Are these compatible with DeepSpeed training, for instance?
-
Plans and ideas to create "easy" functionality for serving large open-source models.