Commit 88bb4de
- [Docs]: Added vLLM example
- [Landing]: Replaced the highlights section with examples
peterschmidt85 committed Jul 23, 2023
1 parent 416212b commit 88bb4de
Showing 6 changed files with 213 additions and 184 deletions.
Binary file removed docs/assets/images/dstack-llmchat-welcome.png
57 changes: 14 additions & 43 deletions docs/examples/llmchat.md
# Chatbot

This [example](https://github.com/deep-diver/LLM-As-Chatbot) is built by Chansung Park. It can run any open-source LLM either as a Gradio chat app or as a Discord bot.
To try this example with `dstack`, follow the instructions below.

## Prerequisites

!!! info "NOTE:"
    Before using `dstack` with a particular cloud, make sure to [configure](../docs/guides/projects.md) the corresponding project.

Each LLM model requires specific resources. To inform `dstack` about the required resources, you need to
[define](../docs/reference/profiles.yml.md) a profile via the `.dstack/profiles.yaml` file within your project.

Below is a profile that will provision a cloud instance with `24GB` of memory and a `T4` GPU in the `gcp` project.

<div editor-title=".dstack/profiles.yml">

```yaml
profiles:
  - name: gcp-t4
    project: gcp
    resources:
      memory: 24GB
      gpu:
        name: T4
    default: true
```
</div>

If you use this profile, `dstack` will use the project named `gcp` and provision a cloud instance with an NVIDIA T4 GPU.

## Run a Gradio app

Here's the configuration that runs the Gradio app:
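
The configuration file itself is collapsed in this diff view. As a rough, hypothetical sketch only — the dependency step and entry point below are assumptions, not the example's actual file — it might look something like this (port `6006` matches the run output below):

<div editor-title="gradio.dstack.yml">

```yaml
type: task

ports:
  - 6006

commands:
  - pip install -r requirements.txt  # Assumed: install the app's dependencies
  - python app.py                    # Hypothetical entry point serving Gradio on port 6006
```

</div>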
Here's how you run it with `dstack`:

<div class="termy">

```shell
$ dstack run . -f gradio.dstack.yml
dstack will execute the following plan:
 CONFIGURATION      PROJECT  INSTANCE      RESOURCES              SPOT
 gradio.dstack.yml  gcp      n1-highmem-2  2xCPUs, 13312MB, 1xT4  auto
Continue? [y/n]: y
Provisioning and establishing an SSH tunnel...
Running on local URL: http://127.0.0.1:6006
To interrupt, press Ctrl+C...
```

</div>

`dstack` will provision the cloud instance, run the task, and forward the defined ports to your local
machine for secure and convenient access.

![](../assets/images/dstack-llmchat-gallery.png){ width=800 }

!!! info "NOTE:"
    To use a non-default profile, specify its name with `--profile NAME` when using `dstack run`.
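
For example, to run the Gradio app with the `gcp-t4` profile defined above:

<div class="termy">

```shell
$ dstack run . -f gradio.dstack.yml --profile gcp-t4
```

</div>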

## Run a Discord bot

Here's the configuration that runs the Discord bot:
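
The actual `discord.dstack.yml` is likewise collapsed in this diff view. A minimal sketch, assuming the bot reads its token from an environment variable and starts from a `discord_app.py` entry point (both hypothetical):

<div editor-title="discord.dstack.yml">

```yaml
type: task

env:
  # (Hypothetical) Token for your Discord bot
  - DISCORD_BOT_TOKEN=

commands:
  - pip install -r requirements.txt  # Assumed: install the bot's dependencies
  - python discord_app.py            # Hypothetical entry point that starts the bot
```

</div>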

Finally, here's how you run it with `dstack`:

<div class="termy">

```shell
$ dstack run . -f discord.dstack.yml
dstack will execute the following plan:
 CONFIGURATION       PROJECT  INSTANCE      RESOURCES              SPOT
 discord.dstack.yml  gcp      n1-highmem-2  2xCPUs, 13312MB, 1xT4  auto
Continue? [y/n]: y
Provisioning...
To interrupt, press Ctrl+C...
```

</div>

Once you confirm, `dstack` will provision the cloud instance and run the task. When the bot is up, you can freely send
messages to it via Discord.

![](../assets/images/dstack-llmchat-discord-chat.png){ width=800 }

For advanced commands supported by the bot, check the [README](https://github.com/deep-diver/LLM-As-Chatbot#discord-bot) file.

106 changes: 106 additions & 0 deletions docs/examples/vllm.md
# vLLM

Serving LLMs can be slow, even on expensive hardware. This example demonstrates how to use the
[`vllm`](https://vllm.ai/) library to serve LLMs with optimized performance.

## What is vLLM?

`vllm` is an open-source library that significantly increases LLM throughput, thanks to the optimized memory-sharing
algorithm called PagedAttention.

The library also offers other benefits, such as continuous batching,
GPU parallelism, streaming output, OpenAI compatibility, and more.
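
To get a feel for the library itself, here is a minimal offline-inference sketch using `vllm`'s Python API (illustrative only; it is not part of the original example, and the model mirrors the endpoint example below):

```python
from vllm import LLM, SamplingParams

# Load the model; vLLM manages the KV cache with PagedAttention
llm = LLM(model="facebook/opt-125m")

# Greedy decoding, matching the endpoint query below
params = SamplingParams(temperature=0, max_tokens=7)

# Prompts are batched automatically for higher throughput
outputs = llm.generate(["San Francisco is a"], params)
print(outputs[0].outputs[0].text)
```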

To try `vllm` with `dstack`, follow the instructions below.

## Prerequisites

!!! info "NOTE:"
    Before using `dstack` with a particular cloud, make sure to [configure](../docs/guides/projects.md) the corresponding project.

Each LLM model requires specific resources. To inform `dstack` about the required resources, you need to
[define](../docs/reference/profiles.yml.md) a profile via the `.dstack/profiles.yaml` file within your project.

Below is a profile that will provision a cloud instance with `24GB` of memory and a `T4` GPU in the `gcp` project.

<div editor-title=".dstack/profiles.yml">

```yaml
profiles:
  - name: gcp-t4
    project: gcp
    resources:
      memory: 24GB
      gpu:
        name: T4
    default: true
```
</div>

## Run an endpoint

Here's the configuration that runs an LLM as an OpenAI-compatible endpoint:

<div editor-title="vllm/serve.dstack.yml">

```yaml
type: task

env:
  # (Required) Specify the name of the model
  - MODEL=facebook/opt-125m
  # (Optional) Specify your Hugging Face token
  - HUGGING_FACE_HUB_TOKEN=

ports:
  - 8000

commands:
  - conda install cuda  # Required since vLLM will rebuild the CUDA kernel
  - pip install vllm    # Takes 5-10 minutes
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
```
</div>

Here's how you run it with `dstack`:

<div class="termy">

```shell
$ dstack run . -f vllm/serve.dstack.yml
```

</div>

`dstack` will provision the cloud instance, run the task, and forward the defined ports to your local
machine for secure and convenient access.

Now, you can query the endpoint in the same format as the OpenAI API:

<div class="termy">

```shell
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```

</div>
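
Alternatively, you can query it from Python with the `openai` package — a sketch using the pre-1.0 client interface current at the time of this example (not part of the original docs):

```python
import openai

# Point the client at the vLLM server; the key is typically not checked
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

completion = openai.Completion.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)
```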

For more details on how `vllm` works, check their [documentation](https://vllm.readthedocs.io/).

[Source code](https://github.com/dstackai/dstack-examples){ .md-button .md-button--github }

## Limitations

To use `vllm` with `dstack`, be aware of the following limitations:

1. The `vllm` library currently supports a [limited set](https://vllm.readthedocs.io/en/latest/models/supported_models.html) of LLMs, including Llama 2.
2. The `vllm` library lacks quantization support. Check the progress [here](https://github.com/vllm-project/vllm/issues/316).
26 changes: 22 additions & 4 deletions docs/overrides/examples.html
<h2>Examples</h2>
</div>

<div class="tx-landing__highlights_grid">
<a href="vllm">
<div class="feature-cell">
<div class="feature-icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="-3 -3 27 27">
<path d="m13.13 22.19-1.63-3.83c1.57-.58 3.04-1.36 4.4-2.27l-2.77 6.1M5.64 12.5l-3.83-1.63 6.1-2.77C7 9.46 6.22 10.93 5.64 12.5M21.61 2.39S16.66.269 11 5.93c-2.19 2.19-3.5 4.6-4.35 6.71-.28.75-.09 1.57.46 2.13l2.13 2.12c.55.56 1.37.74 2.12.46A19.1 19.1 0 0 0 18.07 13c5.66-5.66 3.54-10.61 3.54-10.61m-7.07 7.07c-.78-.78-.78-2.05 0-2.83s2.05-.78 2.83 0c.77.78.78 2.05 0 2.83-.78.78-2.05.78-2.83 0m-5.66 7.07-1.41-1.41 1.41 1.41M6.24 22l3.64-3.64c-.34-.09-.67-.24-.97-.45L4.83 22h1.41M2 22h1.41l4.77-4.76-1.42-1.41L2 20.59V22m0-2.83 4.09-4.08c-.21-.3-.36-.62-.45-.97L2 17.76v1.41Z"></path>
</svg>
</div>
<h3>
vLLM
</h3>

<p>
Serve open-source LLMs as OpenAI-compatible APIs with up to 24 times higher throughput using the vLLM library.
</p>
</div>
</a>

<a href="llmchat">
<div class="feature-cell">
<div class="feature-icon">
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24">
<path d="M12 3c5.5 0 10 3.58 10 8s-4.5 8-10 8c-1.24 0-2.43-.18-3.53-.5C5.55 21 2 21 2 21c2.33-2.33 2.7-3.9 2.75-4.5C3.05 15.07 2 13.13 2 11c0-4.42 4.5-8 10-8m5 9v-2h-2v2h2m-4 0v-2h-2v2h2m-4 0v-2H7v2h2Z"></path>
</svg>
</div>
<h3>
Chatbot
</h3>

<p>
Run an open-source LLM of your choice, either as a Gradio app or as a Discord bot, with
internet search capability.
</p>
</div>
</a>