# vLLM

Serving LLMs can be slow, even on expensive hardware. This example demonstrates how to use the
[`vllm`](https://vllm.ai/) library to serve LLMs with optimized performance.

## What is vLLM?

`vllm` is an open-source library that significantly increases LLM throughput, thanks to an optimized memory-sharing
algorithm called PagedAttention.

The library also offers other benefits, such as continuous batching,
GPU parallelism, streaming output, OpenAI compatibility, and more.

To try `vllm` with `dstack`, follow the instructions below.
## Prerequisites

!!! info "NOTE:"
    Before using `dstack` with a particular cloud, make sure to [configure](../docs/guides/projects.md) the corresponding project.

Each LLM model requires specific resources. To inform `dstack` about the required resources, you need to
[define](../docs/reference/profiles.yml.md) a profile via the `.dstack/profiles.yml` file within your project.

Below is a profile that provisions a cloud instance with `24GB` of memory and a `T4` GPU in the `gcp` project.
<div editor-title=".dstack/profiles.yml">

```yaml
profiles:
  - name: gcp-t4
    project: gcp
    resources:
      memory: 24GB
      gpu:
        name: T4
    default: true
```

</div>
## Run an endpoint

Here's the configuration that runs an LLM as an OpenAI-compatible endpoint:

<div editor-title="vllm/serve.dstack.yml">
```yaml
type: task

env:
  # (Required) Specify the name of the model
  - MODEL=facebook/opt-125m
  # (Optional) Specify your Hugging Face token
  - HUGGING_FACE_HUB_TOKEN=

ports:
  - 8000

commands:
  - conda install cuda  # Required since vLLM will rebuild the CUDA kernel
  - pip install vllm  # Takes 5-10 minutes
  - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
```

</div>
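If the model doesn't fit on a single GPU, vLLM's GPU parallelism can be used to shard it across several. Below is a hypothetical variant of the last command, using vLLM's `--tensor-parallel-size` option and assuming the provisioned instance actually has two GPUs (the profile above requests a single `T4`):

```shell
# Hypothetical variant: shard the model across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000 --tensor-parallel-size 2
```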
Here's how you run it with `dstack`:

<div class="termy">

```shell
$ dstack run . -f vllm/serve.dstack.yml
```

</div>

`dstack` will provision the cloud instance, run the task, and forward the defined ports to your local
machine for secure and convenient access.
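
Once the task is up, you can do a quick sanity check that the endpoint is reachable. This is a minimal sketch, assuming the server exposes the OpenAI-style `/v1/models` route:

<div class="termy">

```shell
$ curl http://localhost:8000/v1/models
```

</div>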
Now, you can query the endpoint in the same format as the OpenAI API:

<div class="termy">

```shell
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
```

</div>
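
Since vLLM also supports streaming output, you can ask for a streamed response in the same format. This is a minimal sketch, assuming the endpoint honors the standard `stream` parameter of the OpenAI completions API:

<div class="termy">

```shell
$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0,
        "stream": true
    }'
```

</div>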

For more details on how `vllm` works, check its [documentation](https://vllm.readthedocs.io/).

[Source code](https://github.com/dstackai/dstack-examples){ .md-button .md-button--github }

## Limitations

To use `vllm` with `dstack`, be aware of the following limitations:

1. The `vllm` library currently supports a [limited set](https://vllm.readthedocs.io/en/latest/models/supported_models.html) of LLMs, though Llama 2 is among them.
2. The `vllm` library doesn't yet support quantization. Track the progress [here](https://github.com/vllm-project/vllm/issues/316).