This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Update colab links
mgoin authored Mar 7, 2024
1 parent 8d617e5 commit 3ae527f
Showing 3 changed files with 3 additions and 3 deletions.
@@ -1,6 +1,6 @@
# Deploy Compressed LLMs from Hugging Face with nm-vllm

-[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuralmagic/nm-vllm/blob/main/examples-neuralmagic/deploy_compressed_huggingface_models/Deploy_Compressed_LLMs_from_Hugging_Face_with_nm_vllm.ipynb)
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://bit.ly/4a3K5Iw)


This notebook walks through how to deploy compressed models with nm-vllm's latest memory and performance optimizations.
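
A minimal deployment sketch, assuming `nm-vllm` is installed and exposes the standard vLLM Python API; the checkpoint name and the `sparsity` argument are illustrative assumptions, not values taken from the notebook:

```python
# Illustrative sketch only: the checkpoint name and the `sparsity` argument
# are assumptions, not values from the notebook.
from vllm import LLM, SamplingParams

# Load a compressed checkpoint from the Hugging Face Hub (hypothetical ID).
llm = LLM(
    model="neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50",
    sparsity="sparse_w16a16",  # assumed nm-vllm flag for sparse inference
)

params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What are the benefits of compressed LLMs?"], params)
print(outputs[0].outputs[0].text)
```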
@@ -1,6 +1,6 @@
# Performantly Quantize LLMs to 4-bits with Marlin and nm-vllm

-[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuralmagic/nm-vllm/blob/main/examples-neuralmagic/marlin_quantization_and_deploy/Performantly_Quantize_LLMs_to_4_bits_with_Marlin_and_nm_vllm.ipynb)
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://bit.ly/3uY6NTx)

This notebook walks through how to compress a pretrained LLM and deploy it with `nm-vllm`. To create a new 4-bit quantized model, we can leverage AutoGPTQ. Quantizing reduces the model's weight precision from FP16 to INT4, which cuts the file size by ~70%. The main benefits are lower latency and memory usage.

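As a rough sketch of that flow, the snippet below uses AutoGPTQ with settings generally required by the Marlin kernel (4 bits, symmetric, group size 128, no activation reordering); the base model ID and the tiny calibration set are placeholder assumptions:

```python
# Sketch of 4-bit GPTQ quantization with Marlin-compatible settings.
# Model ID and calibration text are placeholder assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # INT4 weights
    group_size=128,  # group size Marlin expects
    desc_act=False,  # Marlin does not support activation reordering
    sym=True,        # symmetric quantization, required by Marlin
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# A real run would use a few hundred calibration samples; one is shown here.
examples = [tokenizer("nm-vllm serves compressed LLMs efficiently.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("mistral-7b-instruct-gptq-4bit-marlin")
```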
@@ -1,6 +1,6 @@
# Apply SparseGPT to LLMs and deploy with nm-vllm

-[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuralmagic/nm-vllm/blob/main/examples-neuralmagic/sparsegpt_compress_and_deploy/Apply_SparseGPT_to_LLMs_and_deploy_with_nm_vllm.ipynb)
+[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://bit.ly/4c5jT1S)


This notebook walks through how to sparsify a pretrained LLM and deploy it with `nm-vllm`. To create a pruned model, you can leverage SparseGPT, which removes a large share of the model's weights in a single one-shot pass. The main benefits are lower latency and memory usage.
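
SparseGPT is a one-shot pruning method that uses second-order information. As a simplified stand-in that only illustrates what unstructured weight sparsity looks like (it is not SparseGPT), here is magnitude pruning of a model's linear layers with `torch.nn.utils.prune`; the model ID and the 50% sparsity level are assumptions:

```python
# Simplified illustration of unstructured sparsity via magnitude pruning.
# This is NOT SparseGPT; the model ID and 50% sparsity level are assumptions.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # small assumed model

for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)  # zero the smallest 50%
        prune.remove(module, "weight")  # bake the pruning mask into the weights

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"Overall zero fraction: {zeros / total:.2%}")
```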