Llama 3 8B does not fit in a 4090 GPU #27168
Replies: 2 comments 1 reply
-
Hey, @aurotripathy! I'm Dosu, a bot here to help you with bug fixes, questions, and contributions. Let's tackle this issue together! To resolve the CUDA out-of-memory error, you can try freeing GPU memory before loading the model.
Here's a code snippet to help manage memory:

```python
import gc
import torch

torch.cuda.empty_cache()   # release cached blocks PyTorch no longer needs
gc.collect()               # run Python garbage collection
!nvidia-smi                # notebook shell escape: check overall GPU usage
```

These steps can help manage memory usage and potentially resolve the OutOfMemoryError.
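One caveat worth noting: `torch.cuda.empty_cache()` only returns memory that PyTorch has cached and is no longer referencing, so any model or tensor still bound to a Python name must be dropped first. A minimal sketch, where `model` is a placeholder for whatever object is holding GPU memory:

```python
import gc
import torch

del model                    # placeholder: drop references to the GPU-resident objects
gc.collect()                 # reclaim the Python objects
torch.cuda.empty_cache()     # hand the now-unreferenced cached blocks back to the driver

# Confirm what PyTorch still holds on GPU 0.
print(torch.cuda.memory_allocated(0), "bytes allocated")
print(torch.cuda.memory_reserved(0), "bytes reserved")
```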
-
First of all, I'd suggest checking the other processes using the GPU with nvidia-smi.
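A quick way to do that check from the same notebook is sketched below (assumes a recent PyTorch that provides `torch.cuda.mem_get_info` and that `nvidia-smi` is on the PATH):

```python
import subprocess
import torch

# Free vs. total device memory as reported by the CUDA driver (in bytes).
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")

# List every process currently holding memory on the GPUs.
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True,
).stdout)
```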
-
Example Code
The Llama 3.1 8B FP16 model should fit in a 24 GB 4090 (8B params × 2 bytes ≈ 16 GB for the weights, plus activations and KV cache), but it does not; kindly see the error message below.
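For reference, a back-of-the-envelope version of that estimate (a sketch; the 8.03 B parameter count and the 32-layer / 8 KV-head / 128 head-dim figures are the published Llama 3.1 8B config, and the 8192-token context is just an example):

```python
# Rough fp16 memory estimate for Llama 3.1 8B.
params = 8.03e9
bytes_per_param = 2                       # fp16 / bf16
weights_gib = params * bytes_per_param / 2**30

layers, kv_heads, head_dim = 32, 8, 128   # GQA config of Llama 3.1 8B
ctx = 8192                                # example context length
kv_cache_gib = 2 * layers * kv_heads * head_dim * bytes_per_param * ctx / 2**30

print(f"weights : {weights_gib:.1f} GiB")   # ~15.0 GiB
print(f"KV cache: {kv_cache_gib:.1f} GiB")  # ~1.0 GiB at 8k context
```

So fp16 weights plus a modest KV cache should indeed leave headroom on a 24 GB card, which is why the error below is surprising.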
Description
OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB. GPU 0 has a total capacity of 23.55 GiB of which 16.69 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 23.15 GiB is allocated by PyTorch, and 1.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
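Two things stand out in that trace. The 23.15 GiB allocated by PyTorch is roughly what fp32 weights would reach partway through loading (8 B params × 4 bytes ≈ 32 GB), so one possible cause is that the model is being loaded in fp32 rather than fp16; and the message itself suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. A sketch combining both, assuming the model is loaded through Hugging Face `transformers` (with `accelerate` installed) and that `meta-llama/Meta-Llama-3.1-8B-Instruct` is the checkpoint in use:

```python
import os

# Set before the first CUDA allocation so the caching allocator picks it up
# (this is the setting the error message recommends to reduce fragmentation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # explicit half precision; fp32 needs ~32 GB for the weights alone
    device_map="cuda:0",        # requires accelerate
)
```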
System Info
langchain==0.3.2
langchain-community==0.3.1
langchain-core==0.3.9
langchain-text-splitters==0.3.0
Python 3.10.12
container: nvcr.io/nvidia/pytorch_24.07-py3/jupyter