[BUG] CUDA out of memory. Tried to allocate ... #750

Closed
2 tasks done
unclemusclez opened this issue Sep 3, 2024 · 2 comments
Labels
bug Something isn't working

Comments


unclemusclez commented Sep 3, 2024

Prerequisites

  • I have read the documentation.
  • I have checked other issues for similar problems.

Backend

Other cloud providers

Interface Used

CLI

CLI Command

I thought this issue was on my end originally with AMD, but this is a cloud NVIDIA L40/A10G from LightningAI.

see:
ROCm/flash-attention#79 (comment)
pytorch/pytorch#134208 (comment)

In all instances, I can run the same training on CPU. Maybe this is specific to PyTorch? I've tried this on a 7900 XT (20 GB VRAM), an L40 (40 GB VRAM), and my personal 7800 XT machine with 64 GB DDR5, and on pretty much any cloud instance it works on the CPU; it's just slow and not worth it.

UI Screenshots & Parameters

task: sentence-transformers:pair_score
base_model: HuggingFaceTB/SmolLM-360M-Instruct
project_name: SmolLM-360M-Instruct-DEVINator-1-1-69-GPU-2
backend: local
log: tensorboard

data:
  path: skratos115/opendevin_DataDevinator
  train_split: train
  valid_split: null
  column_mapping:
    sentence1_column: prompt
    sentence2_column: solution
    target_column: grade

params:
  max_seq_length: 8192
  epochs: 1
  batch_size: 64
  lr: 1e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16
  seed: 69
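
For context, a rough back-of-envelope estimate (a sketch, not AutoTrain internals; the SmolLM-360M dimensions below are approximate assumptions) suggests why batch_size: 64 combined with max_seq_length: 8192 can exhaust ~44 GiB of VRAM even for a 360M-parameter model: activation memory grows with batch_size × seq_length × hidden_size × number of layers.

# Back-of-envelope activation estimate -- a sketch, not AutoTrain internals.
# Model dimensions are assumptions for SmolLM-360M (roughly 32 layers, hidden size ~960).
batch_size = 64
seq_len = 8192
hidden = 960           # assumed hidden size
layers = 32            # assumed layer count
bytes_fp16 = 2
tensors_per_layer = 8  # rough count of activations saved per decoder layer for backward

per_layer_bytes = batch_size * seq_len * hidden * bytes_fp16 * tensors_per_layer
total_gib = per_layer_bytes * layers / 2**30
print(f"~{total_gib:.0f} GiB of activations")  # on the order of 240 GiB -- far beyond 44 GiB

Even if the per-layer constant is off by a factor of a few, the total is far beyond the 44.53 GiB the GPU reports in the log below.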

Error Logs

ERROR    | 2024-09-03 01:49:03 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper
    return func(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/autotrain/trainers/sent_transformers/__main__.py", line 213, in train
    trainer.train()
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/trainer.py", line 3318, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/sentence_transformers/trainer.py", line 329, in compute_loss
    loss = loss_fn(features, labels)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in forward
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in <listcomp>
    embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 118, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1001, in forward
    layer_outputs = decoder_layer(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 378.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 321.25 MiB is free. Process 45714 has 44.21 GiB memory in use. Of the allocated memory 43.58 GiB is allocated by PyTorch, and 135.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

ERROR    | 2024-09-03 01:49:03 | autotrain.trainers.common:wrapper:121 - CUDA out of memory. Tried to allocate 378.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 321.25 MiB is free. Process 45714 has 44.21 GiB memory in use. Of the allocated memory 43.58 GiB is allocated by PyTorch, and 135.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
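
As an aside, the allocator hint at the end of the message can be tried directly; a minimal sketch, assuming the variable is set before PyTorch initializes its CUDA allocator (for example, before torch is imported in the training process, or exported in the shell that launches autotrain):

import os

# Sketch: enable the expandable-segments allocator mode suggested in the OOM message.
# Must be set before the CUDA caching allocator is initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402 -- imported after setting the variable so it takes effect

Note that the log shows only ~136 MiB reserved but unallocated, so fragmentation is unlikely to be the real problem here; the allocation itself is simply too large for the card.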

Additional Information

No response

unclemusclez added the bug label Sep 3, 2024

unclemusclez commented Sep 15, 2024

I have had repeatable success since, so this was probably something I was doing wrong.

Most likely I was using too large a context window for the amount of VRAM I have. I was under the impression it would use both VRAM and system memory, but this doesn't seem to be the case.

I will investigate; it may be that my virtual memory is split between my embedded GPU and discrete GPU. If I run on the CPU alone, I can manage larger models, and anecdotally I have been told that I can train larger models on less VRAM, but those training methods may be different. A lot of people use LLaMA-Factory or mergekit, and I am working specifically with AutoTrain and Sentence Transformers.

I appreciate the help. You guys know where to find me.
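
For readers hitting the same wall: per-step activation memory scales with the per-device batch size times the sequence length, so a common mitigation (a sketch under the assumption that these parameters are passed through to the underlying trainer; the reduced values are only illustrative) is to shrink batch_size and/or max_seq_length and recover the effective batch size through gradient_accumulation:

# Sketch: trading per-step memory for gradient-accumulation steps.
# The optimizer sees the same effective batch size, but each forward/backward
# pass holds activations for far fewer tokens.
original = {"batch_size": 64, "gradient_accumulation": 1, "max_seq_length": 8192}
reduced  = {"batch_size": 8,  "gradient_accumulation": 8, "max_seq_length": 2048}

for name, cfg in (("original", original), ("reduced", reduced)):
    effective_batch = cfg["batch_size"] * cfg["gradient_accumulation"]
    tokens_per_step = cfg["batch_size"] * cfg["max_seq_length"]
    print(f"{name}: effective batch {effective_batch}, tokens per step {tokens_per_step}")

# original: effective batch 64, tokens per step 524288
# reduced:  effective batch 64, tokens per step 16384 (32x fewer)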

unclemusclez (Author) commented

<3
