Guardrail Loading Failed with Unexpected Large GPU Memory Requirement at Multi-GPU Server #328

Open

dawenxi-007 opened this issue Oct 25, 2024

System Info

Python version: 3.10.12
PyTorch version:
llama_models version: 0.0.42
llama_stack version: 0.0.42
llama_stack_client version: 0.0.41
Hardware: 4xA100 (40GB VRAM/GPU)

The local-gpu-run.yaml file content is as follows:

version: '2'
built_at: '2024-10-11T00:06:23.964162'
image_name: local-gpu
docker_image: local-gpu
conda_env: null
apis:
- safety
- memory
- inference
- models
- agents
- memory_banks
- shields
providers:
  inference:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      model: Llama3.1-8B-Instruct
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  - provider_id: meta1
    provider_type: meta-reference
    config:
      model: Llama-Guard-3-1B
      quantization: null
      torch_seed: null
      max_seq_len: 4096
      max_batch_size: 1
  safety:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
      enable_prompt_guard: true
  memory:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: /home/dell/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta-reference
    provider_type: meta-reference
    config: {}

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Trying to load the models and initialize the host with the command:
docker run --gpus=all -it -p 5000:5000 -v ~/.llama/builds/docker/local-gpu-run.yaml:/app/config.yaml -v ~/.llama:/root/.llama llamastack/distribution-meta-reference-gpu python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000

Error logs

The log file shows the following:

Loading model `Llama3.1-8B-Instruct`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 8.79 seconds
Loaded model...
Loading model `Llama-Guard-3-1B`
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/conda/lib/python3.10/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025847130/work/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Loaded in 4.46 seconds
Loaded model...
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /root/.llama/checkpoints/Prompt-Guard-86M and are newly initialized:  ...
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 343, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 279, in main
    impls = asyncio.run(resolve_impls(config))
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 181, in resolve_impls
    impl = await instantiate_provider(
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/distribution/resolver.py", line 268, in instantiate_provider
    impl = await fn(*args)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/__init__.py", line 16, in get_provider_impl
    await impl.initialize()
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/safety.py", line 40, in initialize
    _ = PromptGuardShield.instance(model_dir)
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 37, in instance
    PromptGuardShield._instances[key] = PromptGuardShield(
  File "/opt/conda/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/safety/prompt_guard.py", line 66, in __init__
    model = AutoModelForSequenceClassification.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4091, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/big_modeling.py", line 494, in dispatch_model
    model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2958, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 39.38 GiB of which 144.25 MiB is free. Process 878221 has 16.76 GiB memory in use. Process 878311 has 3.98 GiB memory in use. Process 878077 has 18.48 GiB memory in use. Of the allocated memory 18.08 GiB is allocated by PyTorch, and 1.27 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
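
The end of the traceback suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. As a minimal sketch (not a verified fix), that setting can be passed into the container with Docker's -e flag; everything else is the same launch command as above:

# Same launch command as before; only the -e flag carrying the allocator
# hint from the traceback is new. This may reduce fragmentation but will
# not help if the models genuinely need more memory than is free on GPU 0.
docker run --gpus=all -it -p 5000:5000 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v ~/.llama/builds/docker/local-gpu-run.yaml:/app/config.yaml \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000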

Expected behavior

Expect the models to be loaded successfully into GPU VRAM with the correct memory consumption. Note that the same configuration does not produce errors on a 1xH100 machine.
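
Since the same configuration loads cleanly on a single-GPU machine, one workaround worth trying (a sketch only, untested on this 4xA100 host) is to restrict the container to a single GPU, so that model placement mirrors the 1xH100 setup:

# Hypothetical workaround: expose only one A100 to the container via Docker's
# device selection, so placement mirrors the single-GPU machine where the
# same config loads without error. Untested; provided as a sketch.
docker run --gpus '"device=0"' -it -p 5000:5000 \
  -v ~/.llama/builds/docker/local-gpu-run.yaml:/app/config.yaml \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-meta-reference-gpu \
  python -m llama_stack.distribution.server.server --yaml_config /app/config.yaml --port 5000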
