[EAGLE-4773] Nvidia NIM dockerfile #444

Open: wants to merge 8 commits into master

@luv-bansal luv-bansal commented Nov 13, 2024

What

  • Add an Nvidia NIM Dockerfile to integrate NIM

Why

How

Tests

Notes

  • The main part left is authenticating with NGC:
    echo "$API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin

Alan mentioned we can authenticate inside a bash script, similar to how we authenticate to ECR for buildkit: https://github.com/Clarifai/models-images/blob/main/buildkit/main.bash#L77-L78
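A minimal sketch of what that in-script login could look like. It only wraps the `docker login` command quoted in the notes above; the function name `ngc_login` and the variable name `NGC_API_KEY` are assumptions made for illustration, and this is not the buildkit script's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: log in to nvcr.io before building the NIM image.
# NGC_API_KEY is an assumed variable name (the PR notes call it API_KEY).

ngc_login() {
  local api_key="${NGC_API_KEY:-}"
  if [ -z "$api_key" ]; then
    echo "NGC_API_KEY is not set" >&2
    return 1
  fi
  # The username is the literal string $oauthtoken, so it stays single-quoted.
  # The key is passed on stdin so it does not show up in the process list.
  printf '%s' "$api_key" | docker login nvcr.io -u '$oauthtoken' --password-stdin
}
```

A build script would call `ngc_login` once before `docker build`, and fail early if the key is missing.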

@luv-bansal luv-bansal requested a review from wemoveon2 November 13, 2024 08:57
@luv-bansal luv-bansal changed the title from "Nvidia NIM dockerfile" to "[EAGLE-4773] Nvidia NIM dockerfile" Nov 13, 2024
@@ -0,0 +1,71 @@
FROM nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 as build

FROM gcr.io/distroless/python3-debian12:debug

Contributor

Don't use the debug variant for the final image.

Contributor Author
@luv-bansal commented Nov 19, 2024

I'm getting an error when trying to run the model on the non-root image gcr.io/distroless/python3-debian12:nonroot-8701094b7fe8ff30d0777bbdfcc9a65caff6f40b:

INFO 11-19 13:21:02.337 ngc_injector.py:218] Preparing model workspace. This step might download additional files to run the model.
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:117] One or more errors fetching files:
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR  nim_sdk::hub::repo  rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 99, in <module>
    main()
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 42, in main
    inference_env = prepare_environment()
  File "/opt/nim/llm/vllm_nvext/entrypoints/args.py", line 143, in prepare_environment
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/opt/nim/llm/vllm_nvext/hub/ngc_injector.py", line 220, in inject_ngc_hub
    cached = repo.get_all()
Exception: I/O error Permission denied (os error 13)

Contributor

You don't need nonroot, just use the normal one.

Contributor Author

OK, will try it.
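The `Permission denied (os error 13)` lines in the log above come from the step that fetches model files, which suggests the non-root user cannot write to the directory NIM downloads into. A rough way to confirm that from inside the container is sketched below. It uses only shell builtins, since the image carries little beyond the copied-in `bash`. The function name `check_writable`, the use of `NIM_CACHE_PATH`, and the `/opt/nim/.cache` fallback are all assumptions here, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic: report whether the current user can write to a directory.

check_writable() {
  local dir="$1"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "not writable (or missing): $dir" >&2
    return 1
  fi
}

# Example call (the cache location is an assumption):
#   check_writable "${NIM_CACHE_PATH:-/opt/nim/.cache}"
```

If the check fails under the non-root user but passes as root, the directory's ownership or mode is the likely cause rather than the NIM code itself.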

Comment on lines 6 to 7
COPY --from=build /bin/bash /bin/bash
COPY --from=build /bin/sh /bin/sh

Contributor

Do we need both? Not using the restricted shell?

Contributor Author

There was an issue with the restricted shell, which is why we need /bin/bash.

Comment on lines 10 to 11
COPY --from=build /opt/nim/llm/.venv/bin/python3.10 /usr/bin/python3
COPY --from=build /opt/nim/llm/.venv/bin/python3.10 /usr/local/bin/python3.10

Contributor

We don't need these if we're copying in the entire /opt directory (line 26).

Contributor Author

Ah, I think this is needed because we are overwriting the distroless image's python3 binary with the one from the NIM image.

ENV PYTHONPATH=${PYTHONPATH}:/opt/nim/llm/.venv/lib/python3.10/site-packages:/opt/nim/llm
ENV PATH="/opt/nim/llm/.venv/bin:/opt/hpcx/ucc/bin:/opt/hpcx/ucx/bin:/opt/hpcx/ompi/bin:$PATH"

ENV LD_LIBRARY_PATH="/opt/hpcx/ucc/lib/ucc:/opt/hpcx/ucc/lib:/opt/hpcx/ucx/lib/ucx:/opt/hpcx/ucx/lib:/opt/hpcx/ompi/lib:/opt/hpcx/ompi/lib/openmpi:/opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/libs:/opt/nim/llm/.venv/lib/python3.10/site-packages/nvidia/cublas/lib:/opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_libs:/opt/nim/llm/.venv/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH"

Contributor

Are all of these needed?

Contributor Author

Not sure about all of them, but I think we need to include the ones under /opt/.

Contributor Author

By the way, I looked at all the environment variables in the NIM image and then included them in this image.
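One way to answer "are all of these needed?" is to check which entries of `PATH` or `LD_LIBRARY_PATH` actually exist in the final image. The sketch below is a rough audit under that assumption: a missing directory is safe to drop, while a present one still needs a manual look. The function name `audit_path_list` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical audit: print each entry of a colon-separated path list,
# marked as present or missing in the current filesystem.

audit_path_list() {
  local list="$1" entry
  local IFS=':'          # split the list on colons for the loop below
  for entry in $list; do
    [ -n "$entry" ] || continue   # skip empty entries such as "a::b"
    if [ -d "$entry" ]; then
      echo "present $entry"
    else
      echo "missing $entry"
    fi
  done
}

# Example call inside the built image:
#   audit_path_list "$LD_LIBRARY_PATH"
```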

Comment on lines 67 to 69
# Not sure about the `ldconfig` command below: previously CUDA libraries weren't found without running `ldconfig`, but now it seems to be working
# Run ldconfig in the build stage to update the library cache else CUDA libraries won't be found
RUN ldconfig -v

Contributor

Shouldn't need this.

Contributor Author

removed it
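For context on why `ldconfig` may be unnecessary: the dynamic loader searches the directories in `LD_LIBRARY_PATH` before it consults the cache that `ldconfig` builds, so a library in one of those directories is found either way. The sketch below checks whether a given library file sits in any `LD_LIBRARY_PATH` entry. The function name `find_lib` and the example library name are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical check: print the first LD_LIBRARY_PATH directory that
# contains the named library file, or return non-zero if none does.

find_lib() {
  local name="$1" list="${2:-${LD_LIBRARY_PATH:-}}" dir
  local IFS=':'          # split the path list on colons
  for dir in $list; do
    if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
      echo "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Example call (the library file name is an assumption):
#   find_lib libcublas.so.12
```

The optional second argument lets the same function check an arbitrary path list instead of the current environment.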

Contributor

Test again without using debug to confirm everything works.

Contributor Author

Refer to this: #444 (comment)

@luv-bansal luv-bansal marked this pull request as ready for review November 20, 2024 09:51
@luv-bansal luv-bansal requested a review from wemoveon2 November 20, 2024 09:51
Contributor

Can we separate this into a base image and this template? It feels overloaded.

Contributor Author

Yeah, this looks overloaded to me too. I tried to separate it into a base image, but there is a separate NIM image for every model, so I'm not sure we can do that.
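One possible middle ground, sketched here as an assumption rather than something tried in this PR: keep a single Dockerfile and pass the per-model NIM image as a build argument. This presumes the Dockerfile declares `ARG NIM_IMAGE` before its first `FROM` and uses `FROM ${NIM_IMAGE} AS build`. The names `NIM_IMAGE` and `build_nim_image` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build the same Dockerfile against a given NIM image.
# Assumes the Dockerfile starts with `ARG NIM_IMAGE` and `FROM ${NIM_IMAGE} AS build`.

build_nim_image() {
  local nim_image="$1" tag="$2"
  if [ -z "$nim_image" ] || [ -z "$tag" ]; then
    echo "usage: build_nim_image <nim-image> <tag>" >&2
    return 1
  fi
  docker build --build-arg "NIM_IMAGE=${nim_image}" -t "$tag" .
}

# Example call, using the image from this PR's Dockerfile:
#   build_nim_image nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 nim-llama:dev
```

This does not give a shared base image, but it would avoid maintaining one Dockerfile per model.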

2 participants