[EAGLE-4773] Nvidia NIM dockerfile #444
base: master
@@ -0,0 +1,71 @@
FROM nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 as build

FROM gcr.io/distroless/python3-debian12:debug
Don't use the debug image for the final stage.
I'm getting an error when trying to run the model as non-root with the image gcr.io/distroless/python3-debian12:nonroot-8701094b7fe8ff30d0777bbdfcc9a65caff6f40b:
INFO 11-19 13:21:02.337 ngc_injector.py:218] Preparing model workspace. This step might download additional files to run the model.
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:117] One or more errors fetching files:
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
[11-19 13:21:06.914 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] I/O error Permission denied (os error 13)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 99, in <module>
    main()
  File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 42, in main
    inference_env = prepare_environment()
  File "/opt/nim/llm/vllm_nvext/entrypoints/args.py", line 143, in prepare_environment
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/opt/nim/llm/vllm_nvext/hub/ngc_injector.py", line 220, in inject_ngc_hub
    cached = repo.get_all()
Exception: I/O error Permission denied (os error 13)
You don't need nonroot; just use the normal one.
OK, will try it.
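For reference, the `Permission denied (os error 13)` above is consistent with the non-root user being unable to write into the directory where the model files are downloaded. A minimal sketch of one way to keep a non-root final image, assuming the download location is `/opt/nim/.cache` (the documented default of `NIM_CACHE_PATH`, not confirmed for this image in the thread) and relying on distroless `nonroot` images running as UID/GID 65532:

```dockerfile
# Sketch only; this is not the Dockerfile from this PR.
FROM nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2 AS build

FROM gcr.io/distroless/python3-debian12:nonroot
# Copy the NIM tree and hand ownership to the distroless nonroot user
# (UID/GID 65532) so the process can write under it at runtime.
COPY --from=build --chown=65532:65532 /opt/nim /opt/nim
# Assumption: model files are cached here; adjust if the image uses another path.
ENV NIM_CACHE_PATH=/opt/nim/.cache
USER 65532:65532
```

If the cache path is mounted as a volume at runtime, the mounted directory would need to be writable by the same UID as well.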
COPY --from=build /bin/bash /bin/bash
COPY --from=build /bin/sh /bin/sh
Do we need both? Are we not using the restricted shell?
There was some issue with the restricted shell; that's why we need to use /bin/bash.
COPY --from=build /opt/nim/llm/.venv/bin/python3.10 /usr/bin/python3
COPY --from=build /opt/nim/llm/.venv/bin/python3.10 /usr/local/bin/python3.10
We don't need these if we're copying in the entire /opt directory (L:26).
Ah, I think this is needed because we are overwriting the distroless image's python3 binary with the one from the NIM image.
ENV PYTHONPATH=${PYTHONPATH}:/opt/nim/llm/.venv/lib/python3.10/site-packages:/opt/nim/llm
ENV PATH="/opt/nim/llm/.venv/bin:/opt/hpcx/ucc/bin:/opt/hpcx/ucx/bin:/opt/hpcx/ompi/bin:$PATH"

ENV LD_LIBRARY_PATH="/opt/hpcx/ucc/lib/ucc:/opt/hpcx/ucc/lib:/opt/hpcx/ucx/lib/ucx:/opt/hpcx/ucx/lib:/opt/hpcx/ompi/lib:/opt/hpcx/ompi/lib/openmpi:/opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_llm/libs:/opt/nim/llm/.venv/lib/python3.10/site-packages/nvidia/cublas/lib:/opt/nim/llm/.venv/lib/python3.10/site-packages/tensorrt_libs:/opt/nim/llm/.venv/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH"
Are all of these needed?
Not sure about all of these, but I think we'll need to include the ones under /opt/.
Btw, I looked at all the environment variables in the NIM image and included them in this image.
# Not sure whether the `ldconfig` command below is still needed: previously the CUDA libraries weren't found without running `ldconfig`, but now it seems to work
# Run ldconfig in the build stage to update the library cache else CUDA libraries won't be found
RUN ldconfig -v
Shouldn't need this.
Removed it.
Test again without the debug image to confirm everything works.
Refer to this: #444 (comment)
Can we separate this into a base image and this template? It feels overloaded.
Yeah, this looks overloaded to me too. I tried to separate it into a base image, but the issue is that there is a separate NIM image for every model, so I'm not sure we can do that.
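One possible way around the per-model NIM image, sketched under the assumption that the model images share the same `/opt` layout (not verified in this thread): pass the NIM image as a build argument so a single Dockerfile serves every model. `NIM_IMAGE` is a hypothetical argument name; its default is the image used in this PR.

```dockerfile
# Sketch only; NIM_IMAGE is a hypothetical build argument.
# An ARG declared before the first FROM can be used in FROM lines.
ARG NIM_IMAGE=nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.2
FROM ${NIM_IMAGE} AS build

FROM gcr.io/distroless/python3-debian12
# Assumption: everything the runtime needs from the NIM image lives under /opt.
COPY --from=build /opt /opt
```

A different model would then be selected at build time with `docker build --build-arg NIM_IMAGE=<another NIM image> .`, leaving the template itself unchanged.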
What
Why
How
Tests
Notes
echo "$API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
Alan mentioned we can authenticate inside a bash script. We can do something similar to how we authenticate ECR for buildkit: https://github.com/Clarifai/models-images/blob/main/buildkit/main.bash#L77-L78