InternalError when running llava model #2966

Open
plufz opened this issue Oct 7, 2024 · 6 comments
Labels
question Question about the usage

Comments

plufz commented Oct 7, 2024

❓ InternalError when running llava model

I'm new to mlc-llm and I'm not sure whether this is a bug or something I'm doing incorrectly. So far I have not managed to run any model successfully: I have tried a tiny llama model and a llava-1.5 model. I converted the weights and compiled following the docs, and from what I could see in the output, those steps succeeded.

Environment

Python 3.11 (conda)
macOS 13.6.4
M1 Max, 64 GB

Convert and compile

mlc_llm convert_weight /path/huggingface/hub/models--llava-hf--llava-1.5-7b-hf/snapshots/a272c74b2481d8aff3aa6fc2c4bf891fe57334fb \
    --quantization q4f16_1 \
    -o mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC


mlc_llm gen_config /path/huggingface/hub/models--llava-hf--llava-1.5-7b-hf/snapshots/a272c74b2481d8aff3aa6fc2c4bf891fe57334fb \
    --quantization q4f16_1 --conv-template llava \
    -o mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC

mlc_llm compile ./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC/mlc-chat-config.json \
    --device metal -o ./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so

Running the model

from mlc_llm import MLCEngine

model = "./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC" 
model_lib = "./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so"
engine = MLCEngine(model=model, model_lib=model_lib)

for response in engine.chat.completions.create(
    messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<image>What is shown in this image?",
                    },
                ],
            }],
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()

Output

Also note that the process does not stop after producing this output; I need to Ctrl-C to kill it.

[2024-10-07 14:37:17] INFO auto_device.py:88: Not found device: cuda:0
[2024-10-07 14:37:18] INFO auto_device.py:88: Not found device: rocm:0
[2024-10-07 14:37:19] INFO auto_device.py:79: Found device: metal:0
[2024-10-07 14:37:19] INFO auto_device.py:88: Not found device: vulkan:0
[2024-10-07 14:37:20] INFO auto_device.py:88: Not found device: opencl:0
[2024-10-07 14:37:20] INFO auto_device.py:35: Using device: metal:0
[2024-10-07 14:37:20] INFO engine_base.py:143: Using library model: ./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so
[2024-10-07 14:37:20] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-10-07 14:37:20] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-10-07 14:37:20] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 4096, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 4096, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 4096, prefill chunk size is 4096.
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 7007.820 MB (Parameters: 3790.764 MB. KVCache: 2225.037 MB. Temporary buffer: 992.019 MB). The actual usage might be slightly larger than the estimated number.
[14:37:23] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/memory/pooled_allocator.h:65: Warning: PooledAllocator got InternalError during allocation: InternalError: Check failed: (buf != nil) is false: 
[14:37:23] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/memory/pooled_allocator.h:66: Warning: Trying to release all unused memory and reallocate...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/path/to/project/env/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/path/to/project/env/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/path/to/project/env/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/metal/metal_device_api.mm", line 194
InternalError: Check failed: (buf != nil) is false:
plufz added the question label Oct 7, 2024

plufz commented Oct 10, 2024

Any input on this problem would be very helpful, since I have not managed to solve it myself yet. To begin with, should I interpret this as an out-of-memory error or just a bug while allocating memory? I have lots of free RAM, but maybe I have missed some internal config limit in mlc-llm?

I have now also tried:

  • Downloading a llava model precompiled for MLC from Hugging Face; I get the same error with those models (this rules out a mistake on my part while compiling the model).
  • Loading the model via mlc_llm chat, which works: I can send a normal text prompt and it responds like a regular text LLM. But I don't know whether I can somehow make a multimodal request via mlc_llm chat for debugging purposes.
  • Making a JSON request with an image and a text prompt, which gives me the stack trace above, both via mlc_llm serve and via the Python script above (a sketch of the serve request follows this list).
  • Using the same example REST request from the llava PR ([Model] [Serve] Add support for LLaVa model in serving engine #1974), to be sure it wasn't some subtle bug caused by a slightly different request.
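
For reference, this is roughly the request I send to mlc_llm serve. It is only a sketch: it assumes the server is running on its default OpenAI-compatible endpoint at http://127.0.0.1:8000 and that the "model" field takes the same local path used above, so adjust those to your setup.

import requests

payload = {
    "model": "./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC",
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                },
            },
            {"type": "text", "text": "<image>What is shown in this image?"},
        ],
    }],
}

# Assumes mlc_llm serve is listening on the default host and port.
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(r.json())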

naseeks commented Oct 11, 2024

I ran into the same problem as you did!
I generated the compiled library for the same model and ran it on both Apple Metal and a CUDA machine; both cases hit the same memory-related error.
It's hard to debug when all you have is a compiled library, but when I swapped the image you linked above for a much smaller one (a 2 KB image), it worked...

But I'm not sure how to take it from here! I don't understand how one image can cause out-of-bounds memory issues like this!

plufz commented Oct 11, 2024

Okay, thanks a lot for your reply, at least I'm not alone.

But I'm not sure how to take it from here! I don't understand how one image can cause out-of-bounds memory issues like this!

Yeah, I also find it weird, and the Python process does not seem to allocate an insane amount of memory or anything; there is plenty left.

It would be really nice to get some kind of answer from a maintainer, just so I know whether I should move on to another backend or keep trying.

naseeks commented Oct 11, 2024

Hey plufz!
I think the issue might be that Llava expects an image with dimensions of at most 336 pixels (= image_size), so in your code, try resizing your image to 336x336, re-encoding it as a data URL, and passing that along... something like this:

import base64
from io import BytesIO

import requests
from PIL import Image

# Download the original image and resize it to the model's expected input size.
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img_resized = img.resize((336, 336))

# Re-encode the resized image as JPEG bytes.
img_byte_arr = BytesIO()
img_resized.save(img_byte_arr, format="JPEG")
img_byte_arr = img_byte_arr.getvalue()

# Wrap the bytes in a base64 data URL that can be passed as the image_url.
new_url = (
    f"data:image/jpeg;base64,{base64.b64encode(img_byte_arr).decode('utf-8')}"
)

336 is from the model config:
https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/config.json
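
Then in your script, pass new_url in place of the original image URL; something like this (just a sketch, assuming the engine and new_url variables from above):

# Same chat completion call as in the script above, but with the
# resized base64 data URL instead of the original image URL.
for response in engine.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": new_url}},
            {"type": "text", "text": "<image>What is shown in this image?"},
        ],
    }],
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)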

plufz commented Oct 13, 2024

I think the issue might be that Llava expects an image with dimensions of at most 336 pixels

Thanks a million, that was it.

Is this a bug? From what I can see, there is a function in the llava model implementation that scales images according to the image size value in the config; I just didn't dig into when that scaling function is actually called.

naseeks commented Oct 14, 2024

Yeah, we should keep this issue open until we get a response from one of the Llava maintainers! It seems like a bug that needs to be addressed/fixed.
