InternalError when running llava model #2966

Open
plufz opened this issue Oct 7, 2024 · 6 comments
Labels
question Question about the usage

Comments

plufz commented Oct 7, 2024

❓ InternalError when running llava model

I'm new to mlc-llm and I'm not sure whether this is a bug or something I'm doing incorrectly. So far I have not managed to run any model successfully: I have tried a tiny llama model and a llava-1.5 model. I converted the weights and compiled following the docs, and from what I could see in the output, those steps succeeded.

Environment

Python 3.11 (conda)
macOS 13.6.4
M1 Max, 64 GB

Convert and compile

mlc_llm convert_weight /path/huggingface/hub/models--llava-hf--llava-1.5-7b-hf/snapshots/a272c74b2481d8aff3aa6fc2c4bf891fe57334fb \
    --quantization q4f16_1 \
    -o mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC


mlc_llm gen_config /path/huggingface/hub/models--llava-hf--llava-1.5-7b-hf/snapshots/a272c74b2481d8aff3aa6fc2c4bf891fe57334fb \
    --quantization q4f16_1 --conv-template llava \
    -o mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC

mlc_llm compile ./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC/mlc-chat-config.json \
    --device metal -o ./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so

Running the model

from mlc_llm import MLCEngine

model = "./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC" 
model_lib = "./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so"
engine = MLCEngine(model=model, model_lib=model_lib)

for response in engine.chat.completions.create(
    messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "<image>What is shown in this image?",
                    },
                ],
            }],
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()

Output

Also note that the process does not stop after producing this output; I need to Ctrl-C to kill it.

[2024-10-07 14:37:17] INFO auto_device.py:88: Not found device: cuda:0
[2024-10-07 14:37:18] INFO auto_device.py:88: Not found device: rocm:0
[2024-10-07 14:37:19] INFO auto_device.py:79: Found device: metal:0
[2024-10-07 14:37:19] INFO auto_device.py:88: Not found device: vulkan:0
[2024-10-07 14:37:20] INFO auto_device.py:88: Not found device: opencl:0
[2024-10-07 14:37:20] INFO auto_device.py:35: Using device: metal:0
[2024-10-07 14:37:20] INFO engine_base.py:143: Using library model: ./mlc-models/libs/models--llava-hf--llava-1.5-7b-hf-q4f16_1-metal.so
[2024-10-07 14:37:20] INFO engine_base.py:180: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-10-07 14:37:20] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-10-07 14:37:20] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 4096, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 4096, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:688: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 4096. 
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:769: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 4096, prefill chunk size is 4096.
[14:37:20] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:774: Estimated total single GPU memory usage: 7007.820 MB (Parameters: 3790.764 MB. KVCache: 2225.037 MB. Temporary buffer: 992.019 MB). The actual usage might be slightly larger than the estimated number.
[14:37:23] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/memory/pooled_allocator.h:65: Warning: PooledAllocator got InternalError during allocation: InternalError: Check failed: (buf != nil) is false: 
[14:37:23] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/memory/pooled_allocator.h:66: Warning: Trying to release all unused memory and reallocate...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/path/to/project/env/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/path/to/project/env/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/path/to/project/env/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  File "/Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/tvm/src/runtime/metal/metal_device_api.mm", line 194
InternalError: Check failed: (buf != nil) is false:
plufz added the question label Oct 7, 2024

plufz commented Oct 10, 2024

Any input on this problem would be very helpful, since I have not managed to solve it myself yet. To begin with, should I interpret this as an out-of-memory error or just a bug while allocating memory? I have lots of free RAM, but maybe I have missed some internal config limit in mlc-llm?

I have now also tried:

  • Downloading a llava model precompiled for MLC from Hugging Face; I get the same error with those models (this rules out a mistake on my part while compiling the model).
  • Loading the model via mlc_llm chat, which works: I can send a normal text prompt and it responds like a regular text LLM. But I don't know whether I can somehow make a multimodal request via mlc_llm chat for debugging purposes.
  • Making a JSON request with an image and a text prompt, which gives me the stack trace above, both via mlc_llm serve and via the Python script above (a sketch of the serve request follows this list).
  • Using the same example REST request from the llava PR ([Model] [Serve] Add support for LLaVa model in serving engine #1974), to be sure it wasn't some subtle bug caused by a slightly different request.
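
For reference, this is roughly the request I send to mlc_llm serve. It is only a sketch: it assumes the server is running on its default OpenAI-compatible endpoint at http://127.0.0.1:8000 and that the "model" field takes the same local path used above, so adjust those to your setup.

import requests

payload = {
    "model": "./mlc-models/models--llava-hf--llava-1.5-7b-hf-MLC",
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                },
            },
            {"type": "text", "text": "<image>What is shown in this image?"},
        ],
    }],
}

# Assumes mlc_llm serve is listening on the default host and port.
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(r.json())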

naseeks commented Oct 11, 2024

I ran into the same problem as you did!
I generated the compiled library for the same model and ran it on both Apple Metal and a CUDA machine; both cases hit the same memory-related error.
It's hard to debug when all you have is a compiled library, but when I swapped the image you linked above for a much smaller one (a 2 KB image), it worked...

But I'm not sure how to take it from here! I don't understand how one image can cause out-of-bounds memory issues like this!

plufz commented Oct 11, 2024

Okay, thanks a lot for your reply, at least I'm not alone.

But I'm not sure how to take it from here! I don't understand how one image can cause out-of-bounds memory issues like this!

Yeah, I also find it weird, and the Python process does not seem to allocate an insane amount of memory or anything; there is plenty left.

It would be really nice to get some kind of answer from a maintainer, just so I know whether I should move on to another backend or keep trying.

naseeks commented Oct 11, 2024

Hey plufz!
I think the issue might be that Llava expects an image with dimensions of at most 336 pixels (= image_size), so in your code, try resizing your image to 336x336, re-encoding it as a data URL, and passing that along... something like this:

import base64
from io import BytesIO

import requests
from PIL import Image

# Download the original image and resize it to the model's expected input size.
response = requests.get(url)
img = Image.open(BytesIO(response.content))
img_resized = img.resize((336, 336))

# Re-encode the resized image as JPEG bytes.
img_byte_arr = BytesIO()
img_resized.save(img_byte_arr, format="JPEG")
img_byte_arr = img_byte_arr.getvalue()

# Wrap the bytes in a base64 data URL that can be passed as the image_url.
new_url = (
    f"data:image/jpeg;base64,{base64.b64encode(img_byte_arr).decode('utf-8')}"
)

336 is from the model config:
https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/config.json
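
Then in your script, pass new_url in place of the original image URL; something like this (just a sketch, assuming the engine and new_url variables from above):

# Same chat completion call as in the script above, but with the
# resized base64 data URL instead of the original image URL.
for response in engine.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": new_url}},
            {"type": "text", "text": "<image>What is shown in this image?"},
        ],
    }],
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)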

plufz commented Oct 13, 2024

I think the issue might be that Llava expects an image with dimensions of at most 336 pixels

Thanks a million, that was it.

Is this a bug? From what I can see, there is a function in the llava model implementation that scales images according to the image size value in the config; I just didn't dig into when that scaling function is actually called.

naseeks commented Oct 14, 2024

Yeah, we should keep this issue open until we get a response from one of the Llava maintainers! It seems like a bug that needs to be addressed/fixed.
