
[Bug] InternVL2-2B inference is slow; the time is dominated by vision feature extraction #2604

fong-git opened this issue Oct 15, 2024 · 9 comments

@fong-git
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I ran inference on InternVL2-2B with Transformers, vLLM, and LMDeploy, with max_num_patch set to 12 in all three cases. Average latency per sample:
Transformers: 691 ms
vLLM: 308 ms
LMDeploy: 523 ms
Profiling the vLLM and LMDeploy runs shows that the ViT part takes about 9 ms on average in vLLM but about 323 ms in LMDeploy.
The LMDeploy ViT time is measured inside _get_prompt_input of the VLAsyncEngine class.

Reproduction

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
import os
import time

# Prompt asking the model to classify the document shown in the image.
PROMPT_SYSTEM = """
Based on the image, determine which document category this document belongs to. Reply strictly in the following format and do not output any extra explanation (do not force a document into an incorrect category: documents that do not belong to a specific category should be classified as 'Other documents'):
Document category: the category this document belongs to
"""

# model = 'model/OpenGVLab/InternVL2-1B'
model = 'model/OpenGVLab/InternVL2-2B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192, model_format='hf'))

img_path = './cs_function_recommendation_bak/test_data/image'
imgs = os.listdir(img_path)
total_time = 0
vit_time_total = 0
for img in imgs[:100]:
    image = load_image(os.path.join(img_path, img))

    start = time.time()
    response = pipe((PROMPT_SYSTEM, image))
    end = time.time()
    time_ = end - start
    total_time += time_
    # `vit_time` is a field I added in VLAsyncEngine._get_prompt_input to
    # record how long the vision feature extraction took.
    vit_time_total += response.vit_time
    print(response.text, f"\nViT time: {response.vit_time}, total time: {time_}")

print(vit_time_total)
print(total_time)

Environment

sys.platform: linux
Python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9: NVIDIA A100 80GB PCIe
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.18.1+cu121
LMDeploy: 0.5.3+aa00ed0
transformers: 4.45.0.dev0
gradio: Not Found
fastapi: 0.115.0
pydantic: 2.9.2
triton: 2.3.1

Error traceback

No response

@sjzhou4

sjzhou4 commented Oct 23, 2024

Hi, I ran into a similar problem. From a quick analysis on my side, the extra time LMDeploy spends in the ViT part compared with vLLM comes from LMDeploy having to move the ViT features from GPU to CPU one by one, i.e. the x.cpu() call shown in the screenshot below. vLLM does not have this step.

[screenshot: the x.cpu() call in LMDeploy's vision encoder]
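
The effect can be reproduced outside LMDeploy with a small standalone timing sketch (the feature shape below is only a guess for InternVL2-2B with max_num_patch=12): because .cpu() is a synchronous device-to-host copy, the time attributed to it also includes any ViT kernels still in flight on the stream, so the copy looks far more expensive than it is.

import time

import torch

# Standalone sketch, not LMDeploy code: the feature shape is a guess for
# InternVL2-2B with max_num_patch=12 (13 tiles x 256 tokens x 2048 dims).
feats = torch.randn(13, 256, 2048, dtype=torch.float16, device="cuda")

# Queue GPU work that has not finished yet (a stand-in for the ViT forward).
eye = torch.eye(2048, dtype=torch.float16, device="cuda")
for _ in range(50):
    feats = feats @ eye

start = time.perf_counter()
feats_cpu = feats.cpu()  # blocks until the queued matmuls AND the copy finish
print(f"x.cpu() measured as {(time.perf_counter() - start) * 1e3:.1f} ms")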

@fong-git
Author

fong-git commented Oct 23, 2024 via email

@sjzhou4

sjzhou4 commented Oct 23, 2024

After I commented it out, the corresponding to-cpu time basically disappeared, and in my tests the overall ViT performance of LMDeploy and vLLM is about the same (once the to-cpu copy is removed).
[screenshots: ViT timings with the x.cpu() copy commented out]

@sjzhou4

sjzhou4 commented Oct 23, 2024

That said, even though the to-cpu time is removed here, a GPU-to-CPU synchronization still happens later; that is how LMDeploy's logic is implemented.

@fong-git
Author

That said, even though the to-cpu time is removed here, a GPU-to-CPU synchronization still happens later; that is how LMDeploy's logic is implemented.

So even if the to cpu() call is commented out here, LMDeploy will still synchronize from GPU to CPU later on, right? In other words, the overall time still won't go down?

@irexyc
Collaborator

irexyc commented Oct 23, 2024

@fong-git

I'm not sure how you measured the time. A more accurate way is to exclude the preprocessing time and synchronize the stream before and after the vision model forward. Below are the forward times of two vision models I measured earlier.

[screenshot: measured forward times of two vision models]
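
For reference, a minimal timing sketch along those lines might look like the following; vision_model and pixel_values are placeholders for whatever ViT and preprocessed inputs are being measured.

import time

import torch

def time_vision_forward(vision_model, pixel_values, warmup=3, iters=10):
    """Time only the GPU forward: preprocessing is done beforehand, and the
    CUDA stream is synchronized before and after the measured region."""
    with torch.inference_mode():
        for _ in range(warmup):          # warm-up runs exclude lazy init cost
            vision_model(pixel_values)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            vision_model(pixel_values)
        torch.cuda.synchronize()         # wait for all queued kernels
    return (time.perf_counter() - start) / iters * 1e3  # ms per forward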

LMDeploy does not apply tensor parallelism (tp) to the vision model, so tp brings no benefit to LMDeploy's vision part. With large batches under tp it will be somewhat slower than vLLM, but today's vision models are fairly large, and GPU memory does not necessarily allow running such large batches anyway.

@fong-git @sjzhou4

Regarding the to-cpu question: the PyTorch backend previously hit a problem where, without the to-cpu copy, the resulting image features were incorrect, which is probably related to the vision model running in a separate thread (asyncio executor). The to-cpu copy affects the latency of a single request, but it should not affect overall throughput, because it does not block requests.

Also, I don't think the to-cpu copy can really be omitted: to support prefix caching later on, we need to keep a certain number of image features around so that features are not re-extracted repeatedly during a conversation, and because of GPU memory constraints, storing those features in host memory is a reasonable choice.
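
For illustration only (this is not how LMDeploy implements it), one way to shrink the stall seen by a single request is an asynchronous copy into pinned host memory on a side stream; the caller must still synchronize that stream before reading the buffer, and the correctness issue with the separate vision thread mentioned above may still apply.

import torch

def to_host_async(feats: torch.Tensor, copy_stream: torch.cuda.Stream) -> torch.Tensor:
    """Hedged sketch: non-blocking device-to-host copy into pinned memory."""
    host_buf = torch.empty(feats.shape, dtype=feats.dtype, pin_memory=True)
    copy_stream.wait_stream(torch.cuda.current_stream())  # run after the ViT forward
    with torch.cuda.stream(copy_stream):
        host_buf.copy_(feats, non_blocking=True)  # overlaps with other GPU work
    return host_buf  # synchronize copy_stream before reading host_buf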

@fong-git
Author

@irexyc I measured the pure GPU time for the vision model to compute the features, and it is about the same as vLLM. However, timing features = await self.vl_encoder.async_infer(images) inside _get_prompt_input of the VLAsyncEngine class is much slower than vLLM, which is why the measured end-to-end inference speed is slower than vLLM.

@Dimensionzw

@fong-git In my tests with tp=4 for both, LMDeploy is about 500 ms slower than vLLM while the feature inference time is essentially the same; the problem is exactly the to-cpu part. vLLM passes the GPU torch tensor directly into the downstream pipeline:

def merge_multimodal_embeddings(input_ids: torch.Tensor,
                                inputs_embeds: torch.Tensor,
                                multimodal_embeddings: NestedTensors,
                                placeholder_token_id: int) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to placeholder tokens in
    ``input_ids``.

    Note:
        This updates ``inputs_embeds`` in place.
    """
    mask = (input_ids == placeholder_token_id)
    num_expected_tokens = mask.sum().item()
    assert isinstance(num_expected_tokens, int)

    flattened = _flatten_embeddings(multimodal_embeddings)
    if flattened.shape[0] != num_expected_tokens:
        expr = _embedding_count_expression(multimodal_embeddings)
        raise ValueError(
            f"Attempted to assign {expr} = {flattened.shape[0]} "
            f"multimodal tokens to {num_expected_tokens} placeholders")

    inputs_embeds[mask] = flattened
    return inputs_embeds
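
In other words, on this vLLM path the merged embeddings never leave the GPU: the placeholder positions in inputs_embeds are overwritten in place with the flattened multimodal embeddings, so no device-to-host copy of the vision features is involved.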

@sjzhou4

sjzhou4 commented Oct 24, 2024

@irexyc @fong-git Yes, the to-cpu copy in LMDeploy cannot be left out. As @Dimensionzw said, LMDeploy and vLLM differ in their architecture: LMDeploy does more work in the upper layer, such as template concatenation and feature extraction, and then passes these inputs to the turbomind backend for processing. And as @irexyc said, prefix caching and similar features may later store the feature data in host memory or even on disk to reduce GPU memory usage, all of which requires the to-cpu operation.
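
As a rough, purely hypothetical illustration of that idea (not LMDeploy's actual prefix-caching code), a host-memory feature cache only works because the features have been copied to the CPU first; extract_features below is a placeholder for the real vision forward.

import hashlib

import torch

_feature_cache: dict[str, torch.Tensor] = {}

def cached_image_features(image_bytes: bytes, extract_features) -> torch.Tensor:
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _feature_cache:
        feats = extract_features(image_bytes)  # ViT forward on the GPU
        _feature_cache[key] = feats.cpu()      # keep in host memory, free VRAM
    return _feature_cache[key].to("cuda", non_blocking=True)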
