Speedup problem with GPTQModel #90

Open
ChenMnZ opened this issue Jul 19, 2024 · 10 comments

@ChenMnZ

ChenMnZ commented Jul 19, 2024

Hi,

I tested BitBLAS models using the https://github.com/ModelCloud/GPTQModel repo.

The output is correct. However, with BitBLAS the low-bit (2-bit and 4-bit) models generate tokens at about the same speed as the FP16 model. Detailed results are as follows:
[image: benchmark results]

The corresponding test code is:

import argparse
import time

import torch
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, get_backend


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default=None, type=str, help="path to the quantized model")
    parser.add_argument("--wbits", type=int, default=4, help="quantization bits")
    parser.add_argument("--group_size", type=int, default=128, help="quantization group size")
    parser.add_argument("--test_speed", action="store_true")
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model, use_fast=False, legacy=False)
    model = GPTQModel.from_quantized(args.model, device_map='auto', torch_dtype=torch.float16, backend=get_backend('BITBLAS'))
    model.cuda()
    print(f"memory footprint after loading quantized model: {torch.cuda.max_memory_allocated('cuda') / 1024**3:.2f}GiB")

    if args.test_speed:
        prompt = "Write a poem about large language model:"
        input_ids = tokenizer(prompt, return_tensors='pt').input_ids.cuda()
        start_time = time.time()
        output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
        end_time = time.time()
        # Note: len(output[0]) counts the prompt tokens as well as the generated ones.
        speed = len(output[0]) / (end_time - start_time)
        print(tokenizer.decode(output[0]))
        print(f"generation speed:{speed}token/s")


if __name__ == '__main__':
    main()

Do you know what might be preventing a speedup? Thank you.
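One thing that may be worth ruling out is the measurement itself. The sketch below is an illustration, not part of the original script: it times a call with an optional synchronization hook (on a GPU one would pass `torch.cuda.synchronize`) and computes throughput over newly generated tokens only, since `generate` on a decoder-only model returns the prompt followed by the new tokens. The helper names `timed_call` and `new_tokens_per_second` are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple


def timed_call(fn: Callable[[], object],
               sync: Optional[Callable[[], None]] = None) -> Tuple[object, float]:
    """Run fn() and return (result, elapsed_seconds).

    `sync` is called right before the clock starts and right before it stops.
    On a GPU, passing torch.cuda.synchronize makes sure queued kernels have
    finished when the time is read.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def new_tokens_per_second(prompt_len: int, output_len: int, elapsed_s: float) -> float:
    """Throughput over generated tokens only; output_len includes the prompt."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (output_len - prompt_len) / elapsed_s


# Assumed usage with the script above (not executed here):
#   output, dt = timed_call(
#       lambda: model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256),
#       sync=torch.cuda.synchronize,
#   )
#   speed = new_tokens_per_second(input_ids.shape[1], len(output[0]), dt)
```

Running one untimed warm-up generation before the measured one may also help, because the first call can include one-time setup costs. Whether any of this changes the comparison between backends is not established in this thread.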

@LeiWang1999
Contributor

Hi @ChenMnZ, would you mind providing the script to reproduce the Triton v2 backend results? :)

@ChenMnZ
Author

ChenMnZ commented Jul 19, 2024

@LeiWang1999
Thanks for the quick reply. To test Triton v2, just change the model loading call from

GPTQModel.from_quantized(args.model, device_map='auto',torch_dtype=torch.float16,backend=get_backend('BITBLAS'))

to

GPTQModel.from_quantized(args.model, device_map='auto',torch_dtype=torch.float16)

args.model should be the path to a standard GPTQ-packed model. The code will then automatically choose the Triton v2 kernel for 2-bit quantization.
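To avoid editing the script between runs, the two loading calls can be driven from one place. This is a minimal sketch: `build_load_kwargs` and the `--backend` flag mentioned in the comments are hypothetical, and `get_backend` is passed in as a parameter so the helper does not depend on gptqmodel being installed.

```python
from typing import Any, Callable, Dict, Optional


def build_load_kwargs(backend_name: Optional[str],
                      get_backend: Callable[[str], Any],
                      torch_dtype: Any) -> Dict[str, Any]:
    """Build keyword arguments for GPTQModel.from_quantized.

    With backend_name=None no backend is passed, which matches the second
    loading call above (the library then picks the kernel itself).
    """
    kwargs: Dict[str, Any] = {"device_map": "auto", "torch_dtype": torch_dtype}
    if backend_name is not None:
        kwargs["backend"] = get_backend(backend_name)
    return kwargs


# Assumed usage, with a hypothetical --backend CLI flag (not executed here):
#   kwargs = build_load_kwargs(args.backend, get_backend, torch.float16)
#   model = GPTQModel.from_quantized(args.model, **kwargs)
```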

@LeiWang1999
Contributor

@ChenMnZ Thanks, that's interesting. I'll take a look.

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

Hi @LeiWang1999,
Have you found a solution to this inference speed problem?

@LeiWang1999
Contributor

Hi @ChenMnZ, can you provide the Hugging Face model repos so that we can reproduce this?

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

@LeiWang1999
Contributor

Hi @ChenMnZ, have you run into this error when running the following command?

python gptq.py --model ChenMnZ/Llama-3-8b-instruct-EfficientQAT-w4g128-BitBLAS --test_speed

Traceback (most recent call last):
  File "/root/BitBLAS/debug/gptq.py", line 35, in <module>
    main()
  File "/root/BitBLAS/debug/gptq.py", line 27, in main
    output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=256)
  File "/opt/conda/lib/python3.10/site-packages/gptqmodel-0.9.3.dev0+cu1201010-py3.10-linux-x86_64.egg/gptqmodel/models/base.py", line 466, in generate
    return self.model.generate(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 3020, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

@ChenMnZ
Author

ChenMnZ commented Aug 31, 2024

@LeiWang1999
Sorry for the confusion.

Replacing model.cuda() with model.model.cuda() in the code solves this problem.
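As a sketch of that workaround, the helper below moves only the wrapped module to the GPU instead of calling `.cuda()` on the wrapper. It assumes the wrapper exposes the underlying model as `.model`, which is consistent with the `self.model.generate(**kwargs)` frame in the traceback above; the function name `move_inner_model_to_gpu` is hypothetical.

```python
def move_inner_model_to_gpu(wrapper):
    """Call .cuda() on wrapper.model rather than on the wrapper itself.

    Assumes `wrapper` has a `.model` attribute holding the underlying module;
    raises AttributeError otherwise so the mismatch is visible.
    """
    inner = getattr(wrapper, "model", None)
    if inner is None:
        raise AttributeError("wrapper has no inner `.model` to move")
    inner.cuda()
    return wrapper


# Assumed usage in the test script (not executed here):
#   model = GPTQModel.from_quantized(...)
#   move_inner_model_to_gpu(model)   # instead of model.cuda()
```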

@w32zhong

@ChenMnZ What GPU(s) are you running the experiments on?

@ChenMnZ
Author

ChenMnZ commented Sep 12, 2024

@w32zhong NVIDIA A100 80GB
