Eval bug: getting assertion error when trying to use a gguf quantized model at inference "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed" #12080

Open
Vedapani0402 opened this issue Feb 26, 2025 · 1 comment


Vedapani0402 commented Feb 26, 2025

Name and Version

the latest version of llama.cpp

Operating systems

Windows

GGML backends

CPU

Hardware

Intel Core i5 (10th Gen), 16 GB RAM, CPU only

Models

Flan T5 Large

Problem description & steps to reproduce

I have a fine-tuned Flan-T5 model stored locally, which I quantized and converted to GGUF format with llama.cpp using the following command:

!python {path to convert_hf_to_gguf.py} {path to hf_model} --outfile {name_of_outputfile.gguf} --outtype {quantization type}

and loaded the GGUF file with the Llama class from llama_cpp:

from llama_cpp import Llama
gguf_model_path = "t5_8bit.gguf"
model = Llama(model_path=gguf_model_path)

When I try to run inference in a Jupyter Notebook, the kernel dies. When I try the same thing from the Command Prompt, I get the assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed".

I used the code below for inference on the CPU; the issue is triggered at model.eval():

Code:
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"

tokens = model.tokenize(prompt.encode())
output_tokens = model.eval(tokens)
output = model.detokenize(tokens)
print(output)

Why does this issue occur, and what is the solution? I am trying to use quantized models locally for inference.
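
For reference, here is a minimal sketch of what I think the equivalent call through the high-level completion API would look like, instead of eval()/detokenize(). I have not confirmed whether this path runs the encoder pass for T5-style encoder-decoder models, so treat that as an assumption:

from llama_cpp import Llama

# Sketch only: assumes the high-level completion call handles tokenization,
# generation and (for encoder-decoder models such as T5) the encode step
# internally (an assumption, not confirmed behaviour).
model = Llama(model_path="t5_8bit.gguf")

prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
result = model(prompt, max_tokens=128)
print(result["choices"][0]["text"])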

First Bad Commit

No response

Relevant log output

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
Vedapani0402 (Author) commented:

I also tried the approach from #8398 (comment) using llama_encode, but I get the error below about the llama_batch instance. What is a llama_batch instance, and how do we pass it to llama_encode?

Code used:

from llama_cpp import llama_encode, llama_model_decoder_start_token, llama_decode
prompt = "Example Text"
llama_encode(512,prompt.encode("utf-8"))

Error:

---->line 4: llama_encode(512,prompt.encode("utf-8"))

ArgumentError: argument 2: TypeError: expected llama_batch instance instead of bytes
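
From reading the low-level bindings, my understanding is that llama_batch is the C struct that carries the token ids (together with their positions and sequence ids), so llama_encode expects the context pointer as its first argument and a batch built from tokens as its second, not raw bytes. Below is a rough sketch of what I think that would look like; the helper names, signatures and private attributes used here (llama_batch_get_one, _ctx.ctx, _model.model) differ between llama-cpp-python versions, so every call is an assumption to verify:

import llama_cpp
from llama_cpp import Llama

llm = Llama(model_path="t5_8bit.gguf")
tokens = llm.tokenize("Example Text".encode("utf-8"))  # plain Python list of token ids

# Pack the token ids into a ctypes array and build a llama_batch from it
# (assumes the two-argument form of llama_batch_get_one)
tok_arr = (llama_cpp.llama_token * len(tokens))(*tokens)
batch = llama_cpp.llama_batch_get_one(tok_arr, len(tokens))

# Run the encoder pass; the first argument is the raw llama_context pointer
# (reaching it through the private _ctx attribute is an assumption)
llama_cpp.llama_encode(llm._ctx.ctx, batch)

# Decoding would then start from the model's decoder start token
start_token = llama_cpp.llama_model_decoder_start_token(llm._model.model)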
