Eval bug: getting assertion error when trying to use a gguf quantized model at inference "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed" #12080

Open
Vedapani0402 opened this issue Feb 26, 2025 · 1 comment


Vedapani0402 commented Feb 26, 2025

Name and Version

the latest version of llama.cpp

Operating systems

Windows

GGML backends

CPU

Hardware

Intel Core i5 (10th Gen), 16 GB RAM, CPU only

Models

Flan T5 Large

Problem description & steps to reproduce

I have a fine-tuned Flan-T5 model stored locally, which I quantized and converted to GGUF format with llama.cpp using the following command:

!python {path to convert_hf_to_gguf.py} {path to hf_model} --outfile {name_of_outputfile.gguf} --outtype {quantization type}

and loaded the GGUF file with the Llama class from llama_cpp:

from llama_cpp import Llama
gguf_model_path = "t5_8bit.gguf"
model = Llama(model_path=gguf_model_path)

When I try to run inference in a Jupyter Notebook, the kernel dies. When I try the same thing from the Command Prompt, I get the assertion failure "GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed".

I used the code below for inference on the CPU; the issue is triggered at model.eval():

Code:
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"

tokens = model.tokenize(prompt.encode())
output_tokens = model.eval(tokens)
output = model.detokenize(tokens)
print(output)

Why does this issue occur, and what is the solution? I am trying to use quantized models locally for inference.
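
For reference, here is a minimal sketch of what I think the equivalent call through the high-level completion API would look like, instead of eval()/detokenize(). I have not confirmed whether this path runs the encoder pass for T5-style encoder-decoder models, so treat that as an assumption:

from llama_cpp import Llama

# Sketch only: assumes the high-level completion call handles tokenization,
# generation and (for encoder-decoder models such as T5) the encode step
# internally (an assumption, not confirmed behaviour).
model = Llama(model_path="t5_8bit.gguf")

prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
result = model(prompt, max_tokens=128)
print(result["choices"][0]["text"])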

First Bad Commit

No response

Relevant log output

GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
Vedapani0402 (Author) commented:

I also tried the approach from #8398 (comment) using llama_encode, but I get the error below about the llama_batch instance. What is a llama_batch instance, and how do we pass it to llama_encode?

Code used:

from llama_cpp import llama_encode, llama_model_decoder_start_token, llama_decode
prompt = "Example Text"
llama_encode(512,prompt.encode("utf-8"))

Error:

---->line 4: llama_encode(512,prompt.encode("utf-8"))

ArgumentError: argument 2: TypeError: expected llama_batch instance instead of bytes
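
From reading the low-level bindings, my understanding is that llama_batch is the C struct that carries the token ids (together with their positions and sequence ids), so llama_encode expects the context pointer as its first argument and a batch built from tokens as its second, not raw bytes. Below is a rough sketch of what I think that would look like; the helper names, signatures and private attributes used here (llama_batch_get_one, _ctx.ctx, _model.model) differ between llama-cpp-python versions, so every call is an assumption to verify:

import llama_cpp
from llama_cpp import Llama

llm = Llama(model_path="t5_8bit.gguf")
tokens = llm.tokenize("Example Text".encode("utf-8"))  # plain Python list of token ids

# Pack the token ids into a ctypes array and build a llama_batch from it
# (assumes the two-argument form of llama_batch_get_one)
tok_arr = (llama_cpp.llama_token * len(tokens))(*tokens)
batch = llama_cpp.llama_batch_get_one(tok_arr, len(tokens))

# Run the encoder pass; the first argument is the raw llama_context pointer
# (reaching it through the private _ctx attribute is an assumption)
llama_cpp.llama_encode(llm._ctx.ctx, batch)

# Decoding would then start from the model's decoder start token
start_token = llama_cpp.llama_model_decoder_start_token(llm._model.model)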
