Eval bug: assertion failure GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") when running inference with a GGUF-quantized model
#12080
Open · Vedapani0402 opened this issue on Feb 26, 2025 · 1 comment
Name and Version
The latest version of llama.cpp.
Operating systems
Windows
GGML backends
CPU
Hardware
Intel Core i5 (10th Gen), 16 GB RAM, CPU-only
Models
Flan T5 Large
Problem description & steps to reproduce
I have a fine-tuned Flan-T5 model stored locally, which I quantized and converted to GGUF format with llama.cpp's conversion script, using the following command:
!python {path to convert_hf_to_gguf.py} {path to hf_model} --outfile {name_of_outputfile.gguf} --outtype {quantization type}
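(For example, with 8-bit quantization the concrete invocation would look something like the line below; the input path is a placeholder, and the output name matches the file I use later:)
python convert_hf_to_gguf.py ./flan-t5-large-finetuned --outfile t5_8bit.gguf --outtype q8_0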
I then loaded the GGUF file with llama-cpp-python's Llama class:
from llama_cpp import Llama
gguf_model_path = "t5_8bit.gguf"
model = Llama(model_path=gguf_model_path)
When I run inference in a Jupyter notebook, the kernel dies. Running the same code from the Command Prompt produces the assertion failure:
GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
I used the code below for inference on CPU; the failure is triggered at model.eval().
Code:
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
tokens = model.tokenize(prompt.encode())
output_tokens = model.eval(tokens)
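# the eval() call above is where the kernel dies / the GGML_ASSERT fires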
output = model.detokenize(tokens)
print(output)
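For reference, I also understand the usual high-level inference entry point in llama-cpp-python is the completion call rather than eval()/detokenize(); a minimal sketch of that, reusing the same model file and prompt:

from llama_cpp import Llama

model = Llama(model_path="t5_8bit.gguf")
prompt = "Extract Tags and Relevant Text: Please Annotate that the market rates has fallen drastically"
# __call__ runs a completion and returns a dict with a "choices" list
out = model(prompt, max_tokens=64)
print(out["choices"][0]["text"])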
Why does this assertion fire, and what is the fix? I am trying to run quantized models locally for inference.
I also tried the approach from #8398 (comment) using llama_encode, but ran into an issue with the llama_batch instance. What is a llama_batch instance, and how do I pass it to the llama_encode function?
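Roughly what I attempted there, as a minimal sketch; the low-level binding names and signatures (llama_batch_get_one, llama_encode, llama_model_decoder_start_token) are my best understanding of the current llama-cpp-python API, not something I have verified:

import llama_cpp

# Load the model and a context through the low-level bindings (all defaults).
mparams = llama_cpp.llama_model_default_params()
lmodel = llama_cpp.llama_load_model_from_file(b"t5_8bit.gguf", mparams)
ctx = llama_cpp.llama_new_context_with_model(lmodel, llama_cpp.llama_context_default_params())

# Placeholder token ids; real ones would come from tokenizing the prompt.
toks = [0, 1, 2]

# A llama_batch is the C struct that carries token ids (plus positions and
# sequence ids) into the model; llama_batch_get_one builds a single-sequence
# batch over a plain token array. (Older builds take extra pos/seq arguments.)
arr = (llama_cpp.llama_token * len(toks))(*toks)
batch = llama_cpp.llama_batch_get_one(arr, len(toks))

# For encoder-decoder models like T5 the encoder must run first; this is
# presumably what the GGML_ASSERT is complaining about.
if llama_cpp.llama_encode(ctx, batch) != 0:
    raise RuntimeError("llama_encode failed")

# Decoding then starts from the model's decoder start token.
dec = llama_cpp.llama_model_decoder_start_token(lmodel)
darr = (llama_cpp.llama_token * 1)(dec)
llama_cpp.llama_decode(ctx, llama_cpp.llama_batch_get_one(darr, 1))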
First Bad Commit
No response
Relevant log output