Eval bug: In RISC-V, output tokens are broken #12124

Open
op21beyond opened this issue Mar 1, 2025 · 0 comments

op21beyond commented Mar 1, 2025

Name and Version

sh-3.2# llama-cli --version
version: 4758 (5fa07c2)
built with riscv64-tizen-linux-gnu-gcc (Tizen/RISC-V/imafdcv/Standalone-20230621) 13.1.0 for riscv64-tizen-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

RISC-V ISA Simulator (https://github.com/OpenXiangShan/NEMU)

Models

The errors are not tied to a specific model; they occur with most of the models I tested, including:
DeepSeek-R1-Distill-Qwen-1.5B-Q8_0.gguf
llama-3.2-1b-instruct-q8_0.gguf
nano-mistral-q4_0.gguf
tiny-llm-q8_0.gguf

Problem description & steps to reproduce

When I run llama-simple (or llama-cli) built with the __riscv_v_intrinsic flag (the default for the llama.cpp RISC-V cross compile), the generated tokens are broken. For example:
llama-simple -m tiny-llm-q8_0.gguf
(output)
Hello my name is.,etsperled.raHeperrical
plantaping]
plantaping]
plantaping]
plantaping]
plantcluding")cketsming

main: decoded 32 tokens in 0.70 s, speed: 46.00 t/s
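
For context, ggml gates its RISC-V vector kernels with a preprocessor guard on this same macro, so the two builds exercise different code paths. The small standalone program below (an illustrative sketch, not code from ggml) prints which path a given compiler invocation selects; building it once as-is and once with -U__riscv_v_intrinsic should mirror the two llama.cpp builds:

// guard_check.c -- report which code path the compiler selected, using the
// same macro that gates ggml's RVV kernels.
#include <stdio.h>

int main(void) {
#if defined(__riscv_v_intrinsic)
    // Numeric version macro defined by toolchains that ship the RVV C
    // intrinsics (e.g. 11000 for intrinsics spec v0.11).
    printf("RVV intrinsics path (__riscv_v_intrinsic = %d)\n",
           (int) __riscv_v_intrinsic);
#else
    // Selected when compiling with -U__riscv_v_intrinsic.
    printf("scalar fallback path\n");
#endif
    return 0;
}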

If I run a non-RVV build of llama-simple (or llama-cli), compiled without __riscv_v_intrinsic (i.e. with -U__riscv_v_intrinsic), the generated tokens are not broken. I suspect there is a bug in the RISC-V RVV intrinsic code in ggml. For example:
llama-simple -m tiny-llm-q8_0.gguf
(output)
Hello my name is so much more than I am, I am so happy to be able to get a new one.
I am a new one. I am a new one
main: decoded 32 tokens in 0.77 s, speed: 41.80 t/s
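
To help narrow down whether the fault lies in the simulator's vector model or in ggml's RVV kernels, a standalone test comparing a basic RVV float dot product against a scalar reference could be run on NEMU first. This is an illustrative sketch, not code from ggml; it assumes a toolchain providing the __riscv_-prefixed RVV C intrinsics in riscv_vector.h (e.g. compiled with -march=rv64gcv):

// rvv_dot_check.c -- compare a basic RVV f32 dot product against a scalar
// reference. A mismatch here would point at the simulator's vector model
// rather than ggml's kernels.
#include <stdio.h>
#include <riscv_vector.h>

static float dot_scalar(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

static float dot_rvv(const float *a, const float *b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ) {
        // Strip-mine: process up to VLMAX elements per iteration.
        size_t vl = __riscv_vsetvl_e32m1((size_t)(n - i));
        vfloat32m1_t va = __riscv_vle32_v_f32m1(a + i, vl);
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(b + i, vl);
        vfloat32m1_t vp = __riscv_vfmul_vv_f32m1(va, vb, vl);
        // Reduce this chunk to a scalar and accumulate.
        vfloat32m1_t z  = __riscv_vfmv_v_f_f32m1(0.0f, vl);
        vfloat32m1_t vr = __riscv_vfredusum_vs_f32m1_f32m1(vp, z, vl);
        s += __riscv_vfmv_f_s_f32m1_f32(vr);
        i += (int) vl;
    }
    return s;
}

int main(void) {
    enum { N = 37 };  // deliberately not a multiple of the vector length
    float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 0.5f * i; b[i] = (float)(N - i); }
    printf("scalar = %f\nrvv    = %f\n", dot_scalar(a, b, N), dot_rvv(a, b, N));
    return 0;
}

If this simple case already diverges, the problem is likely below ggml; if it matches, the next step would be comparing ggml's quantized RVV kernels (e.g. the Q8_0 dot product) against their scalar counterparts.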

First Bad Commit

No response

Relevant log output

llama_model_loader: loaded meta data with 32 key-value pairs and 12 tensors from ./model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Tiny LLM
llama_model_loader: - kv   3:                         general.size_label str              = 13M
llama_model_loader: - kv   4:                            general.license str              = mit
llama_model_loader: - kv   5:                      general.dataset.count u32              = 1
llama_model_loader: - kv   6:                     general.dataset.0.name str              = Fineweb
llama_model_loader: - kv   7:             general.dataset.0.organization str              = HuggingFaceFW
llama_model_loader: - kv   8:                 general.dataset.0.repo_url str              = https://huggingface.co/HuggingFaceFW/...
llama_model_loader: - kv   9:                               general.tags arr[str,1]       = ["text-generation"]
llama_model_loader: - kv  10:                          llama.block_count u32              = 1
llama_model_loader: - kv  11:                       llama.context_length u32              = 1024
llama_model_loader: - kv  12:                     llama.embedding_length u32              = 192
llama_model_loader: - kv  13:                  llama.feed_forward_length u32              = 1024
llama_model_loader: - kv  14:                 llama.attention.head_count u32              = 2
llama_model_loader: - kv  15:              llama.attention.head_count_kv u32              = 1
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 96
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,32000]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,32000]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  27:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  28:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  29:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                          general.file_type u32              = 7
llama_model_loader: - type  f32:    3 tensors
llama_model_loader: - type q8_0:    9 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 13.16 MiB (8.50 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token:      0 '<unk>' is not marked as EOG
load: control token:      2 '</s>' is not marked as EOG
load: control token:      1 '<s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 1024
print_info: n_embd           = 192
print_info: n_layer          = 1
print_info: n_head           = 2
print_info: n_head_kv        = 1
print_info: n_rot            = 96
print_info: n_swa            = 0
print_info: n_embd_head_k    = 96
print_info: n_embd_head_v    = 96
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 96
print_info: n_embd_v_gqa     = 96
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 1024
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 1024
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 12.99 M
print_info: general.name     = Tiny LLM
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU
load_tensors: layer   1 assigned to device CPU
load_tensors: tensor 'token_embd.weight' (q8_0) (and 11 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors:   CPU_Mapped model buffer size =    13.16 MiB
......
llama_init_from_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 64
llama_init_from_model: n_ctx_per_seq = 64
llama_init_from_model: n_batch       = 64
llama_init_from_model: n_ubatch      = 64
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (64) < n_ctx_train (1024) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 64, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 1, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 96, n_embd_v_gqa = 96
llama_kv_cache_init:        CPU KV buffer size =     0.02 MiB
llama_init_from_model: KV self size  =    0.02 MiB, K (f16):    0.01 MiB, V (f16):    0.01 MiB
llama_init_from_model:        CPU  output buffer size =     0.12 MiB
llama_init_from_model:        CPU compute buffer size =     7.86 MiB
llama_init_from_model: graph nodes  = 38
llama_init_from_model: graph splits = 1
<s> Hello my name is.,etsperled.raHeperrical
plantaping]
plantaping]
plantaping]
plantaping]
plantcluding")cketsming
main: decoded 32 tokens in 0.70 s, speed: 46.00 t/s

llama_perf_sampler_print:    sampling time =       1.82 ms /    32 runs   (    0.06 ms per token, 17621.15 tokens per second)
llama_perf_context_print:        load time =     227.70 ms
llama_perf_context_print: prompt eval time =      22.88 ms /     5 tokens (    4.58 ms per token,   218.52 tokens per second)
llama_perf_context_print:        eval time =     641.11 ms /    31 runs   (   20.68 ms per token,    48.35 tokens per second)
llama_perf_context_print:       total time =     900.52 ms /    36 tokens

[VD-XS] src/profiling/profiling_control.c reset_inst_counters 50 Have taken checkpoint at 3659994548 guest instructions (abs_inst_count 3659994760)
[VD-XS] /home/jongchul/VD-XS/NEMU/src/isa/riscv64/include/../instr/special.h execute 53 trap 0x0 cpu.pc = 0x104a4 s.pc = 0x104b6

real    0m59.748s
user    0m58.636s
sys     0m1.104s