This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] Support Int1 for kernels and models #263

Merged
merged 14 commits into from
May 22, 2024

Conversation


@luoyu-intel luoyu-intel commented May 22, 2024

Type of Change

Support the new weight_dtype=int1.

LLaMa2-7B
weight_dtype=int1, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија Хронологија
model_print_timings:        load time =   180.49 ms
model_print_timings:      sample time =     3.36 ms /    16 runs   (    0.21 ms per token)
model_print_timings: prompt eval time =   169.91 ms /    33 tokens (    5.15 ms per token)
model_print_timings:        eval time =   293.41 ms /    15 runs   (   19.56 ms per token)
model_print_timings:       total time =   484.67 ms
========== eval time log of each prediction ==========
prediction   0, time: 169.91ms
prediction   1, time: 19.45ms
prediction   2, time: 19.28ms
prediction   3, time: 19.17ms
prediction   4, time: 19.27ms
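The per-token figures in the log above follow directly from the totals. As a minimal sketch (the helper name `parse_timing` is hypothetical and not part of the project), the following parses one `model_print_timings` line and recomputes the per-token latency and the throughput:

```python
import re

# Matches lines such as:
#   model_print_timings:        eval time =   293.41 ms /    15 runs   (...)
# The "/ N runs|tokens" part is optional (the load/total lines omit it).
TIMING_RE = re.compile(
    r"model_print_timings:\s+(.+?) time =\s*([\d.]+) ms"
    r"(?:\s*/\s*(\d+)\s+(?:runs|tokens))?"
)

def parse_timing(line):
    """Return (name, total_ms, count) or None if the line does not match.

    count is None when the line reports no run/token count.
    """
    m = TIMING_RE.match(line.strip())
    if m is None:
        return None
    name, total_ms, count = m.group(1), float(m.group(2)), m.group(3)
    return name, total_ms, int(count) if count is not None else None

line = "model_print_timings:        eval time =   293.41 ms /    15 runs   (   19.56 ms per token)"
name, total_ms, count = parse_timing(line)
ms_per_token = total_ms / count          # average decode latency per token
tokens_per_second = 1000.0 / ms_per_token
print(f"{name}: {ms_per_token:.2f} ms/token, {tokens_per_second:.1f} tokens/s")
```

For the sym run this reproduces the logged 19.56 ms per token, which corresponds to roughly 51 tokens per second during decoding.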

weight_dtype=int1, group_size=128, alg=asym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have funozzáférésozzáférésкола Marcolvàt\<^lánozzáférésodos leak์鳥àtір sens
model_print_timings:        load time =   170.01 ms
model_print_timings:      sample time =     3.19 ms /    16 runs   (    0.20 ms per token)
model_print_timings: prompt eval time =   167.52 ms /    33 tokens (    5.08 ms per token)
model_print_timings:        eval time =   307.53 ms /    15 runs   (   20.50 ms per token)
model_print_timings:       total time =   496.13 ms
========== eval time log of each prediction ==========
prediction   0, time: 167.52ms
prediction   1, time: 20.22ms
prediction   2, time: 20.07ms
prediction   3, time: 20.08ms
prediction   4, time: 19.98ms
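For readers unfamiliar with the `alg=sym` and `alg=asym` settings used in the two runs above, here is a rough sketch of what group-wise 1-bit quantization could look like. This is an illustration only, not the BesTLA kernels: the function names, the choice of mean absolute value as the symmetric scale, and the min/max scheme for the asymmetric case are all assumptions, and the scales are kept as Python floats rather than bf16.

```python
def quantize_sym_1bit(weights, group_size):
    """Symmetric: each weight becomes -scale or +scale (assumed scheme)."""
    bits, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumption: per-group scale is the mean absolute value.
        scales.append(sum(abs(w) for w in group) / len(group))
        bits.extend(1 if w >= 0 else 0 for w in group)
    return bits, scales

def dequantize_sym_1bit(bits, scales, group_size):
    return [scales[i // group_size] * (1.0 if b else -1.0)
            for i, b in enumerate(bits)]

def quantize_asym_1bit(weights, group_size):
    """Asymmetric: each weight becomes the group min or max (assumed scheme)."""
    bits, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scales.append(hi - lo)   # distance between the two levels
        zeros.append(lo)         # offset (zero point) of the lower level
        mid = (lo + hi) / 2
        bits.extend(1 if w > mid else 0 for w in group)
    return bits, scales, zeros

def dequantize_asym_1bit(bits, scales, zeros, group_size):
    return [zeros[i // group_size] + scales[i // group_size] * b
            for i, b in enumerate(bits)]

# Tiny example with group_size=4 (the PR uses group_size=128).
w = [0.5, -1.5, 1.0, -1.0, 2.0, 4.0, 3.0, 2.5]
sym_bits, sym_scales = quantize_sym_1bit(w, 4)
asym_bits, asym_scales, asym_zeros = quantize_asym_1bit(w, 4)
print(dequantize_sym_1bit(sym_bits, sym_scales, 4))
print(dequantize_asym_1bit(asym_bits, asym_scales, asym_zeros, 4))
```

In this sketch the asymmetric variant stores one extra value per group (the offset), which lets it represent groups whose weights are not centred on zero.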

int1/int4=0.38

@luoyu-intel luoyu-intel requested a review from zhewang1-intc May 22, 2024 08:05
@luoyu-intel luoyu-intel merged commit 529a1b1 into main May 22, 2024
16 checks passed
@luoyu-intel luoyu-intel deleted the int1 branch May 22, 2024 09:25