This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

[BesTLA] Support int5&int6 for kernels and models #259

Merged: 21 commits, May 21, 2024


@luoyu-intel (Contributor) commented May 20, 2024

Type of Change

Add new weight_dtype values: int5 and int6
Support model quantization to int5 and int6
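
For readers unfamiliar with the settings used in the benchmarks in this thread (`weight_dtype`, `group_size`, `alg`), here is a minimal sketch of group-wise quantization to a signed n-bit range. It is a plain-Python illustration under stated assumptions, not BesTLA's kernel code: the names `quantize_group` and `dequantize_group` are hypothetical, the rounding and zero-point conventions are one common choice, and packing the 5- or 6-bit codes into bytes is omitted.

```python
import math


def quantize_group(values, bits, symmetric=True):
    """Quantize one group of floats to signed `bits`-bit integer codes.

    Returns (codes, scale, zero_point). Hypothetical helper for illustration;
    the real kernels may use different rounding and storage conventions.
    """
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if symmetric:
        # alg=sym: zero maps to code 0, scale is set by the largest magnitude.
        amax = max(abs(v) for v in values)
        scale = amax / qmax if amax > 0 else 1.0
        zero_point = 0
    else:
        # alg=asym: the group's [min, max] range is mapped onto [qmin, qmax].
        lo, hi = min(values), max(values)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# One group of 128 synthetic weights (group_size=128 in the runs in this thread).
group = [math.sin(i) for i in range(128)]
for bits in (5, 6):
    for symmetric in (True, False):
        codes, scale, zp = quantize_group(group, bits, symmetric)
        recon = dequantize_group(codes, scale, zp)
        err = max(abs(a - b) for a, b in zip(group, recon))
        print(f"int{bits} {'sym' if symmetric else 'asym'}: max abs error {err:.4f}")
```

Each extra bit roughly halves the quantization step within a group, which is the accuracy side of the trade-off against the larger weight footprint.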

@luoyu-intel marked this pull request as ready for review May 20, 2024 07:21
@luoyu-intel (Contributor, Author)

LLaMa2-7B
weight_dtype=int5, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always worried about her safety, and forbade her
model_print_timings:        load time =   205.45 ms
model_print_timings:      sample time =     3.88 ms /    16 runs   (    0.24 ms per token)
model_print_timings: prompt eval time =   191.15 ms /    33 tokens (    5.79 ms per token)
model_print_timings:        eval time =   948.45 ms /    15 runs   (   63.23 ms per token)
model_print_timings:       total time =  1164.44 ms
========== eval time log of each prediction ==========
prediction   0, time: 191.15ms
prediction   1, time: 63.11ms
prediction   2, time: 62.90ms
prediction   3, time: 62.90ms
prediction   4, time: 62.94ms

weight_dtype=int5, group_size=128, alg=asym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were too busy with their work, and they didn't
model_print_timings:        load time =   208.64 ms
model_print_timings:      sample time =     4.40 ms /    16 runs   (    0.28 ms per token)
model_print_timings: prompt eval time =   190.07 ms /    33 tokens (    5.76 ms per token)
model_print_timings:        eval time =   964.22 ms /    15 runs   (   64.28 ms per token)
model_print_timings:       total time =  1183.92 ms
========== eval time log of each prediction ==========
prediction   0, time: 190.07ms
prediction   1, time: 64.27ms
prediction   2, time: 63.99ms
prediction   3, time: 63.92ms
prediction   4, time: 64.02ms

@luoyu-intel (Contributor, Author)

LLaMa2-7B
weight_dtype=int6, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun with her friends. However, she had never gone on an adventure before,
model_print_timings:        load time =   406.25 ms
model_print_timings:      sample time =     4.29 ms /    16 runs   (    0.27 ms per token)
model_print_timings: prompt eval time =   200.18 ms /    33 tokens (    6.07 ms per token)
model_print_timings:        eval time =  1129.39 ms /    15 runs   (   75.29 ms per token)
model_print_timings:       total time =  1544.76 ms
========== eval time log of each prediction ==========
prediction   0, time: 200.18ms
prediction   1, time: 75.18ms
prediction   2, time: 74.97ms
prediction   3, time: 75.01ms
prediction   4, time: 74.91ms

weight_dtype=int6, group_size=128, alg=asym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun while doing it. This young girl was full of curiosity, always eager to explore
model_print_timings:        load time =   204.77 ms
model_print_timings:      sample time =     3.94 ms /    16 runs   (    0.25 ms per token)
model_print_timings: prompt eval time =   199.53 ms /    33 tokens (    6.05 ms per token)
model_print_timings:        eval time =  1146.44 ms /    15 runs   (   76.43 ms per token)
model_print_timings:       total time =  1361.33 ms
========== eval time log of each prediction ==========
prediction   0, time: 199.54ms
prediction   1, time: 76.22ms
prediction   2, time: 76.11ms
prediction   3, time: 76.17ms
prediction   4, time: 76.01ms

@luoyu-intel requested a review from kevinintel May 21, 2024 03:34
@kevinintel (Contributor)

please update supported data types in https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md

@luoyu-intel (Contributor, Author)

> please update supported data types in https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md

Added

@luoyu-intel (Contributor, Author)

int4 reference
sym:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun in the process.
Unfortunately, her parents were quite against this kind of
model_print_timings:        load time =  5127.59 ms
model_print_timings:      sample time =     3.94 ms /    16 runs   (    0.25 ms per token)
model_print_timings: prompt eval time =   183.49 ms /    33 tokens (    5.56 ms per token)
model_print_timings:        eval time =   764.45 ms /    15 runs   (   50.96 ms per token)
model_print_timings:       total time =  5900.96 ms
========== eval time log of each prediction ==========
prediction   0, time: 183.49ms
prediction   1, time: 50.81ms
prediction   2, time: 50.48ms
prediction   3, time: 50.52ms
prediction   4, time: 50.51ms

asym:

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun doing so. However, her parents were not very keen on the idea of letting
model_print_timings:        load time =   196.11 ms
model_print_timings:      sample time =     3.47 ms /    16 runs   (    0.22 ms per token)
model_print_timings: prompt eval time =   180.31 ms /    33 tokens (    5.46 ms per token)
model_print_timings:        eval time =   780.00 ms /    15 runs   (   52.00 ms per token)
model_print_timings:       total time =   986.11 ms
========== eval time log of each prediction ==========
prediction   0, time: 180.31ms
prediction   1, time: 51.77ms
prediction   2, time: 51.58ms
prediction   3, time: 51.55ms
prediction   4, time: 51.50ms

Per-token eval time relative to int4 (from the logs above):
int5/int4 = 1.23
int6/int4 = 1.47
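
The ratios above can be reproduced from the `eval time` lines of the logs in this thread. The sketch below copies the per-token eval times and divides by the int4 baseline for the same `alg`. It also prints a nominal bits-per-weight ratio as a point of comparison; that part is an assumption (one bf16 scale per group of 128, zero points and packing overhead ignored) and is not something the thread states.

```python
# Per-token eval times in ms, copied from the model_print_timings lines.
eval_ms_per_token = {
    ("int4", "sym"): 50.96,
    ("int4", "asym"): 52.00,
    ("int5", "sym"): 63.23,
    ("int5", "asym"): 64.28,
    ("int6", "sym"): 75.29,
    ("int6", "asym"): 76.43,
}


def latency_ratio(dtype, alg, baseline="int4"):
    """Per-token eval time of `dtype` divided by the baseline's, same alg."""
    return eval_ms_per_token[(dtype, alg)] / eval_ms_per_token[(baseline, alg)]


def nominal_bits_per_weight(bits, group_size=128, scale_bits=16):
    """Assumed storage cost: `bits` per weight plus one 16-bit scale per group."""
    return bits + scale_bits / group_size


for dtype, bits in (("int5", 5), ("int6", 6)):
    for alg in ("sym", "asym"):
        print(f"{dtype}/int4 latency ({alg}): {latency_ratio(dtype, alg):.3f}")
    storage = nominal_bits_per_weight(bits) / nominal_bits_per_weight(4)
    print(f"{dtype}/int4 nominal storage: {storage:.3f}")
```

The latency ratios come out at roughly 1.24 for int5 and 1.47 to 1.48 for int6, in line with the figures quoted above. They also sit close to the nominal storage ratios, which would be consistent with token generation being limited by weight memory traffic, though the logs alone do not establish that.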

@luoyu-intel (Contributor, Author)

LLaMa2-7B
weight_dtype=int5, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

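| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | 0 | acc | 0.6875 | ± 0.0130 |
| piqa | 1 | none | 0 | acc | 0.7622 | ± 0.0099 |
| | | none | 0 | acc_norm | 0.7715 | ± 0.0098 |
| lambada_openai | 1 | none | 0 | perplexity | 3.3888 | ± 0.0913 |
| | | none | 0 | acc | 0.7002 | ± 0.0064 |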

weight_dtype=int6, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

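| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | 0 | acc | 0.6827 | ± 0.0131 |
| piqa | 1 | none | 0 | acc | 0.7633 | ± 0.0099 |
| | | none | 0 | acc_norm | 0.7671 | ± 0.0099 |
| lambada_openai | 1 | none | 0 | perplexity | 3.3843 | ± 0.0912 |
| | | none | 0 | acc | 0.7000 | ± 0.0064 |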
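
To read the int5 and int6 accuracy numbers side by side, the sketch below expresses each difference in units of the combined standard error. The values are copied from the two result tables; combining the errors as if they were independent is a simplifying assumption (both runs use the same evaluation sets, so the true uncertainty of the difference may be smaller), and the helper name `z_score` is only illustrative.

```python
import math

# (value, stderr) pairs copied from the int5 and int6 result tables.
int5 = {
    "winogrande acc": (0.6875, 0.0130),
    "piqa acc": (0.7622, 0.0099),
    "piqa acc_norm": (0.7715, 0.0098),
    "lambada_openai perplexity": (3.3888, 0.0913),
    "lambada_openai acc": (0.7002, 0.0064),
}
int6 = {
    "winogrande acc": (0.6827, 0.0131),
    "piqa acc": (0.7633, 0.0099),
    "piqa acc_norm": (0.7671, 0.0099),
    "lambada_openai perplexity": (3.3843, 0.0912),
    "lambada_openai acc": (0.7000, 0.0064),
}


def z_score(a, b):
    """Difference of two (value, stderr) results over their combined stderr.

    Assumes the two errors are independent, which is a simplification here.
    """
    (va, sa), (vb, sb) = a, b
    return (va - vb) / math.hypot(sa, sb)


z_scores = {name: z_score(int5[name], int6[name]) for name in int5}
for name, z in z_scores.items():
    print(f"{name}: int5 - int6 = {int5[name][0] - int6[name][0]:+.4f} (z = {z:+.2f})")
```

Under that assumption every difference is well below one combined standard error, so these three tasks do not separate int5 from int6 on LLaMa2-7B.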

@luoyu-intel merged commit 68d2cff into main May 21, 2024
16 checks passed
@luoyu-intel deleted the int56 branch May 21, 2024 07:10