[BesTLA] Support int5&int6 for kernels and models #259

luoyu-intel · 2024-05-20T05:05:45Z

Type of Change

Add new weight_dtype: int5 and int6
Support model quantization of int5 and int6
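For readers unfamiliar with the new dtypes, here is a minimal pure-Python sketch of what symmetric group-wise quantization to a signed 5-bit or 6-bit range looks like. It is an illustration only, not the BesTLA kernel code: the function names (`quantize_sym`, `dequantize`, `max_abs_error`) and the toy weights are hypothetical, and the rounding and clipping policy is an assumption.

```python
def quantize_sym(weights, bits, group_size=128):
    """Symmetric per-group quantization to signed `bits`-bit integers.

    Illustrative only: one float scale per group, no zero point.
    """
    qmax = 2 ** (bits - 1) - 1   # 15 for int5, 31 for int6
    qmin = -(2 ** (bits - 1))    # -16 for int5, -32 for int6
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        # Round to the nearest code and clip to the representable range.
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def dequantize(codes, scales, group_size=128):
    """Map integer codes back to floats using the scale of their group."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


def max_abs_error(bits, weights, group_size=128):
    """Largest round-trip error for a given bit width."""
    codes, scales = quantize_sym(weights, bits, group_size)
    restored = dequantize(codes, scales, group_size)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Deterministic toy weights in [-1, 1] (hypothetical data, two groups of 128).
weights = [((i * 37) % 201 - 100) / 100.0 for i in range(256)]

for bits in (4, 5, 6):
    print(f"int{bits}: max abs round-trip error = {max_abs_error(bits, weights):.4f}")
```

Each extra bit roughly halves the quantization step within a group, which is the motivation for offering int5 and int6 between int4 and int8.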

bestla/bestla/bestla_prologue_b.h

luoyu-intel · 2024-05-20T07:36:20Z

LLaMa2-7B
weight_dtype=int5, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were always worried about her safety, and forbade her
model_print_timings:        load time =   205.45 ms
model_print_timings:      sample time =     3.88 ms /    16 runs   (    0.24 ms per token)
model_print_timings: prompt eval time =   191.15 ms /    33 tokens (    5.79 ms per token)
model_print_timings:        eval time =   948.45 ms /    15 runs   (   63.23 ms per token)
model_print_timings:       total time =  1164.44 ms
========== eval time log of each prediction ==========
prediction   0, time: 191.15ms
prediction   1, time: 63.11ms
prediction   2, time: 62.90ms
prediction   3, time: 62.90ms
prediction   4, time: 62.94ms

weight_dtype=int5, group_size=128, alg=asym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. But her parents were too busy with their work, and they didn't
model_print_timings:        load time =   208.64 ms
model_print_timings:      sample time =     4.40 ms /    16 runs   (    0.28 ms per token)
model_print_timings: prompt eval time =   190.07 ms /    33 tokens (    5.76 ms per token)
model_print_timings:        eval time =   964.22 ms /    15 runs   (   64.28 ms per token)
model_print_timings:       total time =  1183.92 ms
========== eval time log of each prediction ==========
prediction   0, time: 190.07ms
prediction   1, time: 64.27ms
prediction   2, time: 63.99ms
prediction   3, time: 63.92ms
prediction   4, time: 64.02ms

luoyu-intel · 2024-05-21T01:50:47Z

LLaMa2-7B
weight_dtype=int6, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun with her friends. However, she had never gone on an adventure before,
model_print_timings:        load time =   406.25 ms
model_print_timings:      sample time =     4.29 ms /    16 runs   (    0.27 ms per token)
model_print_timings: prompt eval time =   200.18 ms /    33 tokens (    6.07 ms per token)
model_print_timings:        eval time =  1129.39 ms /    15 runs   (   75.29 ms per token)
model_print_timings:       total time =  1544.76 ms
========== eval time log of each prediction ==========
prediction   0, time: 200.18ms
prediction   1, time: 75.18ms
prediction   2, time: 74.97ms
prediction   3, time: 75.01ms
prediction   4, time: 74.91ms

weight_dtype=int6, group_size=128, alg=asym, scale_dtype=bf16, comp_dtype=int8:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun while doing it. This young girl was full of curiosity, always eager to explore
model_print_timings:        load time =   204.77 ms
model_print_timings:      sample time =     3.94 ms /    16 runs   (    0.25 ms per token)
model_print_timings: prompt eval time =   199.53 ms /    33 tokens (    6.05 ms per token)
model_print_timings:        eval time =  1146.44 ms /    15 runs   (   76.43 ms per token)
model_print_timings:       total time =  1361.33 ms
========== eval time log of each prediction ==========
prediction   0, time: 199.54ms
prediction   1, time: 76.22ms
prediction   2, time: 76.11ms
prediction   3, time: 76.17ms
prediction   4, time: 76.01ms

kevinintel · 2024-05-21T05:24:28Z

please update supported data types in https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md

luoyu-intel · 2024-05-21T05:35:05Z

> please update supported data types in https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md

Added

luoyu-intel · 2024-05-21T06:28:57Z

int4 reference
sym:

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun in the process.
Unfortunately, her parents were quite against this kind of
model_print_timings:        load time =  5127.59 ms
model_print_timings:      sample time =     3.94 ms /    16 runs   (    0.25 ms per token)
model_print_timings: prompt eval time =   183.49 ms /    33 tokens (    5.56 ms per token)
model_print_timings:        eval time =   764.45 ms /    15 runs   (   50.96 ms per token)
model_print_timings:       total time =  5900.96 ms
========== eval time log of each prediction ==========
prediction   0, time: 183.49ms
prediction   1, time: 50.81ms
prediction   2, time: 50.48ms
prediction   3, time: 50.52ms
prediction   4, time: 50.51ms

asym:

Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun doing so. However, her parents were not very keen on the idea of letting
model_print_timings:        load time =   196.11 ms
model_print_timings:      sample time =     3.47 ms /    16 runs   (    0.22 ms per token)
model_print_timings: prompt eval time =   180.31 ms /    33 tokens (    5.46 ms per token)
model_print_timings:        eval time =   780.00 ms /    15 runs   (   52.00 ms per token)
model_print_timings:       total time =   986.11 ms
========== eval time log of each prediction ==========
prediction   0, time: 180.31ms
prediction   1, time: 51.77ms
prediction   2, time: 51.58ms
prediction   3, time: 51.55ms
prediction   4, time: 51.50ms

Eval time per token relative to int4 (from the logs above):
int5/int4 = 1.23
int6/int4 = 1.47
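The ratios can be re-derived from the "eval time ... ms per token" lines in the logs posted in this thread. The short script below does that arithmetic; the dictionary layout and the function name `ratio_vs_int4` are just for illustration. It also prints the plain bit-width ratio (5/4 and 6/4) for comparison. Reading the closeness of the two as a sign that decoding is bound by weight memory traffic is an interpretation, not something measured here.

```python
# ms per token, copied from the "eval time" lines of the logs in this PR.
eval_ms_per_token = {
    ("int4", "sym"): 50.96,
    ("int4", "asym"): 52.00,
    ("int5", "sym"): 63.23,
    ("int5", "asym"): 64.28,
    ("int6", "sym"): 75.29,
    ("int6", "asym"): 76.43,
}


def ratio_vs_int4(dtype, alg):
    """Per-token eval latency of `dtype` relative to int4 for the same alg."""
    return eval_ms_per_token[(dtype, alg)] / eval_ms_per_token[("int4", alg)]


for dtype, bits in (("int5", 5), ("int6", 6)):
    for alg in ("sym", "asym"):
        r = ratio_vs_int4(dtype, alg)
        print(f"{dtype}/int4 ({alg}): latency ratio {r:.2f}, "
              f"bit-width ratio {bits / 4:.2f}")
```

The computed values land within about 0.01 of the figures quoted above, depending on whether the sym or asym run is used.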

luoyu-intel · 2024-05-21T07:08:50Z

LLaMa2-7B
weight_dtype=int5, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

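| Tasks | Version | Filter | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | acc | 0.6875 | ± | 0.0130 |
| piqa | 1 | none | acc | 0.7622 | ± | 0.0099 |
| | | none | acc_norm | 0.7715 | ± | 0.0098 |
| lambada_openai | 1 | none | perplexity | 3.3888 | ± | 0.0913 |
| | | none | acc | 0.7002 | ± | 0.0064 |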

weight_dtype=int6, group_size=128, alg=sym, scale_dtype=bf16, comp_dtype=int8:

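| Tasks | Version | Filter | Metric | Value | | Stderr |
|---|---|---|---|---|---|---|
| winogrande | 1 | none | acc | 0.6827 | ± | 0.0131 |
| piqa | 1 | none | acc | 0.7633 | ± | 0.0099 |
| | | none | acc_norm | 0.7671 | ± | 0.0099 |
| lambada_openai | 1 | none | perplexity | 3.3843 | ± | 0.0912 |
| | | none | acc | 0.7000 | ± | 0.0064 |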
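As a rough sanity check on the two accuracy runs, the snippet below compares each int5 and int6 metric against the reported standard errors. The numbers are copied from the results above; the combined-error formula assumes the two runs have independent errors, which is only an approximation since both use the same evaluation sets. The names `results` and `compare` are illustrative.

```python
import math

# (value, stderr) pairs copied from the int5 and int6 sym results above.
results = {
    "int5": {
        "winogrande acc": (0.6875, 0.0130),
        "piqa acc": (0.7622, 0.0099),
        "piqa acc_norm": (0.7715, 0.0098),
        "lambada_openai perplexity": (3.3888, 0.0913),
        "lambada_openai acc": (0.7002, 0.0064),
    },
    "int6": {
        "winogrande acc": (0.6827, 0.0131),
        "piqa acc": (0.7633, 0.0099),
        "piqa acc_norm": (0.7671, 0.0099),
        "lambada_openai perplexity": (3.3843, 0.0912),
        "lambada_openai acc": (0.7000, 0.0064),
    },
}


def compare(metric):
    """Return (int6 - int5, combined stderr) for one metric."""
    v5, s5 = results["int5"][metric]
    v6, s6 = results["int6"][metric]
    # Assumption: independent errors, so the stderrs add in quadrature.
    return v6 - v5, math.hypot(s5, s6)


for metric in results["int5"]:
    diff, combined = compare(metric)
    verdict = "within" if abs(diff) < combined else "outside"
    print(f"{metric}: int6 - int5 = {diff:+.4f} ({verdict} combined stderr {combined:.4f})")
```

Under that assumption, every difference between int5 and int6 on these tasks is smaller than the combined standard error, so these runs do not separate the two dtypes on accuracy.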

add initial of int5

ee2bd8a

github-advanced-security bot found potential problems May 20, 2024

bestla/bestla/bestla_prologue_b.h (4 alerts, all marked Fixed)

luoyu-intel added 8 commits May 20, 2024 14:04

add all gemv of int5

3e4096e

finish all avx2 kernels of int5

d666f36

add benchmark of int5

24423bb

add avx512f s5_s8, s5_fp

66674df

add avx512f kernels of int5

fc5da69

test LLaMa2-7B with int5, sym and asym.

e1c3670

fix code scan

91caf36

clang-format

99ff3f0

luoyu-intel marked this pull request as ready for review May 20, 2024 07:21

luoyu-intel added 3 commits May 20, 2024 16:33

add avx2 decompress kernels of int6

2ceb516

add avx2 gemv kernels for int6

8d68b5e

add avx512f kernels for int6

814f4a2

luoyu-intel added 5 commits May 21, 2024 09:53

clang-format

3822687

fix UT

451aa2d

fix UT bug

9752312

add UTs for new bits

5f98ab8

update doc

e157ac8

luoyu-intel requested review from zhewang1-intc and yuchengliu1 May 21, 2024 03:14

fix UT bug

5c1bc85

luoyu-intel added the ready to review and BesTLA labels May 21, 2024

fix ISA check

5d8bac9

luoyu-intel requested a review from kevinintel May 21, 2024 03:34

fix bug of AVX2 s6_s8

6dd58f5

update dtypes in advanced_usage.md

5b98a4b

yuchengliu1 approved these changes May 21, 2024


zhewang1-intc approved these changes May 21, 2024


luoyu-intel merged commit 68d2cff into main May 21, 2024
16 checks passed

luoyu-intel deleted the int56 branch May 21, 2024 07:10
