[Torch][WeightCompression] Add Scale Estimation data-aware support #3179
base: develop

Conversation
weight compression build - 291
The proposed example can be added as a follow-up PR - it will be excluded from this PR.
tests/cross_fw/test_templates/template_test_weights_compression.py (outdated review thread, resolved)
def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
    x = x[0]
Not for this PR.
Should this function really take a list?
It seems like the list always contains one element. Is RawReducer the only one that consumes the whole list, while the rest work with a single element?
Could we derive all these "one-element" classes from a class that defines a method taking a single element?
IMO, it would be clearer when one element is expected and when not.
@daniil-lyakhov
Reducers were designed to receive several inputs and produce several outputs as well. For now there are no such reducers, but a possible use case is quantization error: (fp32 input, int8 input) -> diff.
We could create a class for that, but it would make the hierarchy tree more complicated: we would then have to introduce a method like _reduce_out_of_place_one_input, and I don't think that would make the code more readable.
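As an illustration of the multi-input use case mentioned above, here is a hypothetical sketch of a reducer that consumes two inputs and emits their difference. The class name and reduction logic are invented for illustration and are not part of the current NNCF API:

```python
from typing import List

import torch

# Hypothetical two-input reducer: given (fp32 output, dequantized int8 output),
# it reduces to a single quantization-error tensor. It consumes the whole input
# list, unlike the current one-element reducers - which is what motivates the
# List[Tensor] signature in the base class.
class QuantizationErrorReducer:
    def _reduce_out_of_place(self, x: List[torch.Tensor]) -> List[torch.Tensor]:
        fp32_out, int8_out = x
        return [(fp32_out - int8_out).abs().mean(dim=-1)]
```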
Since all reducers use one input, it should be relatively easy, shouldn't it?
- Keep _reduce_out_of_place for many inputs in the base class.
- Add an intermediate subclass that defines _reduce_out_of_place_one_input.
- Change the "one-element" reducers to inherit from the intermediate class instead of the base class.
- Rename their _reduce_out_of_place to _reduce_out_of_place_one_input.
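A minimal sketch of the hierarchy proposed above. The class names are taken from the discussion, with MeanVarianceReducer standing in for any current one-input reducer; this is illustrative, not the actual NNCF code:

```python
from abc import ABC, abstractmethod
from typing import List

import torch

class TensorReducerBase(ABC):
    # The base contract stays list-in/list-out so future multi-input
    # reducers (e.g. a quantization-error reducer) remain possible.
    @abstractmethod
    def _reduce_out_of_place(self, x: List[torch.Tensor]) -> List[torch.Tensor]: ...

class OneInputTensorReducer(TensorReducerBase):
    # Intermediate class: unwraps the single-element list exactly once,
    # so concrete one-input reducers only implement the single-tensor hook.
    def _reduce_out_of_place(self, x: List[torch.Tensor]) -> List[torch.Tensor]:
        return [self._reduce_out_of_place_one_input(x[0])]

    @abstractmethod
    def _reduce_out_of_place_one_input(self, x: torch.Tensor) -> torch.Tensor: ...

class MeanVarianceReducer(OneInputTensorReducer):
    def _reduce_out_of_place_one_input(self, x: torch.Tensor) -> torch.Tensor:
        # Illustrative reduction: per-row variance averaged over rows.
        return x.var(dim=-1).mean()
```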
@kshpv Have you run the performance job after merging with develop? Could you share it?
Sure, performance build 41
tests/cross_fw/test_templates/template_test_weights_compression.py (two outdated review threads, resolved)
Comparing performance builds 40 and 41, one can notice that AWQ is sometimes slower. Although the times for mixed precision, applying compression, and the total time are better with build 41, I wonder why there is some slowness. Is it a measurement error, or are these numbers expected? @nikita-savelyevv
It is possible that the AWQ part becomes a bit slower for a small model like tiny-llama, because the added compiled functions are most effective when compressing large tensors. When operating on small tensors, compilation overhead can overshadow the speedup from compiled computation. That's why, for example, int8 tiny-llama compression is a bit slower after #2727. My assumption is that the same effect makes AWQ a bit slower in this case. It should not happen for larger models, though. I will check this. Thanks for the observation.
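A generic illustration of this effect, using a standalone torch.compile microbenchmark rather than the actual NNCF code path from #2727: for a small tensor the first compiled call is dominated by compilation time, while a large tensor amortizes it.

```python
import time

import torch

# Toy quantize-dequantize kernel; the function body is illustrative only.
def quant_dequant(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.round(x / scale).clamp(-8, 7) * scale

compiled = torch.compile(quant_dequant)

for n in (64, 4096):
    x = torch.randn(n, n)
    scale = x.abs().amax(dim=-1, keepdim=True) / 7
    t0 = time.perf_counter(); quant_dequant(x, scale); eager = time.perf_counter() - t0
    t0 = time.perf_counter(); compiled(x, scale); comp = time.perf_counter() - t0
    # For n=64 the compiled call is far slower (compilation dominates);
    # for n=4096 the gap shrinks as the work amortizes the overhead.
    print(f"n={n}: eager {eager:.4f}s, compiled (incl. compile) {comp:.4f}s")
```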
Thanks for the confirmation!
Changes
Added data-aware support for the Torch backend for WeightCompression with Scale Estimation.
Introduced support for MeanVarianceReducer, MaxVarianceReducer, and MeanAbsMaxReducer.
Incorporated the torch.inference_mode() context for WeightCompression (a usage sketch follows below).
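A hedged usage sketch of what this PR enables: data-aware INT4 weight compression with Scale Estimation on a Torch model. The tiny model and calibration data are placeholders; parameter names follow the public nncf.compress_weights API, but exact constraints (e.g. minimum layer sizes eligible for compression) may differ:

```python
import nncf
import torch

# Placeholder model standing in for a real LLM.
class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 256)
        self.fc2 = torch.nn.Linear(256, 128)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP().eval()  # move to .cuda() to leverage a GPU, as in the PR's benchmarks
calibration = nncf.Dataset([torch.randn(1, 128) for _ in range(8)])

# Data-aware compression: the dataset drives activation statistics collection
# (run under torch.inference_mode() per this PR), and scale_estimation=True
# enables the data-aware scale search.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,
    dataset=calibration,
    scale_estimation=True,
)
```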
Reason for changes
These changes enable the use of data-aware Scale Estimation for the Torch backend, specifically leveraging CUDA devices for improved performance.
Related tickets
Ticket ID: 158974
Tests
Added a template for WeightCompression tests for both Torch and OV backends, covering data-aware and Scale Estimation scenarios.
Extended the test scope to include tinyllama_data_aware and tinyllama_scale_estimation_per_channel for Torch.
Added a new test case tinyllama_scale_estimation_group_size_64 for both Torch and OV backends.

Performance Metrics
Note: All CUDA results are obtained locally on a single RTX 3090.