
[Torch][WeightCompression] Add Scale Estimation data-aware support #3179

Open · wants to merge 50 commits into base: develop
Conversation

@kshpv kshpv commented Jan 8, 2025

Changes

- Added data-aware support for the Torch backend for WeightCompression with Scale Estimation.
- Introduced support for MeanVarianceReducer, MaxVarianceReducer, and MeanAbsMaxReducer.
- Incorporated a torch.inference_mode() context for WeightCompression.

Reason for changes

These changes enable data-aware Scale Estimation for the Torch backend and, in particular, allow it to run on CUDA devices for improved performance.
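For illustration, below is a minimal sketch of how the added capability could be exercised. It assumes the existing nncf.compress_weights entry point with the scale_estimation option and an nncf.Dataset for calibration; the model ID, calibration text, and transform function are placeholders and are not taken from this PR.

```python
# Hedged sketch: data-aware Scale Estimation on a Torch model running on CUDA.
# The model ID, calibration data, and transform function are illustrative placeholders.
import nncf
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

calibration_texts = ["An example calibration sentence."]

def transform_fn(text: str) -> dict:
    # Tokenize and move the inputs to the same device as the model.
    return {k: v.to("cuda") for k, v in tokenizer(text, return_tensors="pt").items()}

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.8,
    group_size=64,
    dataset=nncf.Dataset(calibration_texts, transform_fn),
    scale_estimation=True,  # data-aware Scale Estimation, now available on the Torch backend
)
```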

Related tickets

Ticket ID: 158974

Tests

- Added a template for WeightCompression tests for both Torch and OV backends, covering data-aware and Scale Estimation scenarios.
- Extended the test scope to include tinyllama_data_aware and tinyllama_scale_estimation_per_channel for Torch.
- Added a new test case, tinyllama_scale_estimation_group_size_64, for both Torch and OV backends.

Performance Metrics

Note: All CUDA results are obtained locally on a single RTX 3090.

| Model | Backend | Metric Name | Metric Value | Num int4 | Num int8 | Compression Time (from Performance Job) | RAM, MiB (from Performance Job) |
|---|---|---|---|---|---|---|---|
| tinyllama_data_aware | OV | Similarity | 0.8577 | 94 | 124 | 0:01:28 | 8545 |
| tinyllama_data_aware | TORCH | Similarity | 0.8577 | 94 | 124 | 0:02:15 | 1225 |
| tinyllama_data_aware | TORCH (CUDA) | Similarity | 0.8577 | 94 | 124 | 0:00:28 | - |
| tinyllama_scale_estimation_per_channel | OV | Similarity | 0.8139 | 188 | 124 | 0:02:57 | 8681 |
| tinyllama_scale_estimation_per_channel | TORCH | Similarity | 0.8139 | 188 | 124 | 0:03:25 | 5472 |
| tinyllama_scale_estimation_per_channel | TORCH (CUDA) | Similarity | 0.8139 | 188 | 124 | 0:00:35 | - |
| tinyllama_scale_estimation_group_size_64 | OV | Similarity | 0.8566 | 94 | 124 | 0:04:17 | 8681 |
| tinyllama_scale_estimation_group_size_64 | TORCH | Similarity | 0.8566 | 94 | 124 | 0:04:01 | 5575 |
| tinyllama_scale_estimation_group_size_64 | TORCH (CUDA) | Similarity | 0.8566 | 94 | 124 | 0:00:36 | - |

@kshpv kshpv requested a review from a team as a code owner January 8, 2025 09:53
@github-actions github-actions bot added the NNCF PT, NNCF Common, experimental, NNCF OpenVINO, and NNCF PTQ labels Jan 8, 2025
@kshpv kshpv changed the title Scale est torch → [Torch][WeightCompression] Add Scale Estimation support Jan 8, 2025
@kshpv kshpv marked this pull request as draft January 8, 2025 09:54
@MaximProshin MaximProshin changed the title [Torch][WeightCompression] Add Scale Estimation support → [Torch][WeightCompression] Add Scale Estimation data-aware support Jan 8, 2025
@kshpv (Collaborator, Author) commented Jan 8, 2025

weight compression build - 291

@kshpv (Collaborator, Author) commented Jan 8, 2025

The proposed example can be added in a follow-up PR; it will be excluded from this PR.

Comment on lines +467 to +468
def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
    x = x[0]
Contributor:
Not for this PR.
Should this function really take a list?
It seems like it always contains one element. Is RawReducer the only reducer that consumes the whole list, while the rest work with a single element?
Could we derive all these "one-element" classes from a class that defines a method taking a single element?
IMO, it would be clearer when one element is expected and when it is not.
@daniil-lyakhov (Collaborator) commented:
Reducers were designed to receive several inputs and produce several outputs as well. For now, there are no such reducers, but a possible use case is quantization error: (fp32 input, int8 input) -> diff.

We could create a class for that, but it would make the hierarchy tree more complicated; we would then have to introduce a method like _reduce_out_of_place_one_input, and I don't think that would make the code more readable.

@ljaljushkin (Contributor) commented Jan 24, 2025:
Since all reducers use one input, it should be relatively easy, shouldn't it? For example (a sketch follows after this list):

  1. Keep _reduce_out_of_place for many inputs in the base class.
  2. Subclass an intermediate class with _reduce_out_of_place_one_input.
  3. Change the one-input reducers to inherit from the intermediate class instead of the base class.
  4. Rename their _reduce_out_of_place to _reduce_out_of_place_one_input.
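Not part of the PR, but to make the suggestion concrete, here is a rough sketch of the proposed hierarchy. The class names and the reduction body are placeholders; only _reduce_out_of_place and _reduce_out_of_place_one_input come from the discussion above.

```python
from abc import ABC, abstractmethod
from typing import Any, List

Tensor = Any  # stand-in for the NNCF tensor type in this sketch


class TensorReducerBase(ABC):
    # Base class keeps the general contract: several inputs in, several outputs out.
    @abstractmethod
    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        ...


class OneInputReducerBase(TensorReducerBase):
    # Intermediate class that adapts the multi-input contract to the common one-input case.
    def _reduce_out_of_place(self, x: List[Tensor]) -> List[Tensor]:
        return self._reduce_out_of_place_one_input(x[0])

    @abstractmethod
    def _reduce_out_of_place_one_input(self, x: Tensor) -> List[Tensor]:
        ...


class MeanVarianceReducer(OneInputReducerBase):
    # Existing one-input reducers would only re-parent and rename their method.
    def _reduce_out_of_place_one_input(self, x: Tensor) -> List[Tensor]:
        return [x]  # placeholder for the actual mean-variance reduction
```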

@kshpv kshpv requested review from ljaljushkin and alexsu52 January 22, 2025 14:48
@ljaljushkin (Contributor) commented:
@kshpv Have you run the performance job after merging with develop? Could you share it?

@kshpv kshpv requested review from ljaljushkin January 24, 2025 10:03
@kshpv (Collaborator, Author) commented Jan 24, 2025

> @kshpv Have you run the performance job after merging with develop? Could you share it?

Sure, performance build 41

@ljaljushkin (Contributor) commented:
If you compare the 40 and 41 performance builds, you can notice that AWQ is sometimes slower.
Build 40 doesn't include #2727 [screenshot]
Build 41 includes #2727 [screenshot]

Although the times for mixed precision, applying compression, and the total time are better with build 41, I wonder why there is some slowness. Is it a measurement error or are these expected numbers? @nikita-savelyevv

@nikita-savelyevv (Collaborator) commented:
> If you compare the 40 and 41 performance builds, you can notice that AWQ is sometimes slower. Build 40 doesn't include #2727. Build 41 includes #2727.
>
> Although the times for mixed precision, applying compression, and the total time are better with build 41, I wonder why there is some slowness. Is it a measurement error or are these expected numbers? @nikita-savelyevv

It is possible that the AWQ part becomes a bit slower for a small model like tiny-llama, because the added compiled functions are most effective when compressing large tensors. When operating on small tensors, the compilation overhead can overshadow the speedup from the compiled computation.

That's why, for example, int8 tiny-llama compression is a bit slower after #2727 [screenshot]

My assumption is that the same effect makes AWQ a bit slower in this case. It should not happen for larger models, though. I will check this. Thanks for the observation.
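As a side note, the overhead effect described above can be reproduced with a generic micro-benchmark. The snippet below uses torch.compile purely as an analogy (it is not the mechanism introduced by #2727): on a tiny tensor the one-time compilation cost dominates, while on a large tensor the compiled kernel pays off over repeated calls.

```python
import time
import torch

def fake_quantize(x: torch.Tensor) -> torch.Tensor:
    # Toy int8-style quantize/dequantize round trip.
    scale = x.abs().amax() / 127.0
    return torch.round(x / scale).clamp(-128, 127) * scale

compiled_fake_quantize = torch.compile(fake_quantize)

for numel in (1_000, 10_000_000):
    x = torch.randn(numel)

    t0 = time.perf_counter()
    fake_quantize(x)
    eager_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    compiled_fake_quantize(x)  # the first call includes the compilation cost
    compiled_first_s = time.perf_counter() - t0

    print(f"numel={numel}: eager={eager_s:.4f}s, compiled first call={compiled_first_s:.4f}s")
```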

@ljaljushkin (Contributor) commented Jan 24, 2025:

> My assumption is that the same effect makes AWQ a bit slower in this case. It should not happen for larger models, though. I will check this. Thanks for the observation.

Thanks for the confirmation!
