
[GPU]Move RMSFusion pass ahead of ConvertPrecision #172

Closed

Conversation

ceciliapeng2011

Move the RMSFusion pass ahead of ConvertPrecision; this lets a larger range of nodes (including RMS) run with compressed precision.

xufang-lisa and others added 30 commits August 28, 2024 03:22
…lkit#24417)

### Details:
 - *add new property for model cache encryption/decryption function*
- *encrypt/decrypt topology in CPU model cache if the callbacks are
provided.*

### Tickets:
 - *CVS-139600*

---------

Co-authored-by: Chen Peter <[email protected]>
…nvinotoolkit#26236)

A size_t underflow might happen for the second dimension as well, for
example when the layout is ndhwc.
Test cases were extended to cover the `ndhwc` layout as well

### Tickets:
 - 14877
…penvinotoolkit#26269)

### Details:
- Currently, a threading library such as TBB is provided by openvino::runtime
itself. This does not work well when multiple find_package(OpenVINO) calls are
used within a project with different `COMPONENTS`
- E.g. here openvinotoolkit/openvino.genai#794
we have to add the Threading component even for cases where it is not really
used, because a find_package within a nested subdirectory populates the
properties of openvino::runtime, which is found in an internal directory.
- Example of usage
openvinotoolkit/openvino_tokenizers#236
)

### Details:
- This PR fixes incorrect default buffers sizes configuration for static
model in sdpa_opt kernel

### Tickets:
 - [CVS-150773](https://jira.devtools.intel.com/browse/CVS-150773)
…upport gcc < 9.0 (openvinotoolkit#26190)

### Details:
- Trying to compile the NPU plugin with GCC 8.5 or a lower version resulted
in issues with unsupported intrinsics
 ``` c++
 <source>:6:17: error: '_mm_loadu_si64' was not declared in this scope
     __m128i a = _mm_loadu_si64((const __m128i *)(data));
                 ^~~~~~~~~~~~~~
<source>:6:17: note: suggested alternative: '_mm_loadl_epi64'
     __m128i a = _mm_loadu_si64((const __m128i *)(data));
```
 - technically the two might differ: _mm_loadu_si64 explicitly does not require data alignment, while _mm_loadl_epi64 leaves alignment unspecified
 - the generated asm was checked (gcc 14.1); the code differs, but tests loading memory at 8-bit steps showed no difference, so both functions appear to work on unaligned arrays
![image](https://github.com/user-attachments/assets/c70ea4c2-2ab2-487d-b611-83ffe1a34ad3)

  - NPUW unit tests will soon be available in OV with the ongoing PR openvinotoolkit#25780; meanwhile they were tested locally with the new intrinsics

### Tickets:
 - 136269

Co-authored-by: Dmitry Matveev <[email protected]>
…it#26267)

### Details:
- Fix an out-of-bounds access in the permute kernel
- It happened because the kernel loaded data with vload even at the boundary


### Tickets:
 - 150360
…ches the squeeze axis (openvinotoolkit#26277)

### Details:
- Do not apply crop optimization for squeeze if the crop axis matches
the squeeze axis
…oolkit#26268)

### Details:
- Apparently, `github.event.merge_group.base_ref` for merge group checks
does not resolve into the same thing as `github.base_ref` for pull
request checks. It is an issue since merge queue beta:
[1](https://github.com/orgs/community/discussions/40277)
- `github.event.merge_group.base_ref` resolves into `refs/heads/master`
while we need only the `master` part to construct the remote cache
directory.

### Tickets:
 - *150744*
### Details:
 - *item1*
 - *...*

### Tickets:
 - *ticket-id*

---------

Signed-off-by: Chen Peter <[email protected]>
### Details:
 - *Change log level to debug*
 - *...*

### Tickets:
 - *ticket-id*
)

### Details:
 - Use OpenVINO-provided flags to enable AVX2 for NPUW

### Tickets:
 - 136004
### Details:
 - To avoid cmake warnings when PDPD is not installed
### Details:
 - Fix data size in test

### Tickets:
 - [*CVS-148605*](https://jira.devtools.intel.com/browse/CVS-148605)
…notoolkit#26276)

### Details:
 - *Do not set shape_changed to true when the shape is not changed.*
- *Fix the case where the output layout does not have the updated data_padding
calculated by calc_output_layouts in update_shape*

### Tickets:
 - *149773*
### Details:
- Fix the performance issues in aot_autograd path where constants are
being treated as inputs.

### Tickets:
 - https://jira.devtools.intel.com/browse/CVS-139183
…r fold_multiply_const=true option (openvinotoolkit#26280)

### Details:
- The PR restores MarkDequantizationSubgraph transformation behavior to
the state before openvinotoolkit#25783.
This is required to avoid compressed zero-points constant folding during
model conversion.

### Tickets:
 - *[CVS-150686](https://jira.devtools.intel.com/browse/CVS-150686)*
…t#26288)

**Details:** This PR finalizes support for and allows inferring the JAX ViT model

**Tickets:** TBD

Signed-off-by: Kazantsev, Roman <[email protected]>
…t#26289)

### Details:
- Add prefix support for the PagedAttention operation via the existing
pa_sdpa_opt kernel by processing each subsequence's tokens in sequential
mode, one by one
- Moved responsibility for intermediate buffer reallocation from the
sdpa_opt kernel to the kv_cache_update kernel (they both use the same data,
but now one of these buffers can be reused by the pa_sdpa_opt kernel, so
everything was moved to one place)
…otoolkit#26016)

### Details:
 - share the same L0 context from backend to compiler
 - and perform zeDestroy(context) in backend only 

### Tickets:
 - *ticket-id*

---------

Co-authored-by: Xin Wang <[email protected]>
…penvinotoolkit#26251)

### Details:
- Bitwise shift operations now support the int16 and uint16 data types
 - added unit and functional tests
…vinotoolkit#26231)

### Details:
 - *item1*
 - *...*

### Tickets:
 - CVS-148548

---------

Signed-off-by: Min, Byung-il <[email protected]>
…yout optimizer (openvinotoolkit#25708)

### Details:
 - Allow Transpose+Matmul+[Transpose] fusion for static shapes
- Allow Transpose in `MoveEltwiseUpThroughDataMov` for specific
Transpose -> Eltwise -> MatMul case
- Change order of `MoveEltwiseUpThroughDataMov` and
`ConvertMatMulToFullyConnected` to simplify callback
- Removed custom code for similar fusion in layout optimizer and related
debug knob
…sure proper data type propagation (openvinotoolkit#26299)

### Details:
This patch adds Validate pass call after IncreasePositionIdsPrecision to
ensure proper data type propagation

With this change the accuracy of llama-3-8b INT8 (and probably other LLMs)
is restored to the expected level
Before:
```
| Tasks  |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.6030|±  |N/A   |
|        |       |none  |     0|byte_perplexity|↓  |1.5189|±  |N/A   |
|        |       |none  |     0|word_perplexity|↓  |9.3472|±  |N/A   |
```

After:
```
| Tasks  |Version|Filter|n-shot|    Metric     |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.5351|±  |N/A   |
|        |       |none  |     0|byte_perplexity|↓  |1.4490|±  |N/A   |
|        |       |none  |     0|word_perplexity|↓  |7.2664|±  |N/A   |
```

### Tickets:
 - [CVS-147653](https://jira.devtools.intel.com/browse/CVS-147653)
…envinotoolkit#26295)

### Details:
- Use recursive_mutex instead of mutex in the compilation context, because a
deadlock happens during context compilation (push_task() -> remove_keys() on a
single thread) when CPU_PINNING=ON

### Tickets:
 - 150220
### Details:
 - *Fix issue with not set decoder type rt_info*
 - *Remove frontend rt_info from model*

### Tickets:
 - *ticket-id*
ilya-lavrenov and others added 29 commits September 17, 2024 13:28
### Details:
 - *item1*
 - *...*

### Tickets:
 - *ticket-id*
…oolkit#26625)

Updates the requirements on [jax](https://github.com/google/jax) to
permit the latest version.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/google/jax/releases">jax's
releases</a>.</em></p>
<blockquote>
<h2>JAX release v0.4.33</h2>
<p>This is a patch release on top of jax 0.4.32, that fixes two bugs
found in that
release.</p>
<p>A TPU-only data corruption bug was found in the version of libtpu
pinned by
JAX 0.4.32, which manifested only if multiple TPU slices were present in
the
same job, for example, if training on multiple v5e slices.</p>
<p>This release fixes that issue by pinning a fixed version of
<code>libtpu-nightly</code>.</p>
<p>This release also fixes an inaccurate result for F64 tanh on CPU (<a
href="https://redirect.github.com/google/jax/issues/23590">#23590</a>).</p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/google/jax/blob/main/CHANGELOG.md">jax's
changelog</a>.</em></p>
<blockquote>
<h2>jax 0.4.33 (September 16, 2024)</h2>
<p>This is a patch release on top of jax 0.4.32, that fixes two bugs
found in that
release.</p>
<p>A TPU-only data corruption bug was found in the version of libtpu
pinned by
JAX 0.4.32, which manifested only if multiple TPU slices were present in
the
same job, for example, if training on multiple v5e slices.
This release fixes that issue by pinning a fixed version of
<code>libtpu</code>.</p>
<p>This release fixes an inaccurate result for F64 tanh on CPU (<a
href="https://redirect.github.com/google/jax/issues/23590">#23590</a>).</p>
<h2>jax 0.4.32 (September 11, 2024)</h2>
<p>Note: This release was yanked from PyPi because of a data corruption
bug on TPU.
See the 0.4.33 release notes for more details.</p>
<ul>
<li>
<p>New Functionality</p>
<ul>
<li>Added {func}<code>jax.extend.ffi.ffi_call</code> and
{func}<code>jax.extend.ffi.ffi_lowering</code>
to support the use of the new {ref}<code>ffi-tutorial</code> to
interface with custom
C++ and CUDA code from JAX.</li>
</ul>
</li>
<li>
<p>Changes</p>
<ul>
<li><code>jax_pmap_no_rank_reduction</code> flag is set to
<code>True</code> by default.
<ul>
<li>array[0] on a pmap result now introduces a reshape (use array[0:1]
instead).</li>
<li>The per-shard shape (accessable via jax_array.addressable_shards or
jax_array.addressable_data(0)) now has a leading (1, ...). Update code
that directly accesses shards accordingly. The rank of the
per-shard-shape
now matches that of the global shape which is the same behavior as jit.
This avoids costly reshapes when passing results from pmap into
jit.</li>
</ul>
</li>
<li><code>jax_enable_memories</code> flag is set to <code>True</code> by
default.</li>
<li>{mod}<code>jax.numpy</code> now supports v2023.12 of the Python
Array API Standard.
See {ref}<code>python-array-api</code> for more information.</li>
<li>Computations on the CPU backend may now be dispatched asynchronously
in
more cases. Previously non-parallel computations were always dispatched
synchronously. You can recover the old behavior by setting
<code>jax.config.update('jax_cpu_enable_async_dispatch',
False)</code>.</li>
<li>Added new {func}<code>jax.process_indices</code> function to replace
the
<code>jax.host_ids()</code> function that was deprecated in JAX
v0.2.13.</li>
<li>To align with the behavior of <code>numpy.fabs</code>,
<code>jax.numpy.fabs</code> has been
modified to no longer support <code>complex dtypes</code>.</li>
<li><code>jax.tree_util.register_dataclass</code> now checks that
<code>data_fields</code>
and <code>meta_fields</code> includes all dataclass fields with
<code>init=True</code>
and only them, if <code>nodetype</code> is a dataclass.</li>
<li>Several {mod}<code>jax.numpy</code> functions now have full
{class}<code>~jax.numpy.ufunc</code>
interfaces, including {obj}<code>~jax.numpy.add</code>,
{obj}<code>~jax.numpy.multiply</code>,
{obj}<code>~jax.numpy.bitwise_and</code>,
{obj}<code>~jax.numpy.bitwise_or</code>,
{obj}<code>~jax.numpy.bitwise_xor</code>,
{obj}<code>~jax.numpy.logical_and</code>,
{obj}<code>~jax.numpy.logical_and</code>, and
{obj}<code>~jax.numpy.logical_and</code>.</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/google/jax/commit/80e1c94de63e7f89667cdf35f38d8fe298e97a50"><code>80e1c94</code></a>
Prepare for v0.4.33 release.</li>
<li><a
href="https://github.com/google/jax/commit/1594d2f30fdbfebf693aba4a2b264e4a3e52acc6"><code>1594d2f</code></a>
Prepare for v0.4.32 release.</li>
<li><a
href="https://github.com/google/jax/commit/ed849ff9e0576dcee2514741b5ffa951a94e20a8"><code>ed849ff</code></a>
Make sure to call the superclass' <strong>init</strong>() on a newly
created instance in P...</li>
<li><a
href="https://github.com/google/jax/commit/2bd1fdead81581db08ee84a0d1f82c407ccd6b11"><code>2bd1fde</code></a>
Relax test tolerance in pinv test to fix a CI failure on Windows
CPU.</li>
<li><a
href="https://github.com/google/jax/commit/e869a9d65e568e36e95940db302f94f9b7b973c4"><code>e869a9d</code></a>
Merge pull request <a
href="https://redirect.github.com/google/jax/issues/23415">#23415</a>
from kaixih:key_value_seq_lengths</li>
<li><a
href="https://github.com/google/jax/commit/ea68f4569c5474f20e52b96ab88c287ab843130a"><code>ea68f45</code></a>
Internal change</li>
<li><a
href="https://github.com/google/jax/commit/49dd6ed8d891ee6b7bbfcf7cc425382a7235556b"><code>49dd6ed</code></a>
Disable a pallas export compatibility test that fails on TPU v6e.</li>
<li><a
href="https://github.com/google/jax/commit/808003b4e29e878349192e0f63fa1a2454ace56b"><code>808003b</code></a>
Update users of jax.tree.map() to be more careful about how they handle
Nones.</li>
<li><a
href="https://github.com/google/jax/commit/e3c4b20fa04893ad986c3184387fbd3817f1515d"><code>e3c4b20</code></a>
[Pallas] Implement tiled and swizzled Memref loads for Mosaic GPU via
&quot;GPUBlo...</li>
<li><a
href="https://github.com/google/jax/commit/c659dc9a011bf8ff604a7e23f916920ff717288b"><code>c659dc9</code></a>
[Pallas] Disable win32 gpu_ops_test.</li>
<li>Additional commits viewable in <a
href="https://github.com/google/jax/compare/jaxlib-v0.1.32...jax-v0.4.33">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ilya Lavrenov <[email protected]>
Co-authored-by: Roman Kazantsev <[email protected]>
### Details:
 - *item1*
 - *...*

### Tickets:
 - *ticket-id*
### Details:
- enable `CMAKE_COMPILE_WARNING_AS_ERROR` for _intel\_npu_ directory
(except _thirdparty_)
 - remove warning suppression for deprecated declarations
 - fix existing warnings

### Tickets:
 - *134706*
### Details:
 - *item1*
 - *...*

### Tickets:
 - *ticket-id*
…ply-by-one pattern (openvinotoolkit#26641)

**Details:** In a customer model, there is a sub-graph under ShapeOf that
is equivalent to multiplication by one. Eliminating it makes
LSTMSequence fusion possible

**Ticket:** 149687

Signed-off-by: Kazantsev, Roman <[email protected]>
…olkit#26631)

### Details:
 - the latest torch version available for x86 is 2.2.2
 
### Tickets:
 - *ticket-id*
…toolkit#26638)

### Details:
 - *Implement extension versions properly for each method*

### Tickets:
 - *EISW-61724*
…t#26177)

### Details:
 - The Constant `get_vector` now works correctly for low precisions.
- Initialize unused bits in the Constant buffer for low precisions to
avoid undefined values.

### Tickets:
 - CVS-149867
…oolkit#25997)

### Details:
- *This PR adds functional tests for NPUW launched with online
partitioning, mostly the same tests that were added for unpartitioned
NPUW, except for some interesting ones for folding and pipelining*
- *This PR also introduces one accuracy test which, however, is for now
checked on a model that is simple in terms of computations, not structure*

### Tickets:
 - *ticket-id*
…openvinotoolkit#26662)

### Details:
 - Enabled parallel execution
- Fixed tests to be compatible with the pytest-xdist plugin (pass the function
name instead of a reference)
- `os.path.dirname` caused the model to be saved in the `/out` directory instead
of `/out/{temp_dir}` because the path did not have a '/' at the end, so
`temp_dir` was treated like a file

### Tickets:
 - [None](openvinotoolkit#20920)
 
Without parallel execution
![Screenshot from 2024-09-18
14-15-16](https://github.com/user-attachments/assets/f1b00954-de59-445a-904f-5b13819c0971)

With parallel execution (8 cpu cores)
![Screenshot from 2024-09-18
14-32-48](https://github.com/user-attachments/assets/2fc144cc-f771-43aa-909b-f41dedc1ccca)
…t#26554)

Providing info about support for MXFP4 data format in quantization on
CPU. This PR addresses JIRA ticket no. 151042.
…otoolkit#26640)

### Details:
 - support jax.lax.ge and jax.lax.gt operation
 - create unit tests

### Tickets:
 - [None](openvinotoolkit#26572)

---------

Co-authored-by: Roman Kazantsev <[email protected]>
**Details:** Fix performance inefficiencies

**Ticket:** 123298

Signed-off-by: Kazantsev, Roman <[email protected]>
…openvinotoolkit#26501)

### Details:
- For an fp model, some convolutions may not be compressed to fp16,
depending on the transformation policy, and those convolutions may have a
fused node which is fp16. Then the convolution node's input data type
will be fp32 while the output data type is fp16. Convolution needs to support
this case.

### Tickets:
 - 147689
…6599)

### Details:
- Target pattern: FCs that are to be fused by the horizontal fusing pass and
have Add users which can be regarded as bias adds. If we fuse the
FCs as is, the fused pattern becomes fused_fc -> VariadicSplit -> Add, so
the Adds cannot be fused into the FCs.
- This PR sets such Add users as the FCs' bias inputs so that the fused
FC can handle them as a fused bias.

### Tickets:
 - CVS-151841
…kit#26660)

### Details:
- *Currently, `loops_to_split` in `MHAParallelWAOptimizer` are stored in an
unordered_map, so the element order is not deterministic. This sporadically
leads to a situation where the loop's last iteration has a work_amount larger
than the main body's increment, which might lead to failures*
- *In this PR, `loops_to_split` are stored in a vector, so loop
information updates are always applied to expanded loop infos in a
determined order: FIRST_ITER -> MAIN_BODY -> LAST_ITER*
- *Also, a corresponding assert is added to
`InsertSpecificIterations::get_decomposed_loop_work_amount` in order to
throw an exception at an early stage in case of an incorrect configuration.
This assert also allows the changes to be covered by the existing tests (some
of them fail if the assert is added but the fix is not applied)*

### Tickets:
 - *N/A*
### Details:
 - *Fix a failure to deserialize the RMS node when reading a model from the cache*
 - *...*

### Tickets:
 - *152740*
…oolkit#21414)

### Details:
 - `ShapeOf` preserve lower bound when upper is infinite

### Tickets:
 - [CVS-126430](https://jira.devtools.intel.com/browse/CVS-126430)
…kit#26383)

### Details:
- Fix for the failing GPU functional i16 test. The problem is that i16 input
is wrongly converted to f32 in the constant and parameter ops.
- Had to disable the i16 case for Deformable conv, which won't work with
this fix. The motivation is that Deformable conv on GPU supports only the
f16, f32 and int8 types; it does not support the i16 case, which was working
only due to an implicit type conversion that this PR changes.
### Details:
 - Add guidelines on how to test new JS API functionality
 - Add a guide on how to extend JS API functionality


### Tickets:
- [CVS-151489](https://jira.devtools.intel.com/browse/CVS-151489)
[CVS-151492](https://jira.devtools.intel.com/browse/CVS-151492)

---------

Co-authored-by: Tatiana Savina <[email protected]>
### Details:
 - *item1*
 - *...*

### Tickets:
 - *ticket-id*

Co-authored-by: Karol Blaszczak <[email protected]>