forked from openvinotoolkit/openvino
[GPU]Move RMSFusion pass ahead of ConvertPrecision #172
Closed
ceciliapeng2011 wants to merge 10,000 commits into slyalin:master from ceciliapeng2011:cecilia/move/rmsfusion
Conversation
…lkit#24417) ### Details: - *add new property for model cache encryption/decryption function* - *encrypt/decrypt topology in CPU model cache if the callbacks are provided.* ### Tickets: - *CVS-139600* --------- Co-authored-by: Chen Peter <[email protected]>
…nvinotoolkit#26236) size_t underflow might happen for the second dimension as well, for example when layout is ndhwc. New test cases extended for `ndhwc` layout as well ### Tickets: - 14877
…penvinotoolkit#26269) ### Details: - Currently, threading library like TBB is provided by openvino::runtime itself. It does not work well when multiple find_package(OpenVINO) are used within the project and with different `COMPONENTS` - E.g. here openvinotoolkit/openvino.genai#794 we have to add Threading component even for cases, when it's not really used, because find_package within nested subdirectory populates properties of openvino::runtime which is found in internal directory. - Example of usage openvinotoolkit/openvino_tokenizers#236
…er (openvinotoolkit#26273) Fixed alignment in model caching graphs.
) ### Details: - This PR fixes incorrect default buffers sizes configuration for static model in sdpa_opt kernel ### Tickets: - [CVS-150773](https://jira.devtools.intel.com/browse/CVS-150773)
…upport gcc < 9.0 (openvinotoolkit#26190) ### Details: - compiling the npu plugin with gcc 8.5 or older fails because of unsupported intrinsics ``` c++ <source>:6:17: error: '_mm_loadu_si64' was not declared in this scope __m128i a = _mm_loadu_si64((const __m128i *)(data)); ^~~~~~~~~~~~~~ <source>:6:17: note: suggested alternative: '_mm_loadl_epi64' __m128i a = _mm_loadu_si64((const __m128i *)(data)); ``` - technically the two may differ: _mm_loadu_si64 explicitly allows unaligned data, while the _mm_loadl_epi64 documentation does not specify an alignment requirement - the generated asm (gcc 14.1) differs, but tests stepping through memory at 8-bit offsets showed no difference, so both functions appear to work on unaligned arrays ![image](https://github.com/user-attachments/assets/c70ea4c2-2ab2-487d-b611-83ffe1a34ad3) - npuw unit tests will soon be available in OV with the ongoing PR openvinotoolkit#25780; the new intrinsics were tested locally ### Tickets: - *ticket-id* - 136269 Co-authored-by: Dmitry Matveev <[email protected]>
…it#26267) ### Details: - Fix out-of-bound access from permute kernel - It happened because the kernel loaded with vload even at the boundary ### Tickets: - 150360
…ches the squeeze axis (openvinotoolkit#26277) ### Details: - Do not apply crop optimization for squeeze if the crop axis matches the squeeze axis
…oolkit#26268) ### Details: - Apparently, `github.event.merge_group.base_ref` for merge group checks does not resolve into the same thing as `github.base_ref` for pull request checks. It is an issue since merge queue beta: [1](https://github.com/orgs/community/discussions/40277) - `github.event.merge_group.base_ref` resolves into `refs/heads/master` while we need only the `master` part to construct the remote cache directory. ### Tickets: - *150744*
### Details: - *item1* - *...* ### Tickets: - *ticket-id* --------- Signed-off-by: Chen Peter <[email protected]>
### Details: - *Change log level to debug* - *...* ### Tickets: - *ticket-id*
### Details: - To avoid cmake warnings when PDPD is not installed
### Details: - Fix data size in test ### Tickets: - [*CVS-148605*](https://jira.devtools.intel.com/browse/CVS-148605)
…notoolkit#26276) ### Details: - *Do not set shape_changed to true when shape is not changed.* - *Fix output layout does not have updated data_padding which is calculated by calc_output_layouts in update_shape* ### Tickets: - *149773*
### Details: - Fix the performance issues in aot_autograd path where constants are being treated as inputs. ### Tickets: - https://jira.devtools.intel.com/browse/CVS-139183
…r fold_multiply_const=true option (openvinotoolkit#26280) ### Details: - The PR restores MarkDequantizationSubgraph transformation behavior to the state before openvinotoolkit#25783. This is required to avoid compressed zero-points constant folding during model conversion. ### Tickets: - *[CVS-150686](https://jira.devtools.intel.com/browse/CVS-150686)*
…t#26288) **Details:** This PR finalizes support and enables inference of the JAX ViT model **Tickets:** TBD Signed-off-by: Kazantsev, Roman <[email protected]>
…lkit#26291) ### Details: - *Significantly speed up LogSoftmax operation* ### Tickets: - *[148550](https://jira.devtools.intel.com/browse/CVS-148550)*
…t#26289) ### Details: - Add prefix support for the PagedAttention operation via the existing pa_sdpa_opt kernel by processing each subsequence's tokens sequentially, one by one - Moved responsibility for intermediate buffer reallocation from the sdpa_opt kernel to the kv_cache_update kernel (they both use the same data, but now one of these buffers can be reused by the pa_sdpa_opt kernel, so everything was moved to one place)
…otoolkit#26016) ### Details: - share the same L0 context from backend to compiler - and perform zeDestroy(context) in backend only ### Tickets: - *ticket-id* --------- Co-authored-by: Xin Wang <[email protected]>
…penvinotoolkit#26251) ### Details: - Bitwise shift operations now support the int16 and uint16 data types - added unit and functional tests
…vinotoolkit#26231) ### Details: - *item1* - *...* ### Tickets: - CVS-148548 --------- Signed-off-by: Min, Byung-il <[email protected]>
### Tickets: - *148717*
…yout optimizer (openvinotoolkit#25708) ### Details: - Allow Transpose+Matmul+[Transpose] fusion for static shapes - Allow Transpose in `MoveEltwiseUpThroughDataMov` for the specific Transpose -> Eltwise -> MatMul case - Change the order of `MoveEltwiseUpThroughDataMov` and `ConvertMatMulToFullyConnected` to simplify the callback - Removed custom code for a similar fusion in the layout optimizer and the related debug knob
…sure proper data type propagation (openvinotoolkit#26299) ### Details: This patch adds Validate pass call after IncreasePositionIdsPrecision to ensure proper data type propagation With this change the accuracy of llama-3-8b INT8 (and other LLMs probably) can be restored to expected level Before: ``` | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| |--------|------:|------|-----:|---------------|---|-----:|---|------| |wikitext| 2|none | 0|bits_per_byte |↓ |0.6030|± |N/A | | | |none | 0|byte_perplexity|↓ |1.5189|± |N/A | | | |none | 0|word_perplexity|↓ |9.3472|± |N/A | ``` After: ``` | Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr| |--------|------:|------|-----:|---------------|---|-----:|---|------| |wikitext| 2|none | 0|bits_per_byte |↓ |0.5351|± |N/A | | | |none | 0|byte_perplexity|↓ |1.4490|± |N/A | | | |none | 0|word_perplexity|↓ |7.2664|± |N/A | ``` ### Tickets: - [CVS-147653](https://jira.devtools.intel.com/browse/CVS-147653)
…envinotoolkit#26295) ### Details: - Use recursive_mutex instead of mutex in the compilation context, because a deadlock occurs during context compilation (push_task() -> remove_keys() on a single thread) when CPU_PINNING=ON ### Tickets: - 150220
### Details: - *Fix issue with not set decoder type rt_info* - *Remove frontend rt_info from model* ### Tickets: - *ticket-id*
### Details: - *item1* - *...* ### Tickets: - *ticket-id*
…oolkit#26625) Updates the requirements on [jax](https://github.com/google/jax) to permit the latest version. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/google/jax/releases">jax's releases</a>.</em></p> <blockquote> <h2>JAX release v0.4.33</h2> <p>This is a patch release on top of jax 0.4.32, that fixes two bugs found in that release.</p> <p>A TPU-only data corruption bug was found in the version of libtpu pinned by JAX 0.4.32, which manifested only if multiple TPU slices were present in the same job, for example, if training on multiple v5e slices.</p> <p>This release fixes that issue by pinning a fixed version of <code>libtpu-nightly</code>.</p> <p>This release also fixes an inaccurate result for F64 tanh on CPU (<a href="https://redirect.github.com/google/jax/issues/23590">#23590</a>).</p> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/google/jax/blob/main/CHANGELOG.md">jax's changelog</a>.</em></p> <blockquote> <h2>jax 0.4.33 (September 16, 2024)</h2> <p>This is a patch release on top of jax 0.4.32, that fixes two bugs found in that release.</p> <p>A TPU-only data corruption bug was found in the version of libtpu pinned by JAX 0.4.32, which manifested only if multiple TPU slices were present in the same job, for example, if training on multiple v5e slices. This release fixes that issue by pinning a fixed version of <code>libtpu</code>.</p> <p>This release fixes an inaccurate result for F64 tanh on CPU (<a href="https://redirect.github.com/google/jax/issues/23590">#23590</a>).</p> <h2>jax 0.4.32 (September 11, 2024)</h2> <p>Note: This release was yanked from PyPi because of a data corruption bug on TPU. 
See the 0.4.33 release notes for more details.</p> <ul> <li> <p>New Functionality</p> <ul> <li>Added {func}<code>jax.extend.ffi.ffi_call</code> and {func}<code>jax.extend.ffi.ffi_lowering</code> to support the use of the new {ref}<code>ffi-tutorial</code> to interface with custom C++ and CUDA code from JAX.</li> </ul> </li> <li> <p>Changes</p> <ul> <li><code>jax_pmap_no_rank_reduction</code> flag is set to <code>True</code> by default. <ul> <li>array[0] on a pmap result now introduces a reshape (use array[0:1] instead).</li> <li>The per-shard shape (accessable via jax_array.addressable_shards or jax_array.addressable_data(0)) now has a leading (1, ...). Update code that directly accesses shards accordingly. The rank of the per-shard-shape now matches that of the global shape which is the same behavior as jit. This avoids costly reshapes when passing results from pmap into jit.</li> </ul> </li> <li><code>jax_enable_memories</code> flag is set to <code>True</code> by default.</li> <li>{mod}<code>jax.numpy</code> now supports v2023.12 of the Python Array API Standard. See {ref}<code>python-array-api</code> for more information.</li> <li>Computations on the CPU backend may now be dispatched asynchronously in more cases. Previously non-parallel computations were always dispatched synchronously. 
You can recover the old behavior by setting <code>jax.config.update('jax_cpu_enable_async_dispatch', False)</code>.</li> <li>Added new {func}<code>jax.process_indices</code> function to replace the <code>jax.host_ids()</code> function that was deprecated in JAX v0.2.13.</li> <li>To align with the behavior of <code>numpy.fabs</code>, <code>jax.numpy.fabs</code> has been modified to no longer support <code>complex dtypes</code>.</li> <li><code>jax.tree_util.register_dataclass</code> now checks that <code>data_fields</code> and <code>meta_fields</code> includes all dataclass fields with <code>init=True</code> and only them, if <code>nodetype</code> is a dataclass.</li> <li>Several {mod}<code>jax.numpy</code> functions now have full {class}<code>~jax.numpy.ufunc</code> interfaces, including {obj}<code>~jax.numpy.add</code>, {obj}<code>~jax.numpy.multiply</code>, {obj}<code>~jax.numpy.bitwise_and</code>, {obj}<code>~jax.numpy.bitwise_or</code>, {obj}<code>~jax.numpy.bitwise_xor</code>, {obj}<code>~jax.numpy.logical_and</code>, {obj}<code>~jax.numpy.logical_and</code>, and {obj}<code>~jax.numpy.logical_and</code>.</li> </ul> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/google/jax/commit/80e1c94de63e7f89667cdf35f38d8fe298e97a50"><code>80e1c94</code></a> Prepare for v0.4.33 release.</li> <li><a href="https://github.com/google/jax/commit/1594d2f30fdbfebf693aba4a2b264e4a3e52acc6"><code>1594d2f</code></a> Prepare for v0.4.32 release.</li> <li><a href="https://github.com/google/jax/commit/ed849ff9e0576dcee2514741b5ffa951a94e20a8"><code>ed849ff</code></a> Make sure to call the superclass' <strong>init</strong>() on a newly created instance in P...</li> <li><a href="https://github.com/google/jax/commit/2bd1fdead81581db08ee84a0d1f82c407ccd6b11"><code>2bd1fde</code></a> Relax test tolerance in pinv test to fix a CI failure on Windows CPU.</li> <li><a href="https://github.com/google/jax/commit/e869a9d65e568e36e95940db302f94f9b7b973c4"><code>e869a9d</code></a> Merge pull request <a href="https://redirect.github.com/google/jax/issues/23415">#23415</a> from kaixih:key_value_seq_lengths</li> <li><a href="https://github.com/google/jax/commit/ea68f4569c5474f20e52b96ab88c287ab843130a"><code>ea68f45</code></a> Internal change</li> <li><a href="https://github.com/google/jax/commit/49dd6ed8d891ee6b7bbfcf7cc425382a7235556b"><code>49dd6ed</code></a> Disable a pallas export compatibility test that fails on TPU v6e.</li> <li><a href="https://github.com/google/jax/commit/808003b4e29e878349192e0f63fa1a2454ace56b"><code>808003b</code></a> Update users of jax.tree.map() to be more careful about how they handle Nones.</li> <li><a href="https://github.com/google/jax/commit/e3c4b20fa04893ad986c3184387fbd3817f1515d"><code>e3c4b20</code></a> [Pallas] Implement tiled and swizzled Memref loads for Mosaic GPU via "GPUBlo...</li> <li><a href="https://github.com/google/jax/commit/c659dc9a011bf8ff604a7e23f916920ff717288b"><code>c659dc9</code></a> [Pallas] Disable win32 gpu_ops_test.</li> <li>Additional commits viewable in <a 
href="https://github.com/google/jax/compare/jaxlib-v0.1.32...jax-v0.4.33">compare view</a></li> </ul> </details> --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Ilya Lavrenov <[email protected]> Co-authored-by: Roman Kazantsev <[email protected]>
### Details: - *item1* - *...* ### Tickets: - *ticket-id*
### Tickets: - *151796*
### Details: - enable `CMAKE_COMPILE_WARNING_AS_ERROR` for _intel\_npu_ directory (except _thirdparty_) - remove warning suppression for deprecated declarations - fix existing warnings ### Tickets: - *134706*
### Details: - *item1* - *...* ### Tickets: - *ticket-id*
…ply-by-one pattern (openvinotoolkit#26641) **Details:** In customer model, there is a sub-graph under ShapeOf that is equivalent to multiplication by one. It can be eliminated and make LSTMSequence fusion possible **Ticket:** 149687 Signed-off-by: Kazantsev, Roman <[email protected]>
…penvinotoolkit#26585) ### Tickets: - *151717*
…olkit#26631) ### Details: - the latest available torch version for x86 - 2.2.2 ### Tickets: - *ticket-id*
…toolkit#26638) ### Details: - *Implement extension versions properly for each method* ### Tickets: - *EISW-61724*
…t#26177) ### Details: - The Constant `get_vector` works correctly for low precisions. - Initialize not used bits in Constant buffer for low precisions to avoid undefined values. ### Tickets: - CVS-149867
…vinotoolkit#26597) Co-authored-by: Andrei Kashchikhin <[email protected]>
…oolkit#25997) ### Details: - *This PR adds functional tests for NPUW launched with online partitioning, mostly the same tests that were added for unpartitioned NPUW, except for some interesting ones for folding and pipelining* - *This PR also introduces 1 accuracy test, that, however, checked on simple (in terms of computations, not structure) model for now* ### Tickets: - *ticket-id*
…openvinotoolkit#26662) ### Details: - Enabled parallel call - Fixed tests to be compatible with the pytest-xdist plugin (pass the function name instead of a reference) - `os.path.dirname` caused saving the model in the `/out` directory instead of `/out/{temp_dir}` because the path lacked a trailing '/' and `temp_dir` was treated like a file ### Tickets: - [None](openvinotoolkit#20920) Without parallel execution ![Screenshot from 2024-09-18 14-15-16](https://github.com/user-attachments/assets/f1b00954-de59-445a-904f-5b13819c0971) With parallel execution (8 cpu cores) ![Screenshot from 2024-09-18 14-32-48](https://github.com/user-attachments/assets/2fc144cc-f771-43aa-909b-f41dedc1ccca)
…t#26554) Providing info about support for MXFP4 data format in quantization on CPU. This PR addresses JIRA ticket no. 151042.
…otoolkit#26640) ### Details: - support jax.lax.ge and jax.lax.gt operation - create unit tests ### Tickets: - [None](openvinotoolkit#26572) --------- Co-authored-by: Roman Kazantsev <[email protected]>
**Details:** Fix performance inefficiencies **Ticket:** 123298 Signed-off-by: Kazantsev, Roman <[email protected]>
…openvinotoolkit#26501) ### Details: - For a floating-point model, some convolutions may not be compressed to fp16, depending on the transformation policy, while such a convolution may have a fused node that is fp16. The convolution node's input data type will then be fp32 while its output data type is fp16. Convolution needs to support this case. ### Tickets: - 147689
…6599) ### Details: - Target pattern: FCs that will be fused by the horizontal fusing pass and that have Add users which can be regarded as bias adds. If we fuse the FCs as is, the fused pattern becomes fused_fc -> VariadicSplit -> Add, so the Adds cannot be fused into the FCs. - This PR sets such Add users as the FCs' bias inputs so that the fused FC can handle them as a fused bias. ### Tickets: - CVS-151841
…kit#26660) ### Details: - *Currently, `loops_to_split` in `MHAParallelWAOptimizer` is stored in an unordered_map, so the element order is not deterministic. This sporadically leads to situations where the loop's last iteration has a work_amount greater than the main body's increment, which can cause failures* - *In this PR, `loops_to_split` is stored in a vector, so loop-information updates are always applied to the expanded loop infos in a deterministic order: FIRST_ITER -> MAIN_BODY -> LAST_ITER* - *A corresponding assert is also added to `InsertSpecificIterations::get_decomposed_loop_work_amount` to throw an exception at an early stage in case of an incorrect configuration. This assert also lets the existing tests cover the changes (some of them fail if the assert is added but the fix is not applied)* ### Tickets: - *N/A*
### Details: - *Could not deserialize RMS node during reading model from cache* - *...* ### Tickets: - *152740*
…oolkit#21414) ### Details: - `ShapeOf` preserve lower bound when upper is infinite ### Tickets: - [CVS-126430](https://jira.devtools.intel.com/browse/CVS-126430)
…kit#26383) ### Details: - Fix for the failing GPU functional i16 test. The problem is that i16 input is wrongly converted to f32 in the Constant and Parameter ops. - Had to disable the i16 case for Deformable convolution, which won't work with this fix. The motivation is that Deformable convolution on GPU supports only f16, f32 and int8 types - it does not support the i16 case, which was working only due to some implicit type conversion that this PR changes.
### Details: - Add guidelines how to test new js api functionality - Add guide how to extend JS API functionality ### Tickets: - [CVS-151489](https://jira.devtools.intel.com/browse/CVS-151489) [CVS-151492](https://jira.devtools.intel.com/browse/CVS-151492) --------- Co-authored-by: Tatiana Savina <[email protected]>
### Details: - *item1* - *...* ### Tickets: - *ticket-id* Co-authored-by: Karol Blaszczak <[email protected]>
… for in rerunner (openvinotoolkit#26691) ### Tickets: - *152565*
Move the RMSFusion pass ahead of ConvertPrecision. This makes a larger range of nodes (including RMS) run with compressed precision.