[NPUW] Fix memory consumption with bank uids #28382
Conversation
```cpp
// uid may be coming from a 2nd (3rd, ...) model
// detach the tensor here just in case
const_cast<LazyTensor&>(iter_device->second.lt).detach();
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
NPUW_ASSERT(it_registered != device_bank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();
```
Previously our .get() had side effects: if the requested LazyTensor hadn't been evaluated yet, it was evaluated and copied to the device memory, and we called .detach() right after. After the recent changes, .get() is just .get(). So here's the question - if no allocation happens at this point, why hasn't whatever has to be detached been detached already? It seems we have too many .detach() calls now - it just means our flow is not very clear.
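A minimal toy sketch of the two contracts being contrasted - not the actual NPUW classes, all names here are hypothetical: the old get() evaluates and detaches as a side effect, the new one is a pure lookup that leaves detaching to somebody else.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct ToyLazyTensor {
    std::shared_ptr<std::vector<float>> src;   // host-side weights, alive until detach()
    std::vector<float> eval() const { return *src; }
    void detach() { src.reset(); }             // drop the reference to the source
};

struct ToyBank {
    // Old contract: get() evaluates on first use AND detaches the source,
    // so the caller never has to track when the host weights are released.
    static std::vector<float> get_with_effects(ToyLazyTensor& lt) {
        auto device_copy = lt.eval();   // "evaluate and copy to device memory"
        lt.detach();                    // release the host-side source right away
        return device_copy;
    }

    // New contract: get() is a pure lookup; someone else must detach,
    // which is exactly what raises the question above.
    static std::vector<float> get_pure(const ToyLazyTensor& lt) {
        return lt.eval();
    }
};

int main() {
    ToyLazyTensor lt{std::make_shared<std::vector<float>>(16, 1.0f)};
    auto t = ToyBank::get_with_effects(lt);
    assert(!lt.src);     // source dropped as a side effect of the old get()
    (void)t;
    return 0;
}
```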
```cpp
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
```
Why doesn't this happen on evaluate_and_allocate()?
```cpp
// Also detach from registered tensors
auto it_registered = dbank.registered_tensors.find(tensor);
NPUW_ASSERT(it_registered != dbank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();
```
```cpp
return allocated_tensor;
```
What we've had before the changes (see the sketch after this list):
- A LazyTensor was a key to the device tensor (for the transition from lazy_closure to closure)
- Kvcache and prefill models had their own individual LazyTensors (as distinct objects), populated from the original weights independently
- When a compiled model queried a real tensor for the closure, it passed the LazyTensor as a key and the Bank called .detach() in exchange
- That is, the memory referred to twice was detached first in the kvcache and then in the prefill model, so it was finally cleared.
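A hedged sketch of that pre-change flow, with toy stand-in types (names are illustrative, not the real bank API): two models hold distinct LazyTensor-like objects referencing the same weights, and each get()-with-detach drops one reference, so serving both models frees the memory.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Weights = std::vector<float>;

struct LT {                    // stand-in for LazyTensor
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);

    LT kvcache_lt{original};   // the kvcache model's own key object
    LT prefill_lt{original};   // the prefill model's own key object
    original.reset();          // only the two LTs keep the weights alive now

    // Bank.get(kvcache_lt): evaluate, copy to device, detach in exchange.
    kvcache_lt.detach();
    // Bank.get(prefill_lt): the same for the second model.
    prefill_lt.detach();

    // Both references are gone, so the host weights are finally cleared.
    assert(!kvcache_lt.src && !prefill_lt.src);
    return 0;
}
```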
What do you have now:
- A LazyTensor is a key to an ID
- First the kvcache model exchanges its LazyTensors for IDs - this is where the IDs are registered and the LTs are collected inside
- Then the kvcache model evaluates the full bank - this is the point where the LazyTensors should be detached
- After that the prefill model comes with its own LazyTensors and there's a hit, so the same IDs are returned
- Evaluate on the bank essentially does nothing (and, by the way, you didn't optimize this part per the past review)
- We end up with memory leaked as the 2nd set of tensors was not detached.
How does detaching the existing registered tensors in the bank help the problem? I believe this is what you catch in the second get, but in this case your internal storage still refers to the first set of LTs, not the second one - see the sketch below.
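A minimal sketch of that failure mode, again with toy types (the map, the uid scheme, and keying on weight identity via a raw pointer are illustrative simplifications, not the real bank): the bank stores the first model's LT and detaches only that stored copy, while the second model's LT keeps the weights alive.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Weights = std::vector<float>;
struct LT {
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

struct LeakyBank {
    std::map<const Weights*, std::pair<int, LT>> registered;  // LT -> uid
    int next_uid = 0;

    int registerLT(const LT& lt) {
        // On a hit the same uid is returned, but the *incoming* lt is not
        // detached - the caller's copy keeps the weights alive.
        auto [it, inserted] = registered.try_emplace(lt.src.get(), next_uid, lt);
        if (inserted) ++next_uid;
        return it->second.first;
    }

    void evaluate_and_allocate() {
        for (auto& [ptr, entry] : registered) {
            entry.second.detach();  // detaches only the stored (first) LT
        }
    }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);
    LT kv{original}, prefill{original};
    original.reset();

    LeakyBank bank;
    bank.registerLT(kv);            // uid registered, LT collected inside
    bank.evaluate_and_allocate();   // the stored LT is detached here
    kv.detach();                    // the kvcache side drops its copy too

    bank.registerLT(prefill);       // hit: the same uid comes back
    bank.evaluate_and_allocate();   // essentially a no-op now

    assert(prefill.src);            // still alive: the leaked second set
    return 0;
}
```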
The right solution is here (sketched below):
- Detach the LT in the bank once it is evaluated
- On registerLT, detach the incoming LT if it was found in the bank. That's it.
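A hedged sketch of that proposed fix, reusing the same toy types (the map and raw-pointer key are illustrative, not the real bank): detach the stored LT as soon as it is evaluated, and detach the incoming LT on a registerLT hit.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Weights = std::vector<float>;
struct LT {
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

struct FixedBank {
    std::map<const Weights*, std::pair<int, LT>> registered;  // LT -> uid
    int next_uid = 0;

    int registerLT(LT& lt) {
        auto [it, inserted] = registered.try_emplace(lt.src.get(), next_uid, lt);
        if (inserted) {
            ++next_uid;
        } else {
            lt.detach();  // hit: the bank already tracks these weights
        }
        return it->second.first;
    }

    void evaluate_and_allocate() {
        for (auto& [ptr, entry] : registered) {
            // ...evaluation and the device copy would happen here...
            entry.second.detach();  // detach once it is evaluated
        }
    }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);
    LT kv{original}, prefill{original};
    original.reset();

    FixedBank bank;
    bank.registerLT(kv);
    bank.evaluate_and_allocate();    // stored LT detached right after evaluation
    kv.detach();

    bank.registerLT(prefill);        // hit: the incoming LT is detached at once
    assert(!prefill.src);            // no second set of live references, no leak
    return 0;
}
```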
```cpp
return allocated_tensor;
```
This method has a return value but nobody is using it. I'd drop the body of this function completely - it will be easier to do all of this in the parent loop.
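A purely illustrative sketch of that suggestion (the helper name and loop structure are hypothetical, not the actual NPUW code): with the helper's body folded into the parent loop, there is no unused return value and the detach point is visible in one place.

```cpp
#include <vector>

struct DeviceTensor {};

struct Entry {
    bool evaluated = false;
    DeviceTensor device;
};

// Instead of a helper like
//   DeviceTensor evaluate_and_allocate_one(Entry&);  // return value unused
// the work sits directly in the loop that walks the bank.
void evaluate_and_allocate(std::vector<Entry>& bank) {
    for (auto& e : bank) {
        if (e.evaluated) continue;
        e.device = DeviceTensor{};  // evaluate + allocate on device (stubbed)
        e.evaluated = true;         // detach()/cleanup would also go right here
    }
}

int main() {
    std::vector<Entry> bank(4);
    evaluate_and_allocate(bank);
    return 0;
}
```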
Introduced in #28282