[NPUW] Fix memory consumption with bank uids #28382

Open · wants to merge 1 commit into master

Conversation

smirnov-alexey (Contributor)

Fixes a memory-consumption regression introduced in #28282.

Comment on lines 70 to +76
// uid may be coming from a 2nd (3rd, ...) model
// detach the tensor here just in case
const_cast<LazyTensor&>(iter_device->second.lt).detach();
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
NPUW_ASSERT(it_registered != device_bank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();
Contributor

Previously our .get() had side effects: if the requested LazyTensor hadn't been evaluated yet, it was evaluated and copied to device memory. And we called .detach() right after.

After the recent changes, .get() is just a .get(). So here's the question: if no allocation happens at this point, why hasn't whatever has to be detached been detached already?

It seems we have too many .detach() calls now, which just means our flow is not very clear.
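
To illustrate the contrast, a minimal hypothetical sketch (Tensor, allocate_and_copy() and eval() here are illustrative stand-ins, not the actual NPUW API):

// Before: get() evaluated and allocated on demand, so calling detach()
// right after it was meaningful -- the host copy had just been consumed.
Tensor get_with_effects(LazyTensor& lt) {
    Tensor dev = allocate_and_copy(lt.eval());  // side effect: device alloc
    lt.detach();                                // natural detach point
    return dev;
}

// After: get() is a pure lookup by uid; evaluation and allocation happen
// in evaluate_and_allocate(), which is where detach() should live too.
Tensor get_pure(const std::map<int64_t, Tensor>& storage, int64_t uid) {
    return storage.at(uid);
}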

Comment on lines +73 to +74
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
Contributor

Why doesn't this happen in evaluate_and_allocate()?

Comment on lines +144 to 149
// Also detach from registered tensors
auto it_registered = dbank.registered_tensors.find(tensor);
NPUW_ASSERT(it_registered != dbank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();

return allocated_tensor;
Contributor

What we had before the changes:

  1. A LazyTensor was a key to the device tensor (for the transition from lazy_closure to closure)
  2. The kvcache and prefill models had their own individual LazyTensors (as distinct objects), populated independently from the original weights
  3. When a compiled model queried a real tensor for the closure, it passed the LazyTensor as a key and the Bank called .detach() in exchange
  4. That is, memory referenced twice was detached first in the kvcache and then in the prefill model, so it was finally released.

What you have now:

  1. A LazyTensor is a key to an ID
  2. First the kvcache model exchanges its LazyTensors for IDs - this is where IDs are registered and LTs are collected inside
  3. Then the kvcache model evaluates the full bank - this is the point where LazyTensors should be detached
  4. After that the prefill model comes with its own LazyTensors and there's a hit; the same IDs are returned
  5. Evaluate on the bank essentially does nothing (and, by the way, you didn't optimize this part per the past review).
  6. We end up with leaked memory, as the 2nd set of tensors was never detached.

How does detaching the existing registered tensors in the bank help the problem? I believe this is what you catch in the second get(), but in that case your internal storage still refers to the first set of LTs, not the second one.

The right solution is:

  1. Detach the LT in the bank once it is evaluated
  2. On registerLT(), detach the incoming LT if it was found in the bank. That's it - see the sketch below.
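
A minimal self-contained sketch of that flow, assuming simplified stand-ins for LazyTensor, Bank and the device copy (none of this is the actual NPUW code):

#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct LazyTensor {
    const void* id = nullptr;                      // stable identity of the original weights
    std::shared_ptr<std::vector<float>> weights;   // host reference to release
    std::vector<float> eval() const { return *weights; }
    void detach() { weights.reset(); }             // drop this holder's host reference
    bool operator<(const LazyTensor& o) const { return id < o.id; }  // compare by identity
};

struct DeviceBank {
    std::map<LazyTensor, int64_t> registered_tensors;  // LT -> uid
    std::map<int64_t, std::vector<float>> storage;     // uid -> "device" copy
    int64_t next_uid = 0;
};

struct Bank {
    std::map<std::string, DeviceBank> m_device_banks;

    int64_t registerLT(LazyTensor& lt, const std::string& device) {
        auto& db = m_device_banks[device];
        auto it = db.registered_tensors.find(lt);
        if (it != db.registered_tensors.end()) {
            // Hit: a 2nd model brought an equivalent LT. The bank already
            // tracks the 1st copy, so release the incoming one right away.
            lt.detach();
            return it->second;
        }
        auto uid = db.next_uid++;
        db.registered_tensors.emplace(lt, uid);
        return uid;
    }

    void evaluate_and_allocate(const std::string& device) {
        auto& db = m_device_banks[device];
        for (auto& [lt, uid] : db.registered_tensors) {
            if (db.storage.count(uid)) continue;   // already evaluated, skip
            db.storage[uid] = lt.eval();           // stands in for the device copy
            // Single, well-defined detach point: right after evaluation.
            const_cast<LazyTensor&>(lt).detach();  // key ordering uses id, so this is safe
        }
    }
};

With this, the 2nd model's LTs are released at registration time and the bank's own copies right after evaluation, so no host tensor outlives its device allocation.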

Comment on lines +148 to 149

return allocated_tensor;
Contributor

This method has a return value, but nobody uses it. I'd drop the body of this function completely - it would be easier to do all of this in the parent loop; see the sketch below.
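
For illustration, the suggested inlining could look roughly like this (the surrounding loop, dbank and allocate_and_copy() are hypothetical here):

// Hypothetical parent loop with the helper's body folded in directly.
for (auto& [tensor, uid] : dbank.registered_tensors) {
    if (dbank.storage.count(uid)) continue;            // already on device
    dbank.storage[uid] = allocate_and_copy(tensor, device);
    const_cast<LazyTensor&>(tensor).detach();          // release host memory
}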

Labels: category: NPU, category: NPUW, Code Freeze