[NPUW] Fix memory consumption with bank uids #28382
Conversation
```cpp
// uid may be coming from a 2nd (3rd, ...) model
// detach the tensor here just in case
const_cast<LazyTensor&>(iter_device->second.lt).detach();
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
NPUW_ASSERT(it_registered != device_bank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();
```
Previously our .get() had side effects: if the requested LazyTensor hadn't been evaluated yet, it was evaluated and copied to the device memory, and we called .detach() right after. After the recent changes, .get() is just .get(). So here's the question - if no allocation happens at this point, why hasn't whatever has to be detached been detached already? It seems we have too many .detach() calls now - it just means our flow is not very clear.
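A minimal toy sketch of the two contracts being contrasted - not the actual NPUW classes, all names here are hypothetical: the old get() evaluates and detaches as a side effect, the new one is a pure lookup that leaves detaching to somebody else.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct ToyLazyTensor {
    std::shared_ptr<std::vector<float>> src;   // host-side weights, alive until detach()
    std::vector<float> eval() const { return *src; }
    void detach() { src.reset(); }             // drop the reference to the source
};

struct ToyBank {
    // Old contract: get() evaluates on first use AND detaches the source,
    // so the caller never has to track when the host weights are released.
    static std::vector<float> get_with_effects(ToyLazyTensor& lt) {
        auto device_copy = lt.eval();   // "evaluate and copy to device memory"
        lt.detach();                    // release the host-side source right away
        return device_copy;
    }

    // New contract: get() is a pure lookup; someone else must detach,
    // which is exactly what raises the question above.
    static std::vector<float> get_pure(const ToyLazyTensor& lt) {
        return lt.eval();
    }
};

int main() {
    ToyLazyTensor lt{std::make_shared<std::vector<float>>(16, 1.0f)};
    auto t = ToyBank::get_with_effects(lt);
    assert(!lt.src);     // source dropped as a side effect of the old get()
    (void)t;
    return 0;
}
```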
```cpp
// Also detach from registered tensors
auto it_registered = device_bank.registered_tensors.find(iter_device->second.lt);
```
Why doesn't this happen on evaluate_and_allocate()?
```cpp
// Also detach from registered tensors
auto it_registered = dbank.registered_tensors.find(tensor);
NPUW_ASSERT(it_registered != dbank.registered_tensors.end());
const_cast<LazyTensor&>(it_registered->first).detach();
```
```cpp
return allocated_tensor;
```
What we've had before the changes (see the sketch after this list):
- A LazyTensor was a key to the device tensor (for the transition from lazy_closure to closure)
- Kvcache and prefill models had their own individual LazyTensors (as distinct objects), populated from the original weights independently
- When a compiled model queried a real tensor for the closure, it passed the LazyTensor as a key and the Bank called .detach() in exchange
- That is, the memory referred to twice was detached first in the kvcache and then in the prefill model, so it was finally cleared.
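A hedged sketch of that pre-change flow, with toy stand-in types (names are illustrative, not the real bank API): two models hold distinct LazyTensor-like objects referencing the same weights, and each get()-with-detach drops one reference, so serving both models frees the memory.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Weights = std::vector<float>;

struct LT {                    // stand-in for LazyTensor
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);

    LT kvcache_lt{original};   // the kvcache model's own key object
    LT prefill_lt{original};   // the prefill model's own key object
    original.reset();          // only the two LTs keep the weights alive now

    // Bank.get(kvcache_lt): evaluate, copy to device, detach in exchange.
    kvcache_lt.detach();
    // Bank.get(prefill_lt): the same for the second model.
    prefill_lt.detach();

    // Both references are gone, so the host weights are finally cleared.
    assert(!kvcache_lt.src && !prefill_lt.src);
    return 0;
}
```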
What do you have now:
- A LazyTensor is a key to an ID
- First the kvcache model exchanges its LazyTensors for IDs - this is where the IDs are registered and the LTs are collected inside
- Then the kvcache model evaluates the full bank - this is the point where the LazyTensors should be detached
- After that the prefill model comes with its own LazyTensors and there's a hit, so the same IDs are returned
- Evaluate on the bank essentially does nothing (and, by the way, you didn't optimize this part per the past review)
- We end up with memory leaked as the 2nd set of tensors was not detached.
How does detaching the existing registered tensors in the bank help the problem? I believe this is what you catch in the second get, but in this case your internal storage still refers to the first set of LTs, not the second one - see the sketch below.
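A minimal sketch of that failure mode, again with toy types (the map, the uid scheme, and keying on weight identity via a raw pointer are illustrative simplifications, not the real bank): the bank stores the first model's LT and detaches only that stored copy, while the second model's LT keeps the weights alive.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Weights = std::vector<float>;
struct LT {
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

struct LeakyBank {
    std::map<const Weights*, std::pair<int, LT>> registered;  // LT -> uid
    int next_uid = 0;

    int registerLT(const LT& lt) {
        // On a hit the same uid is returned, but the *incoming* lt is not
        // detached - the caller's copy keeps the weights alive.
        auto [it, inserted] = registered.try_emplace(lt.src.get(), next_uid, lt);
        if (inserted) ++next_uid;
        return it->second.first;
    }

    void evaluate_and_allocate() {
        for (auto& [ptr, entry] : registered) {
            entry.second.detach();  // detaches only the stored (first) LT
        }
    }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);
    LT kv{original}, prefill{original};
    original.reset();

    LeakyBank bank;
    bank.registerLT(kv);            // uid registered, LT collected inside
    bank.evaluate_and_allocate();   // the stored LT is detached here
    kv.detach();                    // the kvcache side drops its copy too

    bank.registerLT(prefill);       // hit: the same uid comes back
    bank.evaluate_and_allocate();   // essentially a no-op now

    assert(prefill.src);            // still alive: the leaked second set
    return 0;
}
```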
The right solution is here (sketched below):
- Detach the LT in the bank once it is evaluated
- On registerLT, detach the incoming LT if it was found in the bank. That's it.
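A hedged sketch of that proposed fix, reusing the same toy types (the map and raw-pointer key are illustrative, not the real bank): detach the stored LT as soon as it is evaluated, and detach the incoming LT on a registerLT hit.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <utility>
#include <vector>

using Weights = std::vector<float>;
struct LT {
    std::shared_ptr<Weights> src;
    void detach() { src.reset(); }
};

struct FixedBank {
    std::map<const Weights*, std::pair<int, LT>> registered;  // LT -> uid
    int next_uid = 0;

    int registerLT(LT& lt) {
        auto [it, inserted] = registered.try_emplace(lt.src.get(), next_uid, lt);
        if (inserted) {
            ++next_uid;
        } else {
            lt.detach();  // hit: the bank already tracks these weights
        }
        return it->second.first;
    }

    void evaluate_and_allocate() {
        for (auto& [ptr, entry] : registered) {
            // ...evaluation and the device copy would happen here...
            entry.second.detach();  // detach once it is evaluated
        }
    }
};

int main() {
    auto original = std::make_shared<Weights>(1024, 0.5f);
    LT kv{original}, prefill{original};
    original.reset();

    FixedBank bank;
    bank.registerLT(kv);
    bank.evaluate_and_allocate();    // stored LT detached right after evaluation
    kv.detach();

    bank.registerLT(prefill);        // hit: the incoming LT is detached at once
    assert(!prefill.src);            // no second set of live references, no leak
    return 0;
}
```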
```cpp
return allocated_tensor;
```
This method has a return value but nobody is using it. I'd drop the body of this function completely - it will be easier to do all of this in the parent loop.
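A purely illustrative sketch of that suggestion (the helper name and loop structure are hypothetical, not the actual NPUW code): with the helper's body folded into the parent loop, there is no unused return value and the detach point is visible in one place.

```cpp
#include <vector>

struct DeviceTensor {};

struct Entry {
    bool evaluated = false;
    DeviceTensor device;
};

// Instead of a helper like
//   DeviceTensor evaluate_and_allocate_one(Entry&);  // return value unused
// the work sits directly in the loop that walks the bank.
void evaluate_and_allocate(std::vector<Entry>& bank) {
    for (auto& e : bank) {
        if (e.evaluated) continue;
        e.device = DeviceTensor{};  // evaluate + allocate on device (stubbed)
        e.evaluated = true;         // detach()/cleanup would also go right here
    }
}

int main() {
    std::vector<Entry> bank(4);
    evaluate_and_allocate(bank);
    return 0;
}
```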
Introduced in #28282