Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs: dangling fence requests can eat up memory in kvs forever #6112

Closed
chu11 opened this issue Jul 17, 2024 · 6 comments
Closed

kvs: dangling fence requests can eat up memory in kvs forever #6112

chu11 opened this issue Jul 17, 2024 · 6 comments
Assignees

Comments

@chu11
Copy link
Member

chu11 commented Jul 17, 2024

while working on #6049 it occurred to me that "dangling" fence requests eat up memory forever. e.g. if only 3 out of 4 fence requests are ever sent, then that transaction and the messages sit in the kvs forever waiting for the 4th fence request to arrive.

  • in the disconnect callback, we could minimally cleanup the message response so that doesn't eat up memory and we don't send out a response to someone that is not waiting for a response

  • there is no mechanism to 'timeout' a fence transaction or cancel it. so it sits there forever

    • if all present waiters disconnect, should we cancel transaction?
    • or have configurable timeout via heartbeats or something like that could cancel it?
@chu11
Copy link
Member Author

chu11 commented Jan 14, 2025

hmmm, was doing some brainstorming / review on this and it's trickier than originally thought

  • fence requests on ranks != 0 get forwarded to rank 0, so we sort of don't "track" fence requests on rank != 0
  • we could be able to track those requests and clean up up when a disconnect is detected
  • but for rank != 0 we actually don't know if the fence is completed or not, that is basically unknown to us until the transaction is completed
  • so if we were to go with a "disconnect" mechanism for solving this, we presumably have to communicate with rank 0 for info on the fence "status"

it certainly makes me think / believe we should have some timeout mechanism on fence requests. perhaps similar and related to #6124 if we go down that route

@chu11
Copy link
Member Author

chu11 commented Jan 28, 2025

A very "simple" solution to this would have been to support a fence-timeout configuration of sorts and simply check if any fences are sitting around where "(current_time - fence_start_time) > fence_timeout" when we get a heartbeat.

However, that would require us to iterate through all kvsroots. Per a prior issue #3633, we really would like to avoid iterating through all the kvs roots regularly. Granted doing this on heartbeats vs prep/check is not as bad, but it's still costly.

One thought is to have a "fence timeout checker" that wakes up once in awhile that isn't tied to heartbeats, lets say every 5 or 10 minutes. It'll iterate through all kvs roots and do this check then. So overall it's not too costly, with the tradeoff that a fence-timeout doesn't mean something will get cleaned up 1 second after the timeout, it could be 5-10 minutes after the timeout.

@garlick
Copy link
Member

garlick commented Jan 28, 2025

At this point maybe we should explore removing the fence code if the only user is the alternate pmi kvs implementation that nobody uses. Same with the barrier module I think?

@chu11
Copy link
Member Author

chu11 commented Jan 28, 2025

At this point maybe we should explore removing the fence code if the only user is the alternate pmi kvs implementation that nobody uses. Same with the barrier module I think?

I did ponder this after working on #6581.

My initial inclination was A) there is code that uses it B) it is minimally "neat" and potentially useful down the road and C) it's possible someone is using it??? (super low odds of course)

But perhaps the odds of its use is low enough, it would simplify a non-trivial amount of code in the KVS and solve some problems.

I guess we could remove it and see what the fallout is???

@chu11
Copy link
Member Author

chu11 commented Jan 29, 2025

Thinking about this a bit more, removing the fence could would clean things up a lot. Minimally, the whole "transaction" internal API exists only b/c of commits vs fences (i.e. a commit is a special fence of just 1 request). A ridiculous amount of code could probably be collapsed. (leading to issue #6570)

@chu11
Copy link
Member Author

chu11 commented Jan 31, 2025

closing, with #6587 removing fence support, this issue is no longer relevant

@chu11 chu11 closed this as completed Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants