kvs: dangling fence requests can eat up memory in kvs forever #6112

chu11 · 2024-07-17T18:26:47Z

while working on #6049 it occurred to me that "dangling" fence requests eat up memory forever. e.g. if only 3 out of 4 fence requests are ever sent, then that transaction and the messages sit in the kvs forever waiting for the 4th fence request to arrive.

in the disconnect callback, we could minimally cleanup the message response so that doesn't eat up memory and we don't send out a response to someone that is not waiting for a response
there is no mechanism to 'timeout' a fence transaction or cancel it. so it sits there forever
- if all present waiters disconnect, should we cancel transaction?
- or have configurable timeout via heartbeats or something like that could cancel it?

chu11 · 2025-01-14T23:32:35Z

hmmm, was doing some brainstorming / review on this and it's trickier than originally thought

fence requests on ranks != 0 get forwarded to rank 0, so we sort of don't "track" fence requests on rank != 0
we could be able to track those requests and clean up up when a disconnect is detected
but for rank != 0 we actually don't know if the fence is completed or not, that is basically unknown to us until the transaction is completed
so if we were to go with a "disconnect" mechanism for solving this, we presumably have to communicate with rank 0 for info on the fence "status"

it certainly makes me think / believe we should have some timeout mechanism on fence requests. perhaps similar and related to #6124 if we go down that route

chu11 · 2025-01-28T21:52:31Z

A very "simple" solution to this would have been to support a fence-timeout configuration of sorts and simply check if any fences are sitting around where "(current_time - fence_start_time) > fence_timeout" when we get a heartbeat.

However, that would require us to iterate through all kvsroots. Per a prior issue #3633, we really would like to avoid iterating through all the kvs roots regularly. Granted doing this on heartbeats vs prep/check is not as bad, but it's still costly.

One thought is to have a "fence timeout checker" that wakes up once in awhile that isn't tied to heartbeats, lets say every 5 or 10 minutes. It'll iterate through all kvs roots and do this check then. So overall it's not too costly, with the tradeoff that a fence-timeout doesn't mean something will get cleaned up 1 second after the timeout, it could be 5-10 minutes after the timeout.

garlick · 2025-01-28T22:21:59Z

At this point maybe we should explore removing the fence code if the only user is the alternate pmi kvs implementation that nobody uses. Same with the barrier module I think?

chu11 · 2025-01-28T23:16:55Z

At this point maybe we should explore removing the fence code if the only user is the alternate pmi kvs implementation that nobody uses. Same with the barrier module I think?

I did ponder this after working on #6581.

My initial inclination was A) there is code that uses it B) it is minimally "neat" and potentially useful down the road and C) it's possible someone is using it??? (super low odds of course)

But perhaps the odds of its use is low enough, it would simplify a non-trivial amount of code in the KVS and solve some problems.

I guess we could remove it and see what the fallout is???

chu11 · 2025-01-29T19:32:18Z

Thinking about this a bit more, removing the fence could would clean things up a lot. Minimally, the whole "transaction" internal API exists only b/c of commits vs fences (i.e. a commit is a special fence of just 1 request). A ridiculous amount of code could probably be collapsed. (leading to issue #6570)

chu11 · 2025-01-31T06:39:04Z

closing, with #6587 removing fence support, this issue is no longer relevant

chu11 mentioned this issue Jul 17, 2024

kvs: return ENOSYS on unfinished requests #6049

Merged

chu11 mentioned this issue Oct 25, 2024

kvs: investigate and prevent potential denial of service #6125

Open

chu11 mentioned this issue Jan 22, 2025

kvs: could use some refactoring #6570

Open

chu11 self-assigned this Jan 27, 2025

This was referenced Jan 29, 2025

kvs: remove fence support #6587

Closed

barrier: remove module #6588

Open

kvs: remove fence support #6592

Merged

chu11 closed this as completed Jan 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvs: dangling fence requests can eat up memory in kvs forever #6112

kvs: dangling fence requests can eat up memory in kvs forever #6112

chu11 commented Jul 17, 2024

chu11 commented Jan 14, 2025

chu11 commented Jan 28, 2025

garlick commented Jan 28, 2025

chu11 commented Jan 28, 2025

chu11 commented Jan 29, 2025 •

edited

Loading

chu11 commented Jan 31, 2025

kvs: dangling fence requests can eat up memory in kvs forever #6112

kvs: dangling fence requests can eat up memory in kvs forever #6112

Comments

chu11 commented Jul 17, 2024

chu11 commented Jan 14, 2025

chu11 commented Jan 28, 2025

garlick commented Jan 28, 2025

chu11 commented Jan 28, 2025

chu11 commented Jan 29, 2025 • edited Loading

chu11 commented Jan 31, 2025

chu11 commented Jan 29, 2025 •

edited

Loading