Trigger GC for actors when they tell the cycle detector they're blocked #3278
Conversation
@dipinhora looks like there is a race condition here that is triggering an assertion that could result in a segfault. Otherwise, the general idea looks good.
@SeanTAllen is there a reason you believe the race condition is a result of the changes in this PR? i see the CI failure with the assertion on shutdown as part of the ...
Is that not a new assertion failure @dipinhora? I don't believe I've seen it before.
@SeanTAllen it is the first time i've seen it also. however, that doesn't mean it is caused by the changes in this PR.
I believe that `view->perceived` is related to the cycle detector and gc which suggests to me that it is related. Perhaps we should run the debug tests repeatedly and see if we can get it to trigger with this change and without?
note: i am not suggesting that the issue exposed by the failed circle ci job not be investigated and fixed. i believe it should be investigated and fixed; just not as part of this pr.
@dipinhora what is the scenario where this helps? I'm wondering about possible downsides to this, in particular the cost of tracing and gc'ing in the scenarios where blocking happens a lot. I think that is an unlikely scenario, but is it more unlikely than whatever the case is that this would help?
If this change were to happen, the GC needs to happen on the actor's current turn. This implementation sets the actor up to GC on its next turn, i.e. after it has unblocked. The heuristic should not be based on messages received since last GC, but rather be a more aggressive form of the "normal" GC heuristic. That is, "collect at 25% growth instead of 50% growth" (just an example).
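For anyone less familiar with the heap heuristic being referenced, here is a minimal sketch of what a growth-factor-based trigger looks like. The `example_*` names and the factor values are illustrative assumptions, not the actual ponyc heap code; they only show the "collect at 25% growth vs 50% growth" idea.

```c
#include <stddef.h>

typedef struct example_heap_t
{
  size_t used;     // bytes currently allocated on this actor's heap
  size_t next_gc;  // threshold at which the next GC should be triggered
} example_heap_t;

// After a collection, set the next trigger point relative to what survived.
// A factor of 1.5 means "collect again after 50% growth"; 1.25 would be the
// more aggressive "collect at 25% growth" variant suggested above.
static void example_heap_endgc(example_heap_t* heap, double factor)
{
  heap->next_gc = (size_t)((double)heap->used * factor);
}

// Checked by the runtime when deciding whether to collect on this turn.
static int example_heap_should_gc(const example_heap_t* heap)
{
  return heap->used >= heap->next_gc;
}
```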
@SeanTAllen see the second half of #3031 (comment) from the original PR that this was broken out from, for an example. Also, I don't understand your comment about ...
@sylvanc thanks for pointing out the bug/oversight about the GC not running during the actor's current turn. I forgot that non-app messages do not cause GC to run.
This PR doesn't have any heuristics involving # of messages processed since last GC. It only relies on three things:
You're likely thinking of the other change in the original PR (#3031) that this was broken out from.
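To make the "non-app messages do not cause GC" point concrete, here is a rough sketch of the kind of end-of-run check being described. It is an assumption-laden illustration (the `example_*` names are invented, not the real runtime internals): only application messages count as work done during the run, so a run that only answered a runtime message, such as a cycle detector query, never reaches the GC check.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
  EXAMPLE_MSG_APP,     // ordinary behaviour call from another actor
  EXAMPLE_MSG_RUNTIME  // e.g. the cycle detector asking "are you blocked?"
} example_msg_kind_t;

typedef struct example_actor_t
{
  size_t heap_used;
  size_t heap_next_gc;
} example_actor_t;

static void example_gc(example_actor_t* actor)
{
  actor->heap_used = 0; // stand-in for tracing the actor and freeing garbage
}

// One scheduling run: handle a message, then maybe collect.
static void example_run(example_actor_t* actor, example_msg_kind_t kind)
{
  bool did_app_work = (kind == EXAMPLE_MSG_APP);

  // ... message handling elided ...

  // The GC check only happens after application work, so a blocked actor
  // that is merely answering runtime messages never collects on this path.
  if(did_app_work && (actor->heap_used >= actor->heap_next_gc))
    example_gc(actor);
}
```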
@dipinhora shall I pick up work on this PR?
Force-pushed from 179ff61 to 4a548dc.
@SeanTAllen rebased/updated with comments to explain what/why. Let me know if you have any questions about the changes or the comments.
@dipinhora do you think this should have a changelog entry and release notes?
Force-pushed from 4a548dc to 717f06e.
@SeanTAllen not sure.. this isn't really a user facing thing.. but it is a runtime behavior change similar in some ways to your recent short lived actor reaping changes.. and i think that has a changelog/release notes so i suppose this should also? 8*/
Not user facing but.. Perhaps of interest. I'm leaning towards no.
works for me
Current CI failures in this PR are related to a bug that is fixed in #3666.
There are a couple of small changes I would like to see, as I noted.
Otherwise, now that I fully understand the changes that you made to the cycle detector last year, Dipin, I think this is a reasonable PR.
I can see some scenarios where this might result in better memory usage.
Reasons I think this is probably a good trade-off between performance and memory usage: the gc is only done IF the actor in question didn't receive an application message. So the only time this should happen is when an actor receives an "is blocked" message from the cycle detector and there were no messages from other actors during the run. This should, I suspect, be rare. It is likely in that case that the actor might hold on to any garbage it has for a long period of time. That could be a large amount of memory. This will be a performance hit in a scenario like:
@dipinhora this raises an interesting gc edge-case. for me at least, is the following true currently (i think it is)?
with your change this problem still exists when the cycle detector is off. and with your change, can this problem still exist because heap used was 0? something like: we have a field that is `(TonsOfMemory | None)` and we set it to `None`, so heap used is still 0 and we don't try to garbage collect `TonsOfMemory`.
I don't believe this will be a performance hit. The performance hit scenario you've outlined is missing an important detail. The following is what would happen in your scenario with the detail added:
I don't believe this will be a performance hit because step 4 is based on how often the CD runs (not very often), along with the fact that the CD doesn't ask all the actors it knows about if they're blocked on every run; it asks them in batches instead. So step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.
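As a purely illustrative aside, the batching behaviour described above could look roughly like the following. This is an assumed sketch (the `example_*` names and the batch size are made up, not the ponyc cycle detector): each detector run only queries a slice of the actors it knows about, which is why any individual actor is asked, and therefore GC'd on that path, relatively rarely.

```c
#include <stddef.h>

#define EXAMPLE_BATCH_SIZE 64

typedef struct example_actor_t example_actor_t;

typedef struct example_detector_t
{
  example_actor_t** actors; // every actor the detector currently knows about
  size_t count;
  size_t cursor;            // where the next batch of queries starts
} example_detector_t;

// Stand-in for sending a runtime "are you blocked?" message to an actor.
static void example_ask_if_blocked(example_actor_t* actor)
{
  (void)actor;
}

// One detector run touches at most EXAMPLE_BATCH_SIZE actors, so even if
// many actors block and unblock regularly, a given actor is queried (and
// thus triggers the extra GC) infrequently.
static void example_detector_run(example_detector_t* d)
{
  size_t end = d->cursor + EXAMPLE_BATCH_SIZE;

  if(end > d->count)
    end = d->count;

  for(size_t i = d->cursor; i < end; i++)
    example_ask_if_blocked(d->actors[i]);

  d->cursor = (end == d->count) ? 0 : end;
}
```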
can you rebase against master to pick up the race condition fix?
Force-pushed from 7611045 to 7bf1c23.
rebased
Retrying fixed the CI failure, so the question is, was the error an existing issue or something introduced by this PR?
I've never seen that error before but it could be an existing issue we happened to see. I'm ok with merging and keeping an eye out for similar problems.
hmm... if it's something you haven't seen before then let's wait on the merge while i pretend to be a computer and try and figure out if i've accidentally introduced a race condition.. and if i still can't think of anything then maybe we go with the merge and monitor approach..
Sounds like a plan
In theory moving the GC to occur before sending the block message ... This is almost ready (i still need to update that first comment)... i'll drop a note whenever that's done.
@SeanTAllen i've updated the first comment.. hopefully it's a good enough summary of the conversation..
@SeanTAllen - does your approval from 2020 still stand or do you want to review again?
Force-pushed from 6946167 to e41ba86.
rebased to resolve conflict
anyone have any ideas why the windows CI build would time out running tests? i don't believe it's related to this change..
The timeouts aren't happening with other PRs so I think it is related.
Considering that there were no timeouts prior to the rebase (all tests passed), and that this PR changes absolutely nothing related to atomics used by the schedulers or the quiescence logic, and only contains a minor change related to triggering an extra GC cycle in certain scenarios, i'm finding it hard to understand how this PR is related to the timeouts.. i'm happy to look deeper if you can point to specific changes in the PR that you think might be suspect, but i'm drawing a blank otherwise..
another datapoint: the windows CI timed out in the last commit for PR #4164 also.
I'm most interested now in why this one PR always exhibits the windows hang. I suspect that whatever the occasional trigger elsewhere is, it always happens with these changes.
Prior to this commit, if an actor blocked, it did not run GC to free any memory it no longer needed. This would result in blocked actors holding on to (potentially lots of) memory unnecessarily. This commit causes GC to be triggered when the cycle detector asks an actor if it is blocked and the actor responds telling the cycle detector that it is blocked, and also when the actor initially tells the cycle detector that it is blocked. This should result in memory held by blocked actors being freed more quickly even if the cycle detector doesn't end up detecting a cycle and reaping the actors.
Co-authored-by: Sean T Allen <[email protected]>
Review pointed out that this is unnecessary because the same logic is already accounted for in `ponyint_heap_startgc`.
Force-pushed from e41ba86 to 88724b4.
Not sure why you think the windows hang is related to this PR. Regardless, i've rebased again to pick up the latest changes.
Because this now fails every time I restart it. It's not occasional.
That sounds like something important that i wasn't aware of (or somehow missed) previously..
Funny.. it passed successfully this time on this PR..
Prior to this commit, if an actor blocked, it did not run GC to free
any memory it no longer needed. This would result in blocked actors
holding on to (potentially lots of) memory unnecessarily.
This commit causes GC to be triggered when the cycle detector asks
an actor if it is blocked and the actor responds telling the cycle
detector that it is blocked. This should result in memory held by
blocked actors being freed more quickly, even if the cycle detector
doesn't end up detecting a cycle and reaping the actors.
This will force a GC for an actor based on the following three things:
The sequence of events for GC'ing when sending a block message to the CD is:
This shouldn't be a performance hit because step 4 is based on how often the CD runs (not very often), along with the fact that the CD doesn't ask all the actors it knows about if they're blocked on every run; it asks them in batches instead, so step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.
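Putting the description above into hypothetical code, the shape of the change is roughly the following. The `example_*` names are assumptions for illustration and the real runtime has more moving parts; the point is simply that the extra collection is gated on the actor having done no application work while still holding memory, and that it happens on the actor's current turn, before the block notification is sent to the cycle detector.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct example_ctx_t example_ctx_t;

typedef struct example_actor_t
{
  size_t heap_used;      // bytes currently held on this actor's heap
  bool app_msg_this_run; // did the actor handle an application message?
} example_actor_t;

static void example_gc(example_ctx_t* ctx, example_actor_t* actor)
{
  (void)ctx;
  actor->heap_used = 0; // stand-in for tracing the actor and freeing garbage
}

// Stand-in for sending the "i am blocked" message to the cycle detector.
static void example_send_block(example_ctx_t* ctx, example_actor_t* actor)
{
  (void)ctx;
  (void)actor;
}

// Called when the actor is about to tell the cycle detector it is blocked,
// either unprompted or in response to an "are you blocked?" query.
static void example_notify_blocked(example_ctx_t* ctx, example_actor_t* actor)
{
  // Only pay for the extra collection when the actor did no application
  // work this run and actually holds memory; otherwise the normal GC
  // heuristic is left alone.
  if(!actor->app_msg_this_run && (actor->heap_used > 0))
    example_gc(ctx, actor);

  // The collection happens on the actor's current turn, before the block
  // message is sent, so a long-blocked actor isn't left sitting on garbage.
  example_send_block(ctx, actor);
}
```

This is only a sketch of the ordering and the gating condition under the assumptions named above, not the actual implementation in the PR.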