
Trigger GC for actors when they tell the cycle detector they're blocked #3278

Merged: 8 commits into ponylang:main on Aug 8, 2022

Conversation

@dipinhora (Contributor) commented Aug 11, 2019

Prior to this commit, if an actor blocked, it did not run GC to free
any memory it no longer needed. This would result in blocked actors
holding on to (potentially lots of) memory unnecessarily.

This commit causes GC to be triggered when the cycle detector asks
an actor if it is blocked and the actor responds that it is blocked.
This should result in memory held by blocked actors being freed more
quickly, even if the cycle detector doesn't end up detecting a cycle
and reaping the actors.

This will force a GC for an actor based on the following three things (see the C sketch after this list):

  • The actor processed at least one message since its last GC (i.e. it did some work, whether a GC acquire/release message or an app message)
  • The actor's heap is greater than 0 (i.e. it has memory that could potentially be freed)
  • The actor is blocked and is about to tell the cycle detector that it is blocked (i.e. it thinks it has no more work to do at the moment)
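
For illustration, the gate could be expressed roughly as the C sketch below. This is a minimal sketch with made-up type and field names (`sketch_actor_t`, `heap_used`, `processed_msg_since_gc`, `is_blocked`); the actual change works with the runtime's own actor and heap structures in src/libponyrt/actor/actor.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the runtime's actor state; the real change in
 * src/libponyrt/actor/actor.c uses the runtime's own actor and heap
 * structures, not these. */
typedef struct sketch_actor_t
{
  size_t heap_used;            /* bytes currently held on the actor's heap */
  bool processed_msg_since_gc; /* did any work (GC or app message) since the last GC? */
  bool is_blocked;             /* the actor is marked blocked internally */
} sketch_actor_t;

/* All three conditions from the list above must hold before forcing a GC
 * right before replying "I'm blocked" to the cycle detector. */
static bool should_force_gc_before_block_msg(const sketch_actor_t* actor)
{
  return actor->processed_msg_since_gc  /* did some work since its last GC */
      && (actor->heap_used > 0)         /* has memory that might be freed */
      && actor->is_blocked;             /* about to tell the CD it's blocked */
}
```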

The sequence of events for GC'ing when sending a block message to the cycle detector (CD) is (the ordering is sketched in code after this list):

  1. actor gets a message from another actor
  2. gets rescheduled because it processed an application message
  3. next run has an empty queue (and the actor gets marked internally as blocked but doesn't send a block message to the CD)
  4. some time passes and the CD eventually asks the actor if it is blocked
  5. the actor garbage collects because of this change before sending the block message to the CD (to prevent race conditions)
  6. the actor responds to the CD by sending a block message
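
In code, steps 5 and 6 amount to "collect first, then send the block message". The sketch below continues the stand-in type from the previous sketch; `run_gc` and `send_block_to_cycle_detector` are hypothetical placeholders, not the runtime's real function names.

```c
/* Continues the stand-in sketch_actor_t and should_force_gc_before_block_msg()
 * from the sketch above. */
static void run_gc(sketch_actor_t* actor)
{
  (void)actor; /* stand-in: the real code traces and collects the actor's heap */
}

static void send_block_to_cycle_detector(sketch_actor_t* actor)
{
  (void)actor; /* stand-in: the real code sends a block message to the CD */
}

/* Handle the cycle detector's "are you blocked?" query (steps 4-6 above). */
static void handle_is_blocked_query(sketch_actor_t* actor)
{
  if(!actor->is_blocked)
    return; /* not blocked: nothing to report, no forced GC */

  /* Step 5: collect first, on the actor's current turn, so nothing has been
   * sent to the cycle detector yet while the actor is still GC'ing. */
  if(should_force_gc_before_block_msg(actor))
    run_gc(actor);

  /* Step 6: only then reply that the actor is blocked. */
  send_block_to_cycle_detector(actor);
}
```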

This shouldn't be a performance hit: step 4 depends on how often the CD runs (not very often), and the CD doesn't ask all actors it knows about whether they're blocked on every run; it asks them in batches. As a result, step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.

@SeanTAllen (Member)

@dipinhora looks like there is a race condition here that is triggering an assertion failure that could result in a segfault. Otherwise, the general idea looks good.

@dipinhora (Contributor, Author)

@SeanTAllen is there a reason you believe the race condition is a result of the changes in this PR? i see the CI failure with the assertion on shutdown as part of the alpine-lib-llvm-debug-static CircleCI job, but nothing in this PR touches any of that code or impacts it meaningfully in any way.

@SeanTAllen (Member)

Is that not a new assertion failure @dipinhora? I don't believe I've seen it before.

@dipinhora (Contributor, Author)

@SeanTAllen it is the first time i've seen it also. however, that doesn't mean it is caused by the changes in this PR.

@SeanTAllen (Member)

I believe that view->perceived is related to the cycle detector and gc which suggests to me that it is related. Perhaps we should run the debug tests repeatedly and see if we can get it to trigger with this change and without?

@dipinhora (Contributor, Author)

view->perceived has to do with the internal workings of the cycle detector. the cycle detector has nothing to do with actor gc. this pr is only tangentially related to the cycle detector. it does not modify any logic within the cycle detector. nor does it introduce a new message type for the cycle detector and actors to deal with. it causes gc to be triggered for an actor when a specific message type is received by an actor (which happens to come from the cycle detector). while this pr may have exposed the race condition that caused the assertion to fail, it is not the cause of it.

note: i am not suggesting that the issue exposed by the failed circle ci job not be investigated and fixed. i believe it should be investigated and fixed; just not as part of this pr.

@SeanTAllen (Member)

@dipinhora what is the scenario where this helps? I'm wondering about possible downsides to this, in particular the cost of tracing and GC'ing in the scenarios where blocking happens a lot. I think that is an unlikely scenario, but is it more unlikely than the case where this would help?

@sylvanc (Contributor) commented Aug 27, 2019

If this change were to happen, the GC needs to happen on the actor's current turn. This implementation sets the actor up to GC on its next turn, i.e. after it has unblocked.

The heuristic should not be based on messages received since last GC, but rather a more aggressive form of the "normal" GC heuristic. That is, "collect at 25% growth instead of 50% growth" (just an example).
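
For illustration only, a more aggressive form of the growth heuristic could look like the sketch below. The names (`should_gc_on_growth`, `used_at_last_gc`) are made up for the example, and the 25%/50% factors are just the example numbers from the comment above; the runtime's actual heuristic lives in the heap code and uses its own fields and constants.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative growth-based trigger: collect once the heap has grown by some
 * factor since the last GC, using a smaller factor when the actor is blocked.
 * The 25%/50% figures are only the example numbers from the comment above,
 * not the runtime's actual constants. */
static bool should_gc_on_growth(size_t heap_used, size_t used_at_last_gc,
  bool blocked)
{
  double growth = blocked ? 1.25 : 1.50; /* more aggressive when blocked */
  return (double)heap_used >= ((double)used_at_last_gc * growth);
}
```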

@dipinhora (Contributor, Author)

@SeanTAllen see the second half of #3031 (comment) from the original PR where this was broken out from for an example.

Also, I don't understand your comment about the scenarios where blocking happens a lot. It is possible you're thinking of actors sending messages to the cycle detector as soon as they block, which is how things used to work until #2709 was merged a little over a year ago. Now, actors only send block messages to the cycle detector when the cycle detector asks them if they're blocked, and the cycle detector only asks actors this if it doesn't think they are blocked, on a periodic basis controlled by --ponycdinterval (which defaults to once every 100 ms). This PR would only trigger GC in those cases when the cycle detector asks an actor if it is blocked and the actor sends a "yes. i'm blocked" message to the cycle detector.

@dipinhora (Contributor, Author)

@sylvanc thanks for pointing out the bug/oversight about the GC not running during the actor's current turn. I forgot that non-app messages do not cause try_gc to be called and so a request from the cycle detector asking a blocked actor if it is blocked would not cause GC to run. I'll fix this shortly.

The heuristic should not be based on messages received since last GC, but rather a more aggressive form of the "normal" GC heuristic. That is, "collect at 25% growth instead of 50% growth" (just an example).

This PR doesn't have any heuristics involving # of messages processed since last GC. It only relies on three things:

  • The actor processed at least one application message since its last GC (i.e. it did some work)
  • The actor's heap is greater than 0 (i.e. it has memory that could potentially be freed)
  • The actor is blocked and is about to tell the cycle detector that it is blocked (i.e. it thinks it has no more work to do at the moment)

You're likely thinking of the other change from the original PR (#3031) where this was broken out from.

@SeanTAllen (Member)

@dipinhora shall I pick up work on this PR?

@dipinhora (Contributor, Author)

@SeanTAllen rebased/updated with comments to explain what/why. Let me know if you have any questions about the changes or the comments.

@SeanTAllen (Member)

@dipinhora do you think this should have a changelog entry and release notes?

@dipinhora (Contributor, Author)

@SeanTAllen not sure.. this isn't really a user facing thing.. but it is a runtime behavior change similar in some ways to your recent short lived actor reaping changes.. and i think that has a changelog/release notes so i suppose this should also? 8*/

@SeanTAllen (Member)

Not user facing but..

Perhaps of interest.

I'm leaning towards no

@dipinhora (Contributor, Author)

works for me

@SeanTAllen (Member)

Current CI failures in this PR are related to a bug that is fixed in #3666.

@SeanTAllen (Member) left a review comment

There are a couple of small changes I would like to see to this, as I noted.

Otherwise, now that I fully understand the changes that you made to the cycle detector last year Dipin, I think this is a reasonable PR.

I can see some scenarios where this might result in better memory usage.

Review threads on src/libponyrt/actor/actor.c (resolved)
@SeanTAllen (Member)

Reasons I think this is probably a good trade off between performance and memory usage.

The GC is only done IF the actor in question didn't receive an application message. So the only time this should happen is when an actor receives an "is blocked" message from the cycle detector and there were no messages from other actors during the run.

This should, I suspect, be rare. It is likely in that case that the actor might hold on to any garbage it has for a long period of time. That could be a large amount of memory.

This will be a performance hit in a scenario like:

  • actor gets a message from another actor
  • gets rescheduled because it processed an application message
  • next run has an empty queue
  • garbage collects because of this change

@SeanTAllen (Member)

@dipinhora this raises an interesting GC edge case, for me at least. Is the following true currently? (I think it is.)

  • an actor allocates say 1 gig of memory. it is using it all, gc is run, it still holds 1 gig.
  • actor receives a message telling it that it isn't needed anymore.
  • actor doesn't allocate enough memory to push heap_used past next_gc
  • actor holds on to unneeded memory for potentially a very long time

with your change this problem still exists when the cycle detector is off. and with your change, can this problem still exist because heap used was 0?

something like we have a field that is (TonsOfMemory | None) and we set to None, so heap used is still 0 and we don't try to garbage collect "TonsOfMemory".

@dipinhora (Contributor, Author)

Reasons I think this is probably a good trade off between performance and memory usage.

The GC is only done IF the actor in question didn't receive an application message. So the only time this should happen is when an actor receives an "is blocked" message from the cycle detector and there were no messages from other actors during the run.

This should, I suspect, be rare. It is likely in that case that the actor might hold on to any garbage it has for a long period of time. That could be a large amount of memory.

This will be a performance hit in a scenario like:

  • actor gets a message from another actor
  • gets rescheduled because it processed an application message
  • next run has an empty queue
  • garbage collects because of this change

I don't believe this will be a performance hit. The performance hit scenario you've outlined is missing an important detail. The following is what would happen in your scenario with the detail added:

The scenario would be:

  1. actor gets a message from another actor
  2. gets rescheduled because it processed an application message
  3. next run has an empty queue (and the actor gets marked internally as blocked but doesn't send a block message to the CD)
  4. some time passes and the CD eventually asks the actor if it is blocked
  5. the actor responds to the CD by sending a block message
  6. the actor garbage collects because of this change after sending the block message to the CD

I don't believe this will be a performance hit because step 4 depends on how often the CD runs (not very often), and the CD doesn't ask all actors it knows about whether they're blocked on every run; it asks them in batches. So step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.

@SeanTAllen (Member)

can you rebase against master to pick up the race condition fix?

@dipinhora (Contributor, Author)

rebased

@dipinhora (Contributor, Author)

Retrying fixed the FreeBSD CI failure..

so, the question is, was the segfault a result of the changes in this PR or not? any thoughts?

@SeanTAllen (Member)

I've never seen that error before but it could be an existing issue we happened to see. I'm ok with merging and keeping an eye out for similar problems.

@dipinhora (Contributor, Author)

hmm... if it's something you haven't seen before then let's wait on the merge while i pretend to be a computer and try and figure out if i've accidentally introduced a race condition.. and if i still can't think of anything then maybe we go with the merge and monitor approach..

@SeanTAllen (Member)

Sounds like a plan

@dipinhora (Contributor, Author)

In theory moving the GC to occur before sending the block message should prevent race conditions that would exist if GC is done after sending a block message.

This is almost ready (i still need to update that first comment)... i'll drop a note whenever that's done.

@dipinhora (Contributor, Author)

@SeanTAllen i've updated the first comment.. hopefully it's a good enough summary of the conversation..

@jemc (Member) commented Aug 2, 2022

@SeanTAllen - does your approval from 2020 still stand or do you want to review again?

@dipinhora (Contributor, Author)

rebased to resolve conflict

@dipinhora (Contributor, Author)

anyone have any ideas why the windows CI build would time out running tests? i don't believe it's related to this change..

@SeanTAllen (Member)

The timeouts aren't happening with other PRs so I think it is related.

@dipinhora (Contributor, Author)

Considering that there were no timeouts prior to the rebase (all tests passed), and that this PR changes absolutely nothing related to atomics used by the schedulers or the quiescence logic, and only contains a minor change related to triggering an extra GC cycle in certain scenarios, i'm finding it hard to understand how this PR is related to the timeouts.. i'm happy to look deeper if you can point to specific changes in the PR that you think might be suspect but i'm drawing a blank otherwise..

@dipinhora (Contributor, Author)

another datapoint: the windows CI timed out in the last commit for PR #4164 also.

@SeanTAllen (Member)

I'm most interested now in why this one PR always exhibits the windows hang. I suspect that whatever the occasional trigger elsewhere, it always happens with these changes.

dipinhora and others added 8 commits August 7, 2022 17:57

Commit message: Prior to this commit, if an actor blocked, it did not run GC to free any memory it no longer needed. This would result in blocked actors holding on to (potentially lots of) memory unnecessarily. This commit causes GC to be triggered when the cycle detector asks an actor if it is blocked and the actor responds that it is blocked, and also when the actor initially tells the cycle detector that it is blocked. This should result in memory held by blocked actors being freed more quickly even if the cycle detector doesn't end up detecting a cycle and reaping the actors.

Commit message: Review pointed out that this is unnecessary because the same logic is already accounted for in `ponyint_heap_startgc`.
@dipinhora (Contributor, Author)

I'm most interested now in why this one PR always exhibits the windows hang. I suspect that whatever the occasional trigger elsewhere, it always happens with these changes.

Not sure why you think this one PR always exhibits the Windows hang when, as i stated in my comment yesterday, there were no timeouts prior to the rebase (all tests passed) (maybe you're getting this PR confused with #4151 where we saw many Windows hangs?).. Additionally, the issue that this PR exhibited previously was a FreeBSD crash (nothing Windows related), which was resolved..

Regardless, i've rebased again to pick up latest main changes.. Let's roll the dice and see if the Windows hang occurs again or not.

@SeanTAllen (Member)

Because this now fails every time I restart it. It's not occasional.

@dipinhora (Contributor, Author)

every time I restart it

That sounds like something important that i wasn't aware of (or somehow missed) previously..

@dipinhora (Contributor, Author)

looks like the Windows CI timeout issue isn't limited to this PR or #4164.. it just happened on #3282 also.

@dipinhora (Contributor, Author)

Funny.. it passed successfully this time on this PR..

@SeanTAllen merged commit e38717d into ponylang:main on Aug 8, 2022
@ponylang-main removed the "discuss during sync" (Should be discussed during an upcoming sync) label on Aug 8, 2022