
Trigger GC for actors when they tell the cycle detector they're blocked #3278

Merged: 8 commits into ponylang:main on Aug 8, 2022

Conversation

@dipinhora (Contributor) commented Aug 11, 2019

Prior to this commit, if an actor blocked, it did not run GC to free
any memory it no longer needed. This would result in blocked actors
holding on to (potentially lots of) memory unnecessarily.

This commit causes GC to be triggered when the cycle detector asks
an actor if it is blocked and the actor responds that it is blocked.
This should result in memory held by blocked actors being freed more
quickly, even if the cycle detector doesn't end up detecting a cycle
and reaping the actors.

This will force a GC for an actor based on the following three things (see the C sketch after this list):

  • The actor processed at least one message since its last GC (i.e. it did some work, whether a GC acquire/release message or an app message)
  • The actor's heap is greater than 0 (i.e. it has memory that could potentially be freed)
  • The actor is blocked and is about to tell the cycle detector that it is blocked (i.e. it thinks it has no more work to do at the moment)
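
For illustration, the gate could be expressed roughly as the C sketch below. This is a minimal sketch with made-up type and field names (`sketch_actor_t`, `heap_used`, `processed_msg_since_gc`, `is_blocked`); the actual change works with the runtime's own actor and heap structures in src/libponyrt/actor/actor.c.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the runtime's actor state; the real change in
 * src/libponyrt/actor/actor.c uses the runtime's own actor and heap
 * structures, not these. */
typedef struct sketch_actor_t
{
  size_t heap_used;            /* bytes currently held on the actor's heap */
  bool processed_msg_since_gc; /* did any work (GC or app message) since the last GC? */
  bool is_blocked;             /* the actor is marked blocked internally */
} sketch_actor_t;

/* All three conditions from the list above must hold before forcing a GC
 * right before replying "I'm blocked" to the cycle detector. */
static bool should_force_gc_before_block_msg(const sketch_actor_t* actor)
{
  return actor->processed_msg_since_gc  /* did some work since its last GC */
      && (actor->heap_used > 0)         /* has memory that might be freed */
      && actor->is_blocked;             /* about to tell the CD it's blocked */
}
```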

The sequence of events for GC'ing when sending a block message to the cycle detector (CD) is (the ordering is sketched in code after this list):

  1. actor gets a message from another actor
  2. gets rescheduled because it processed an application message
  3. next run has an empty queue (and the actor gets marked internally as blocked but doesn't send a block message to the CD)
  4. some time passes and the CD eventually asks the actor if it is blocked
  5. the actor garbage collects because of this change before sending the block message to the CD (to prevent race conditions)
  6. the actor responds to the CD by sending a block message
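
In code, steps 5 and 6 amount to "collect first, then send the block message". The sketch below continues the stand-in type from the previous sketch; `run_gc` and `send_block_to_cycle_detector` are hypothetical placeholders, not the runtime's real function names.

```c
/* Continues the stand-in sketch_actor_t and should_force_gc_before_block_msg()
 * from the sketch above. */
static void run_gc(sketch_actor_t* actor)
{
  (void)actor; /* stand-in: the real code traces and collects the actor's heap */
}

static void send_block_to_cycle_detector(sketch_actor_t* actor)
{
  (void)actor; /* stand-in: the real code sends a block message to the CD */
}

/* Handle the cycle detector's "are you blocked?" query (steps 4-6 above). */
static void handle_is_blocked_query(sketch_actor_t* actor)
{
  if(!actor->is_blocked)
    return; /* not blocked: nothing to report, no forced GC */

  /* Step 5: collect first, on the actor's current turn, so nothing has been
   * sent to the cycle detector yet while the actor is still GC'ing. */
  if(should_force_gc_before_block_msg(actor))
    run_gc(actor);

  /* Step 6: only then reply that the actor is blocked. */
  send_block_to_cycle_detector(actor);
}
```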

This shouldn't be a performance hit: step 4 depends on how often the CD runs (not very often), and the CD doesn't ask all actors it knows about whether they're blocked on every run; it asks them in batches. As a result, step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.

@SeanTAllen (Member)

@dipinhora looks like there is a race condition here that is triggering an assertion failure that could result in a segfault. Otherwise, the general idea looks good.

@dipinhora (Contributor, Author)

@SeanTAllen is there a reason you believe the race condition is a result of the changes in this PR? i see the CI failure with the assertion on shutdown as part of the alpine-lib-llvm-debug-static CircleCI job, but nothing in this PR touches any of that code or impacts it meaningfully in any way.

@SeanTAllen (Member)

Is that not a new assertion failure @dipinhora? I don't believe I've seen it before.

@dipinhora (Contributor, Author)

@SeanTAllen it is the first time i've seen it also. however, that doesn't mean it is caused by the changes in this PR.

@SeanTAllen (Member)

I believe that view->perceived is related to the cycle detector and gc which suggests to me that it is related. Perhaps we should run the debug tests repeatedly and see if we can get it to trigger with this change and without?

@dipinhora (Contributor, Author)

view->perceived has to do with the internal workings of the cycle detector. the cycle detector has nothing to do with actor gc. this pr is only tangentially related to the cycle detector. it does not modify any logic within the cycle detector. nor does it introduce a new message type for the cycle detector and actors to deal with. it causes gc to be triggered for an actor when a specific message type is received by an actor (which happens to come from the cycle detector). while this pr may have exposed the race condition that caused the assertion to fail, it is not the cause of it.

note: i am not suggesting that the issue exposed by the failed circle ci job not be investigated and fixed. i believe it should be investigated and fixed; just not as part of this pr.

@SeanTAllen (Member)

@dipinhora what is the scenario where this helps? I'm wondering about possible downsides to this, in particular the cost of tracing and GC'ing in the scenarios where blocking happens a lot. I think that is an unlikely scenario, but is it more unlikely than the case where this would help?

@sylvanc (Contributor) commented Aug 27, 2019

If this change were to happen, the GC needs to happen on the actor's current turn. This implementation sets the actor up to GC on its next turn, i.e. after it has unblocked.

The heuristic should not be based on messages received since last GC, but rather a more aggressive form of the "normal" GC heuristic. That is, "collect at 25% growth instead of 50% growth" (just an example).
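
For illustration only, a more aggressive form of the growth heuristic could look like the sketch below. The names (`should_gc_on_growth`, `used_at_last_gc`) are made up for the example, and the 25%/50% factors are just the example numbers from the comment above; the runtime's actual heuristic lives in the heap code and uses its own fields and constants.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative growth-based trigger: collect once the heap has grown by some
 * factor since the last GC, using a smaller factor when the actor is blocked.
 * The 25%/50% figures are only the example numbers from the comment above,
 * not the runtime's actual constants. */
static bool should_gc_on_growth(size_t heap_used, size_t used_at_last_gc,
  bool blocked)
{
  double growth = blocked ? 1.25 : 1.50; /* more aggressive when blocked */
  return (double)heap_used >= ((double)used_at_last_gc * growth);
}
```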

@dipinhora (Contributor, Author)

@SeanTAllen see the second half of #3031 (comment) from the original PR where this was broken out from for an example.

Also, I don't understand your comment about the scenarios where blocking happens a lot. It is possible you're thinking of actors sending messages to the cycle detector as soon as they block, which is how things used to work until #2709 was merged a little over a year ago. Now, actors only send block messages to the cycle detector when the cycle detector asks them if they're blocked, and the cycle detector only asks actors this if it doesn't think they are blocked, on a periodic basis controlled by --ponycdinterval (which defaults to once every 100 ms). This PR would only trigger GC in those cases when the cycle detector asks an actor if it is blocked and the actor sends a "yes. i'm blocked" message to the cycle detector.

@dipinhora (Contributor, Author)

@sylvanc thanks for pointing out the bug/oversight about the GC not running during the actor's current turn. I forgot that non-app messages do not cause try_gc to be called and so a request from the cycle detector asking a blocked actor if it is blocked would not cause GC to run. I'll fix this shortly.

The heuristic should not be based on messages received since last GC, but rather a more aggressive form of the "normal" GC heuristic. That is, "collect at 25% growth instead of 50% growth" (just an example).

This PR doesn't have any heuristics involving # of messages processed since last GC. It only relies on three things:

  • The actor processed at least one application message since its last GC (i.e. it did some work)
  • The actor's heap is greater than 0 (i.e. it has memory that could potentially be freed)
  • The actor is blocked and is about to tell the cycle detector that it is blocked (i.e. it thinks it has no more work to do at the moment)

You're likely thinking of the other change from the original PR (#3031) where this was broken out from.

@SeanTAllen (Member)

@dipinhora shall I pick up work on this PR?

@dipinhora (Contributor, Author)

@SeanTAllen rebased/updated with comments to explain what/why. Let me know if you have any questions about the changes or the comments.

@SeanTAllen (Member)

@dipinhora do you think this should have a changelog entry and release notes?

@dipinhora (Contributor, Author)

@SeanTAllen not sure.. this isn't really a user facing thing.. but it is a runtime behavior change similar in some ways to your recent short lived actor reaping changes.. and i think that has a changelog/release notes so i suppose this should also? 8*/

@SeanTAllen (Member)

Not user facing but..

Perhaps of interest.

I'm leaning towards no

@dipinhora (Contributor, Author)

works for me

@SeanTAllen (Member)

Current CI failures in this PR are related to a bug that is fixed in #3666.

@SeanTAllen (Member) left a review comment

There are a couple of small changes I would like to see to this, as I noted.

Otherwise, now that I fully understand the changes that you made to the cycle detector last year Dipin, I think this is a reasonable PR.

I can see some scenarios where this might result in better memory usage.

Review threads on src/libponyrt/actor/actor.c (resolved)
@SeanTAllen (Member)

Reasons I think this is probably a good trade off between performance and memory usage.

The GC is only done IF the actor in question didn't receive an application message. So the only time this should happen is when an actor receives an "is blocked" message from the cycle detector and there were no messages from other actors during the run.

This should, I suspect, be rare. It is likely in that case that the actor might hold on to any garbage it has for a long period of time. That could be a large amount of memory.

This will be a performance hit in a scenario like:

  • actor gets a message from another actor
  • gets rescheduled because it processed an application message
  • next run has an empty queue
  • garbage collects because of this change

@SeanTAllen (Member)

@dipinhora this raises an interesting GC edge case, for me at least. Is the following true currently? (I think it is.)

  • an actor allocates say 1 gig of memory. it is using it all, gc is run, it still holds 1 gig.
  • actor receives a message telling it that it isn't needed anymore.
  • actor doesn't allocate enough memory to push heap_used past next_gc
  • actor holds on to unneeded memory for potentially a very long time

with your change this problem still exists when the cycle detector is off. and with your change, can this problem still exist because heap used was 0?

something like we have a field that is (TonsOfMemory | None) and we set to None, so heap used is still 0 and we don't try to garbage collect "TonsOfMemory".

@dipinhora (Contributor, Author)

Reasons I think this is probably a good trade off between performance and memory usage.

The GC is only done IF the actor in question didn't receive an application message. So the only time this should happen is when an actor receives an "is blocked" message from the cycle detector and there were no messages from other actors during the run.

This should, I suspect, be rare. It is likely in that case that the actor might hold on to any garbage it has for a long period of time. That could be a large amount of memory.

This will be a performance hit in a scenario like:

  • actor gets a message from another actor
  • gets rescheduled because it processed an application message
  • next run has an empty queue
  • garbage collects because of this change

I don't believe this will be a performance hit. The performance hit scenario you've outlined is missing an important detail. The following is what would happen in your scenario with the detail added:

The scenario would be:

  1. actor gets a message from another actor
  2. gets rescheduled because it processed an application message
  3. next run has an empty queue (and the actor gets marked internally as blocked but doesn't send a block message to the CD)
  4. some time passes and the CD eventually asks the actor if it is blocked
  5. the actor responds to the CD by sending a block message
  6. the actor garbage collects because of this change after sending the block message to the CD

I don't believe this will be a performance hit because step 4 depends on how often the CD runs (not very often), and the CD doesn't ask all actors it knows about whether they're blocked on every run; it asks them in batches. So step 4 will not occur very frequently for any actor even if steps 1 - 3 happen regularly.

@SeanTAllen (Member)

can you rebase against master to pick up the race condition fix?

@dipinhora (Contributor, Author)

rebased

@dipinhora (Contributor, Author)

Retrying fixed the FreeBSD CI failure..

so, the question is, was the segfault a result of the changes in this PR or not? any thoughts?

@SeanTAllen (Member)

I've never seen that error before but it could be an existing issue we happened to see. I'm ok with merging and keeping an eye out for similar problems.

@dipinhora (Contributor, Author)

hmm... if it's something you haven't seen before then let's wait on the merge while i pretend to be a computer and try and figure out if i've accidentally introduced a race condition.. and if i still can't think of anything then maybe we go with the merge and monitor approach..

@SeanTAllen (Member)

Sounds like a plan

@dipinhora (Contributor, Author)

In theory moving the GC to occur before sending the block message should prevent race conditions that would exist if GC is done after sending a block message.

This is almost ready (i still need to update that first comment)... i'll drop a note whenever that's done.

@dipinhora (Contributor, Author)

@SeanTAllen i've updated the first comment.. hopefully it's a good enough summary of the conversation..

@jemc (Member) commented Aug 2, 2022

@SeanTAllen - does your approval from 2020 still stand or do you want to review again?

@dipinhora (Contributor, Author)

rebased to resolve conflict

@dipinhora (Contributor, Author)

anyone have any ideas why the windows CI build would time out running tests? i don't believe it's related to this change..

@SeanTAllen (Member)

The timeouts aren't happening with other PRs so I think it is related.

@dipinhora (Contributor, Author)

Considering that there were no timeouts prior to the rebase (all tests passed), and that this PR changes absolutely nothing related to atomics used by the schedulers or the quiescence logic, and only contains a minor change related to triggering an extra GC cycle in certain scenarios, i'm finding it hard to understand how this PR is related to the timeouts.. i'm happy to look deeper if you can point to specific changes in the PR that you think might be suspect but i'm drawing a blank otherwise..

@dipinhora (Contributor, Author)

another datapoint: the windows CI timed out in the last commit for PR #4164 also.

@SeanTAllen (Member)

I'm most interested now in why this one PR always exhibits the windows hang. I suspect that whatever the occasional trigger elsewhere, it always happens with these changes.

dipinhora and others added 8 commits August 7, 2022 17:57

Commit message: Prior to this commit, if an actor blocked, it did not run GC to free any memory it no longer needed. This would result in blocked actors holding on to (potentially lots of) memory unnecessarily. This commit causes GC to be triggered when the cycle detector asks an actor if it is blocked and the actor responds that it is blocked, and also when the actor initially tells the cycle detector that it is blocked. This should result in memory held by blocked actors being freed more quickly even if the cycle detector doesn't end up detecting a cycle and reaping the actors.

Commit message: Review pointed out that this is unnecessary because the same logic is already accounted for in `ponyint_heap_startgc`.
@dipinhora (Contributor, Author)

I'm most interested now in why this one PR always exhibits the windows hang. I suspect that whatever the occasional trigger elsewhere, it always happens with these changes.

Not sure why you think this one PR always exhibits the Windows hang when, as i stated in my comment yesterday, there were no timeouts prior to the rebase (all tests passed) (maybe you're getting this PR confused with #4151 where we saw many Windows hangs?).. Additionally, the issue that this PR exhibited previously was a FreeBSD crash (nothing Windows related), which was resolved..

Regardless, i've rebased again to pick up latest main changes.. Let's roll the dice and see if the Windows hang occurs again or not.

@SeanTAllen (Member)

Because this now fails every time I restart it. It's not occasional.

@dipinhora (Contributor, Author)

every time I restart it

That sounds like something important that i wasn't aware of (or somehow missed) previously..

@dipinhora (Contributor, Author)

looks like the Windows CI timeout issue isn't limited to this PR or #4164.. it just happened on #3282 also.

@dipinhora (Contributor, Author)

Funny.. it passed successfully this time on this PR..

@SeanTAllen merged commit e38717d into ponylang:main on Aug 8, 2022
@ponylang-main removed the "discuss during sync" (Should be discussed during an upcoming sync) label on Aug 8, 2022