-
-
Notifications
You must be signed in to change notification settings - Fork 725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track Event Loop intervals in dashboard plot #5964
Conversation
mrocklin
commented
Mar 19, 2022
•
edited
Loading
edited
Unit Test Results 12 files ±0 12 suites ±0 6h 15m 11s ⏱️ + 6m 53s For more details on these failures, see this check. Results for commit 55e8090. ± Comparison against base commit 5c7d555. ♻️ This comment has been updated with latest results. |
cc @graingert if you have time to take a quick look |
@mrocklin can you add the corresponding entry to this doc-page please https://github.com/dask/distributed/blob/main/docs/source/http_services.rst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I pulled down the branch and tested it out. Worked as expected on a handful of the notebook in the dask-tutorial and dask-exeamples repos
Merging in 24 hours if there are not further comments |
The tick counter itself seems problematic tracked here #5908 I'd rather not build anything on top of it until we've determined a replacement for it |
If it's problematic for tests (I'd like to learn more about how) then let's fix it in tests (#5977 is one simple attempt at this). I'm open to replacements of this structure if anyone has ideas. If no one has ideas then I'm inclined to push forward though. This is important to track system health. First things first though, any suggestions on alternatives? I'd love to find something better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am skeptical that when the event loop is blocked, bokeh will actually be able to send updates to the dashboard showing that this is happening (i.e. the chart may either be flickering around 0.02, or frozen at 0.02, but not go above 0.02 except for a brief, instantaneous high flicker right after the loop un-blocks). I haven't tested this—could you confirm that (time.sleep
in an async client.run_on_scheduler
, etc.)?
If the dashboard doesn't adequately show this, maybe we should change the visualization to something that would show the delay post-hoc (a timeseries of loop intervals?). Ideally, we'd have the plot increment itself client-side (JS), and then zero itself back out when it receives an update from the scheduler.
measure_tick_interval = parse_timedelta( | ||
self._tick_counter = 0 | ||
self._tick_count = 0 | ||
self._tick_count_last = time() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use a monotonic timer here #4528
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently use metrics.time because of windows support. If we want to switch to something else that's better I'm all in favor, but I hesitate to launch into this immediately. Instead I'd rather that we kept with the standard way of doing things, and then considered switching out to monotonic en-masse. Thoughts?
distributed/core.py
Outdated
pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000) | ||
self.periodic_callbacks["tick"] = pc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000) | |
self.periodic_callbacks["tick"] = pc | |
self.periodic_callbacks["tick"] = PeriodicCallback(self._measure_tick, self._tick_interval * 1000) |
pc
is unused otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done. I also did this for the lines above.
@@ -97,6 +98,7 @@ | |||
"/individual-compute-time-per-key": individual_doc(ComputePerKey, 500), | |||
"/individual-aggregate-time-per-action": individual_doc(AggregateAction, 500), | |||
"/individual-scheduler-system": individual_doc(SystemMonitor, 500), | |||
"/individual-event-loop": individual_doc(EventLoop, 500), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the interval for this plot correspond to the tick cycle? Otherwise it will run unnecessarily often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty cheap. I don't think that this matters much.
toolbar_location="above", | ||
**kwargs, | ||
) | ||
self.root.vbar(x="names", top="values", width=0.9, source=self.source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally I'd like to see color as well (yellow if the interval is over a certain threshold, like the worker memory plot). I think 0.1s would be a reasonable threshold; this is asyncio's default slow_callback_duration
when debug mode is turned on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the ideal. However, I'd like to get this in and focus elsewhere for now if possible.
You're right to be skeptical, when the scheduler event loop is blocked then certainly the dashboard won't update. However we will get the signal afterwards. We track the amount of time since the last time we cycled things, and use that. For workers that are missing I'm actually considering integrating this value into the other bar charts, like memory. We have a |
I'm worried the signal will be nearly immediately overridden by the next tick measurement cycle. What if the scheduler event loop is blocked for 10s straight (as it is most times I'm really excited to have visibility into this, but I'm worried that a visualization that's biased towards only showing healthy event loop times is even worse than no indicator at all. |
In practice this thing is super valuable. I'd encourage folks to try it out before raising too many concerns.
Again, in practice I'm getting good signal from this thing. At one point I started doing an EWMA on it with some lighter transparency bar. I think that that would give you what you're looking for. A timeseries would also be good. All of these ideas are good. I've got what I need here though and I'm focused on other things. I think that this is a strict improvement over current state, doesn't add significant complexity (I think), and so I think that it should go in. If you're saying "Yes, this is cool, in the future it would be even cooler if it did X, Y, and Z" then I'm entirely aligned. If you're saying "This is cool, but to go in it needs to be cooler and do X, Y, and Z" then I think that that would engage a conversation about prioritization, and whether or not X, Y, and Z are the highest priority things at the moment. If you're saying "This isn't cool, and is a degredation on current behavior" then let's pause and chat about it. My experience using this is that it's very helpful. It's not perfect, but I don't think that that's necessary. |
I'm saying "I'd like to see a GIF of what a single 3sec pause of the scheduler event loop looks like with this chart." I definitely agree this is helpful for shuffling. But there are other cases in which we're also worried about event loop blocking:
I think all of those cases above would be things this plot would show clearly, since they'd be brief but repeated interruptions instead of long but rare interruptions. So I also imagine this will be an improvement. But I still want to understand how misleading this will be on other cases before being sure of that judgement. |
I'm not sure if this is what you're asking, but it's unlikely that I take the time to form that GIF. You're absolutely right though that this won't catch everything. If that's your concern then I absolutely cede the point. In practice, smooth sailing, followed by a large pause, followed by smooth sailing isn't as common, I don't think, as general cruddy performance for an extended period of time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with both @gjoseph92 concern and @mrocklin urge to move forward.
I agree that many use cases would actually want to have a more detailed visualization, e.g. by using an additional EWMA as suggested. I think the use case of high performing extensions (e.g. stealing, shuffling) is where this is good enough but for loop health debugging in "ordinary" circumstances, I can easily see that we need more.
However, the first iteration doesn't need to be perfect. I think it's perfectly fine to ask for more but if @mrocklin is not willing to add more to this, that's also fine. In almost all situations it should be OK to not increase scope of a PR if the additional scope are improvements and not an actual problems. From what I understand, most raised concerns are enhancements on this PR and we can follow up. Is this understanding correct?
Re: "we should get rid of the measure-tick thing"
The way I see it, this system existed for a while now and this PR improves visibility on this metric. Whether this metric is a problem or not can be discussed. We're not stating this anywhere but I would consider most individual-*
plots to be experimental. If we remove the internal instrumentation supporting a given plot, we'll just remove the plot together with the instrumentation.
I had a minor comment about the worker heartbeat but this is not a blocker. If there is nothing else, we're good to go from my end
Woot. Green tests. |
I think your tests haven't even been scheduled, yet. There seems to be something wrong with GH actions :) |
My concern was about misleading users, or getting misleading bug reports like "the scheduler isn't doing anything and I see event loop blocked errors, but the event loop health plot looks fine." If this were going to be on a highly-visible page like |
Any objection to merging? |