Correcting for "Coordinated Omission" #128
Comments
So you bring up several points.

On the point of coordinated omission, if the expected time is known either in advance or retrospectively, then I think that Gil's ad hoc correction, or something very much like it, could be done at query time at very low cost. The problem is that the correction really only has an operational definition, and the mathematical effect of the correction is kind of obscure. I haven't looked in detail, but from your description, it sounds like every sample above the expected value is taken as ceiling(log(x / x_expected)) samples that are exponentially distributed down to x_expected. I think that this could just as well be phrased as a convolution of the observed mass with the exponential distribution when you query it. Implementation would proceed by treating every centroid above x_expected as evidence of a truncated exponential. That means that you could compute the cumulative distribution at any point by adding these distributions for all centroids greater than the value of x.

I do have a question about this, however, because it introduces a discontinuity in the CDF at x_expected. That doesn't make sense.

Your idea about correlating the number of requests against latency is actually pretty good, but I really think that the right way to do that is to persist copies of the t-digest at roughly the same rate that you persist the number of transactions. This lets you do a real time-series analysis. Trying to do any real mathematics against a decaying t-digest is just not a good idea.
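To make the "correction at query time" idea concrete, here is a minimal Java sketch (not anything the library ships; the class name is made up). It computes a coordinated-omission-corrected CDF directly from the centroids, redistributing the weight of any centroid whose mean exceeds the expected interval over the HDR-style backfill points (mean, mean − Δ, mean − 2Δ, …) instead of re-inserting samples. It assumes only the usual `centroids()`, `mean()`, and `count()` accessors; the exponential-convolution variant described above would swap the uniform backfill for a truncated exponential.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class CorrectedCdf {
    /**
     * CDF at q, corrected for coordinated omission at query time.
     * Each centroid with mean m > expectedInterval is treated as if its weight
     * were spread evenly over the backfill points m, m - d, m - 2d, ... (down
     * to >= d), mirroring HdrHistogram's correction without mutating the digest.
     */
    public static double cdf(TDigest digest, double expectedInterval, double q) {
        double totalWeight = 0;
        double weightAtOrBelowQ = 0;
        for (Centroid c : digest.centroids()) {
            double m = c.mean();
            double w = c.count();
            if (m <= expectedInterval) {
                // no correction needed: a single point mass at m
                totalWeight += w;
                if (m <= q) {
                    weightAtOrBelowQ += w;
                }
            } else {
                // k backfill points at m - j*d for j = 0 .. k-1, all >= d
                long k = (long) Math.floor(m / expectedInterval);
                long atOrBelowQ;
                if (q >= m) {
                    atOrBelowQ = k;
                } else {
                    // smallest j such that m - j*d <= q
                    long jMin = (long) Math.ceil((m - q) / expectedInterval);
                    atOrBelowQ = Math.max(0, k - jMin);
                }
                totalWeight += w * k;
                weightAtOrBelowQ += w * atOrBelowQ;
            }
        }
        return totalWeight == 0 ? Double.NaN : weightAtOrBelowQ / totalWeight;
    }
}
```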
In a sense doesn't the t-digest store the number of transactions as the total weight?
I guess the question of whether trying to reason about decaying t-digests is a good idea is the theme of both this issue and #127, so thanks for taking a clear stance on that. If I take that advice to heart, I need to rethink my approach to monitoring a bit. I'm going to look at storing and tracking hierarchical windows of historical t-digests.
Reviving this discussion so that we can drive to consensus. Regarding coordinated omission, I think that you can get the desired result by simply averaging CDFs for constant time windows without paying attention to the number of samples. Regarding decay, I am very clear that there isn't a good way to decay a single t-digest, and I am also convinced that the price of storing digests is low enough that you can get any desired effect that way. The hierarchical windows idea is a good way to go there. Further, because merging is cheap and easy, each level of the hierarchy can be a large factor smaller than the previous one. If you are willing to have a 100:1 ratio between windows and willing to merge up to 1000 digests, then with a smallest window of 1 minute, a single level of accumulation would allow you to look at a digest for 100,000 minutes, which is more than 2 months. That means that time dynamics on the level of decades are easily achievable. I will leave this open for a few more days if there is no discussion.
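To make the hierarchical-windows idea concrete, here is a rough Java sketch (the class and method names are my own, not part of the library): per-minute digests are kept in a small buffer and periodically rolled up into a coarser level by re-adding centroids with `add(mean, count)`, which needs nothing beyond the basic `TDigest` API.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical two-level window store: fine (e.g. 1-minute) digests that are
 *  periodically rolled up into coarse (e.g. 100-minute) digests. */
public class WindowedDigests {
    private final int fineWindowsPerCoarse;          // e.g. 100 (the 100:1 ratio)
    private final Deque<TDigest> fine = new ArrayDeque<>();
    private final Deque<TDigest> coarse = new ArrayDeque<>();

    public WindowedDigests(int fineWindowsPerCoarse) {
        this.fineWindowsPerCoarse = fineWindowsPerCoarse;
    }

    /** Called once per fine window (e.g. once a minute) with that window's digest. */
    public void closeWindow(TDigest windowDigest) {
        fine.addLast(windowDigest);
        if (fine.size() == fineWindowsPerCoarse) {
            TDigest rollup = TDigest.createMergingDigest(100);
            for (TDigest d : fine) {
                merge(rollup, d);
            }
            coarse.addLast(rollup);
            fine.clear();
        }
    }

    /** Merge by re-adding centroids; uses only centroids() and add(x, w). */
    private static void merge(TDigest into, TDigest from) {
        for (Centroid c : from.centroids()) {
            into.add(c.mean(), c.count());
        }
    }

    /** A digest covering all retained windows, fine and coarse. */
    public TDigest query() {
        TDigest result = TDigest.createMergingDigest(100);
        for (TDigest d : coarse) {
            merge(result, d);
        }
        for (TDigest d : fine) {
            merge(result, d);
        }
        return result;
    }
}
```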
Seeing no further discussion, I am closing this in advance of the 3.3 release.
Background on this issue
This isn't specifically a problem with the t-digest and does not apply to the primary use case for the t-digest. I'm creating this issue mostly as a forum for discussion. I have an ongoing goal to push the world of application monitoring towards a data structure that provides high-precision quantile estimation over extremely large ranges of values without manual configuration. In order to do this, there are a number of details which need to be worked out to supplant the existing dominant technique in this space, which is the HDR Histogram. Practically speaking, with good configuration, the HDR Histogram is a very effective data structure. My qualms with it are:

1. the configuration requirement
2. the size-precision trade-off that must be decided at configuration time
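For concreteness, this is roughly what the configuration difference looks like in Java (the parameter values here are arbitrary examples, not recommendations): HdrHistogram requires a value range and precision up front, while a t-digest only takes a compression factor and adapts to whatever values it sees.

```java
import org.HdrHistogram.Histogram;
import com.tdunning.math.stats.TDigest;

public class ConfigContrast {
    public static void main(String[] args) {
        // HdrHistogram: the highest trackable value and the number of significant
        // digits (i.e. the size/precision trade-off) must be chosen in advance.
        Histogram histogram = new Histogram(3_600_000_000L, 3);

        // t-digest: a single compression parameter; the value range and the
        // precision at the tails are handled automatically.
        TDigest digest = TDigest.createMergingDigest(100);

        histogram.recordValue(10_000);   // e.g. 10,000 microseconds
        digest.add(10_000);
    }
}
```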
In order to make the t-digest useful in the metrics context a number of details need to be sorted out. Some of the issues related to measurement are tracked in #127.
"Coordinated Omission"
Gil Tene in this talk (https://www.azul.com/files/HowNotToMeasureLatency_LLSummit_NYC_12Nov2013.pdf) introduced the concept of "coordinated omission", a common error made when trying to understand the tail latency of systems. He makes the valuable observation that in the context of application latency monitoring, there is often a relatively fixed concurrency. This means that in response to application stalls, the throughput during the period of the stall will be dramatically lowered.

Say, for example, you have one worker calling some function and you want to understand the performance characteristics of that function. You might imagine writing code that takes a timestamp before calling the function, takes a timestamp after, and records the elapsed duration. Let's say our computer occasionally stalls (e.g. stop-the-world GC) for some amount of time, and that the function always takes 10ms except when the system is stalled, in which case it takes the amount of the stall the request experiences + 10ms. Suppose we run this benchmark where the one thread constantly calls the function and exactly one stall occurs during the run. If we look at the data, we'll see exactly one data point that saw latency above 10ms. If the test ran for a long time, that means that tail latency, even pretty far out in the tail, will be reported as 10ms.
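A minimal Java benchmark loop of the kind described above (the `doWork` function and parameters are made up for illustration). Note that during a 10-second stall this loop issues only one request instead of the ~1,000 that a constant arrival rate would have produced, which is exactly the omission being coordinated:

```java
import com.tdunning.math.stats.TDigest;

public class NaiveLatencyBenchmark {
    public static void main(String[] args) {
        TDigest latencies = TDigest.createMergingDigest(100);

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            doWork();                       // hypothetical ~10ms operation
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Only the single in-flight request observes a stall; all the
            // requests that "would have" arrived during the stall are never
            // issued, so the digest under-represents the tail.
            latencies.add(elapsedMs);
        }

        System.out.println("p99.9 = " + latencies.quantile(0.999) + " ms");
    }

    private static void doWork() {
        try {
            Thread.sleep(10);               // stand-in for the measured function
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```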
The mistake here is not the data, per se, but the metric the user perceives themselves to be tracking. The user probably wants to understand the shape of the distribution for a request that arrives in the system at an arbitrary point in time, but instead they are measuring the distribution for a series of requests issued serially from a connection (or several; the problem plagues all fixed-concurrency workloads measured this way). The image below clarifies the point (https://user-images.githubusercontent.com/1839234/57586153-a4903300-74bf-11e9-9492-ba442695a6c3.png).
Correcting for "coordinated omission"
Fortunately, the technique used in the HDR Histogram to correct for coordinated omission can be directly applied to the t-digest. The basic idea is that the user picks some value that they "expect" an operation to take; for every recorded data point that exceeds that expected interval, the code adds extra values to the histogram, starting at the recorded value minus the expected interval and stepping down by the expected interval each time, until the value drops below the expected interval. This same approach can be applied either while recording data into the histogram (addWhileCorrectingForCoordinatedOmission) or after the fact (copyCorrectedForCoordinatedOmission).
When t-digest counts are assumed to be one per request, an almost identical technique could be used. Of course, this transformation makes the counts somewhat meaningless. The t-digest can probably help users pick a good "expected value", but I'll leave that for later.
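A hedged sketch of what such a copy-and-correct helper might look like for a t-digest (this is not an existing API; it just transliterates the HdrHistogram rule onto centroids, treating each centroid's mean as a stand-in for its samples):

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class CoordinatedOmission {
    /**
     * Returns a new digest with HdrHistogram-style correction applied:
     * for every centroid whose mean exceeds expectedInterval, synthetic
     * values mean - expectedInterval, mean - 2*expectedInterval, ... are
     * added (each with the centroid's weight) until they fall below
     * expectedInterval. Approximate, since centroids summarize samples.
     */
    public static TDigest copyCorrected(TDigest source, double expectedInterval) {
        TDigest corrected = TDigest.createMergingDigest(100);
        for (Centroid c : source.centroids()) {
            corrected.add(c.mean(), c.count());
            for (double v = c.mean() - expectedInterval;
                 v >= expectedInterval;
                 v -= expectedInterval) {
                corrected.add(v, c.count());
            }
        }
        corrected.compress();
        return corrected;
    }
}
```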
Correcting for Coordinated Omission
The straightforward action item I propose is to provide a mechanism similar to org.HdrHistogram.AbstractHistogram.copyCorrectedForCoordinatedOmission (https://hdrhistogram.github.io/HdrHistogram/JavaDoc/org/HdrHistogram/AbstractHistogram.html#copyCorrectedForCoordinatedOmission-long-) for t-digests. Sure, one could already implement the logic to do the correction at addition time if they knew the expected interval, but that would require up-front configuration, and avoiding up-front configuration is, for me at least, a major selling point of the t-digest.
The technique used by the HdrHistogram is far from perfect for monitoring use cases. I suspect that this is a reason that, even though the HdrHistogram gets wide use, people don't exploit its coordinated omission correction features as often as one might expect.
One unfortunate property of the correction approach is that it takes time on the order of the number of correction points that need to be added to the data structure. This could be a very large number in the case of a large stall. Any operation that costs more than O(centroids) in a t-digest could be problematic for online monitoring use cases. Is there a way to add the weight due to the correction points more efficiently? (One possible approach is sketched after these questions.)
When the HdrHistogram does its correction, it adds the values in order of diminishing weight; are there biasing problems due to the sequential order of these additions?
Can we come up with a technique to add the weight over the interval that only costs O(centroids which need to be updated) and decreases the bias introduced along the way?
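One possible answer to the first question above, sketched here purely as an illustration and not a worked-out proposal: instead of adding roughly v/Δ individual correction points for a large value v, approximate the evenly spaced run of points by a small, fixed number of weighted points at the midpoints of equal slices, so the cost per corrected value is O(1) regardless of how long the stall was.

```java
import com.tdunning.math.stats.TDigest;

public class FastCorrection {
    /**
     * Adds the coordinated-omission correction weight for one large value v in
     * O(slices) time instead of O(v / d). The exact correction would add points
     * v - d, v - 2d, ... while they stay >= d; here that evenly spaced run is
     * approximated by `slices` weighted points, each placed at the midpoint of
     * its slice and carrying the slice's total count.
     */
    public static void addCorrectedApprox(TDigest digest, double v, double d,
                                          int weight, int slices) {
        long n = (long) Math.floor(v / d) - 1;   // number of exact correction points
        if (n <= 0) {
            return;                              // nothing to correct
        }
        if (n <= slices) {
            for (long j = 1; j <= n; j++) {      // few points: add them exactly
                digest.add(v - j * d, weight);
            }
            return;
        }
        long perSlice = n / slices;              // exact points represented per slice
        for (int s = 0; s < slices; s++) {
            long firstJ = s * perSlice + 1;      // index of largest value in slice
            long lastJ = (s + 1) * perSlice;     // index of smallest value in slice
            double mid = v - d * (firstJ + lastJ) / 2.0;
            digest.add(mid, (int) (weight * perSlice));
        }
        // The n % slices leftover points are simply dropped in this sketch.
    }
}
```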
Correcting t-digests with changing weights
The point I want to discuss is how to think about this correction factor when weights are not 1 per request but rather might be decaying, as in #127 or something like it. I'm curious whether I can spark interest in thinking about how to correct decaying t-digests appropriately for coordinated omission.
As an additional curiosity, I wonder whether, by recording additional state in each centroid to correctly perform on-line decaying (as presented in Exponential weighted averages with irregular sampling, http://tdunning.blogspot.com/2011/03/exponential-weighted-averages-with.html), we could then detect changes in throughput due to changes in latency without changes in concurrency. By observing from that additional state that the middle of the distribution contains older data while the tail centroids contain newer data, and assuming that the values stored represent durations, could it be possible to approximate whether and how to perform a correction for coordinated omission?
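To make that last idea a bit more concrete, here is a purely hypothetical sketch (the per-centroid timestamps it assumes are not something the t-digest currently maintains): given each centroid's weighted-average last-update time, ordered by ascending centroid mean, flag possible coordinated omission when the tail centroids are systematically newer than the mid-quantile centroids.

```java
import java.util.Arrays;

public class StalenessCheck {
    /**
     * Hypothetical heuristic: updateTimes holds a per-centroid weighted-average
     * update time (epoch ms), ordered by ascending centroid mean. Reports
     * whether the top tail looks markedly "fresher" than the middle of the
     * distribution, which would hint that recent samples were slow and that a
     * coordinated-omission correction may be warranted.
     */
    public static boolean tailLooksFresher(long[] updateTimes,
                                           double tailFraction, long thresholdMs) {
        int n = updateTimes.length;
        if (n < 8) {
            return false;                       // too few centroids to judge
        }
        int tailStart = (int) Math.floor(n * (1.0 - tailFraction));
        int midLo = n / 4;
        int midHi = 3 * n / 4;

        long midMedian = median(Arrays.copyOfRange(updateTimes, midLo, midHi));
        long tailMedian = median(Arrays.copyOfRange(updateTimes, tailStart, n));

        // Tail centroids updated much more recently than the middle ones.
        return tailMedian - midMedian > thresholdMs;
    }

    private static long median(long[] xs) {
        long[] sorted = xs.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }
}
```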