Correcting for "Coordinated Omission" #128
Comments
So you bring up several points.

On the point of coordinated omission, if the expected time is known either in advance or retrospectively, then I think that Gil's ad hoc correction, or something very much like it, could be done at query time at very low cost. The problem is that the correction really only has an operational definition, and the mathematical effect of the correction is kind of obscure. I haven't looked in detail, but from your description, it sounds like every sample above the expected value is taken as ceiling(log(x / x_expected)) samples that are exponentially distributed down to x_expected. I think that this could just as well be phrased as a convolution of the observed mass with the exponential distribution when you query it. Implementation would proceed by treating every centroid above x_expected as evidence of a truncated exponential. That means that you could compute the cumulative distribution at any point by adding these distributions for all centroids greater than the value of x.

I do have a question about this, however, because it introduces a discontinuity in the CDF at x_expected. That doesn't make sense.

Your idea about correlating the number of requests against latency is actually pretty good, but I really think that the right way to do that is to persist copies of the t-digest at roughly the same rate that you persist the number of transactions. This lets you do a real time-series analysis. Trying to do any real mathematics against a decaying t-digest is just not a good idea.
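To make the "correction at query time" idea concrete, here is a minimal Java sketch (not anything the library ships; the class name is made up). It computes a coordinated-omission-corrected CDF directly from the centroids, redistributing the weight of any centroid whose mean exceeds the expected interval over the HDR-style backfill points (mean, mean − Δ, mean − 2Δ, …) instead of re-inserting samples. It assumes only the usual `centroids()`, `mean()`, and `count()` accessors; the exponential-convolution variant described above would swap the uniform backfill for a truncated exponential.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class CorrectedCdf {
    /**
     * CDF at q, corrected for coordinated omission at query time.
     * Each centroid with mean m > expectedInterval is treated as if its weight
     * were spread evenly over the backfill points m, m - d, m - 2d, ... (down
     * to >= d), mirroring HdrHistogram's correction without mutating the digest.
     */
    public static double cdf(TDigest digest, double expectedInterval, double q) {
        double totalWeight = 0;
        double weightAtOrBelowQ = 0;
        for (Centroid c : digest.centroids()) {
            double m = c.mean();
            double w = c.count();
            if (m <= expectedInterval) {
                // no correction needed: a single point mass at m
                totalWeight += w;
                if (m <= q) {
                    weightAtOrBelowQ += w;
                }
            } else {
                // k backfill points at m - j*d for j = 0 .. k-1, all >= d
                long k = (long) Math.floor(m / expectedInterval);
                long atOrBelowQ;
                if (q >= m) {
                    atOrBelowQ = k;
                } else {
                    // smallest j such that m - j*d <= q
                    long jMin = (long) Math.ceil((m - q) / expectedInterval);
                    atOrBelowQ = Math.max(0, k - jMin);
                }
                totalWeight += w * k;
                weightAtOrBelowQ += w * atOrBelowQ;
            }
        }
        return totalWeight == 0 ? Double.NaN : weightAtOrBelowQ / totalWeight;
    }
}
```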
In a sense doesn't the t-digest store the number of transactions as the total weight?
I guess the question of whether trying to reason about decaying t-digests is a good idea is the theme of both this issue and #127, so thanks for taking a clear stance on that. If I take that advice to heart, I need to rethink my approach to monitoring a bit. I'm going to look at storing and tracking hierarchical windows of historical t-digests.
Reviving this discussion so that we can drive to consensus. Regarding coordinated omission, I think that you can get the desired result by simply averaging CDFs for constant time windows without paying attention to the number of samples. Regarding decay, I am very clear that there isn't a good way to decay a single t-digest, and I am also convinced that the price of storing digests is low enough that you can get any desired effect that way. The hierarchical windows idea is a good way to go there. Further, because merging is cheap and easy, each level of the hierarchy can be a large factor smaller than the previous one. If you are willing to have a 100:1 ratio between windows and willing to merge up to 1000 digests, then with a smallest window of 1 minute, a single level of accumulation would allow you to look at a digest for 100,000 minutes, which is more than 2 months. That means that time dynamics on the level of decades are easily achievable. I will leave this open for a few more days if there is no discussion.
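To make the hierarchical-windows idea concrete, here is a rough Java sketch (the class and method names are my own, not part of the library): per-minute digests are kept in a small buffer and periodically rolled up into a coarser level by re-adding centroids with `add(mean, count)`, which needs nothing beyond the basic `TDigest` API.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical two-level window store: fine (e.g. 1-minute) digests that are
 *  periodically rolled up into coarse (e.g. 100-minute) digests. */
public class WindowedDigests {
    private final int fineWindowsPerCoarse;          // e.g. 100 (the 100:1 ratio)
    private final Deque<TDigest> fine = new ArrayDeque<>();
    private final Deque<TDigest> coarse = new ArrayDeque<>();

    public WindowedDigests(int fineWindowsPerCoarse) {
        this.fineWindowsPerCoarse = fineWindowsPerCoarse;
    }

    /** Called once per fine window (e.g. once a minute) with that window's digest. */
    public void closeWindow(TDigest windowDigest) {
        fine.addLast(windowDigest);
        if (fine.size() == fineWindowsPerCoarse) {
            TDigest rollup = TDigest.createMergingDigest(100);
            for (TDigest d : fine) {
                merge(rollup, d);
            }
            coarse.addLast(rollup);
            fine.clear();
        }
    }

    /** Merge by re-adding centroids; uses only centroids() and add(x, w). */
    private static void merge(TDigest into, TDigest from) {
        for (Centroid c : from.centroids()) {
            into.add(c.mean(), c.count());
        }
    }

    /** A digest covering all retained windows, fine and coarse. */
    public TDigest query() {
        TDigest result = TDigest.createMergingDigest(100);
        for (TDigest d : coarse) {
            merge(result, d);
        }
        for (TDigest d : fine) {
            merge(result, d);
        }
        return result;
    }
}
```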
Seeing no further discussion, I am closing this in advance of the 3.3 release.
Background on this issue
This isn't specifically a problem with the t-digest and does not apply to the primary use case for the t-digest. I'm creating this issue mostly as a forum for discussion. I have an ongoing goal to push the world of application monitoring towards a data structure that provides high-precision quantile estimation over extremely large ranges of values without manual configuration. In order to do this, there are a number of details which need to be worked out to supplant the existing dominant technique in this space, which is the HDR Histogram. Practically speaking, with good configuration, the HDR Histogram is a very effective data structure. My qualms with it are:

1. the configuration requirement
2. the size-precision trade-off that must be decided at configuration time
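For concreteness, this is roughly what the configuration difference looks like in Java (the parameter values here are arbitrary examples, not recommendations): HdrHistogram requires a value range and precision up front, while a t-digest only takes a compression factor and adapts to whatever values it sees.

```java
import org.HdrHistogram.Histogram;
import com.tdunning.math.stats.TDigest;

public class ConfigContrast {
    public static void main(String[] args) {
        // HdrHistogram: the highest trackable value and the number of significant
        // digits (i.e. the size/precision trade-off) must be chosen in advance.
        Histogram histogram = new Histogram(3_600_000_000L, 3);

        // t-digest: a single compression parameter; the value range and the
        // precision at the tails are handled automatically.
        TDigest digest = TDigest.createMergingDigest(100);

        histogram.recordValue(10_000);   // e.g. 10,000 microseconds
        digest.add(10_000);
    }
}
```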
In order to make the t-digest useful in the metrics context a number of details need to be sorted out. Some of the issues related to measurement are tracked in #127.
"Coordinated Omission"
Gil Tene in this talk (https://www.azul.com/files/HowNotToMeasureLatency_LLSummit_NYC_12Nov2013.pdf) introduced the concept of "coordinated omission", a common error made when trying to understand the tail latency of systems. He makes the valuable observation that in the context of application latency monitoring, there is often a relatively fixed concurrency. This means that in response to application stalls, the throughput during the period of the stall will be dramatically lowered.

Say, for example, you have one worker calling some function and you want to understand the performance characteristics of that function. You might imagine writing code that takes a timestamp before calling the function, takes a timestamp after, and records the elapsed duration. Let's say our computer occasionally stalls (e.g. stop-the-world GC) for some amount of time, and that the function always takes 10ms except when the system is stalled, in which case it takes the amount of the stall the request experiences + 10ms. Suppose we run this benchmark where the one thread constantly calls the function and exactly one stall occurs during the run. If we look at the data, we'll see exactly one data point that saw latency above 10ms. If the test ran for a long time, that means that tail latency, even pretty far out in the tail, will be reported as 10ms.
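A minimal Java benchmark loop of the kind described above (the `doWork` function and parameters are made up for illustration). Note that during a 10-second stall this loop issues only one request instead of the ~1,000 that a constant arrival rate would have produced, which is exactly the omission being coordinated:

```java
import com.tdunning.math.stats.TDigest;

public class NaiveLatencyBenchmark {
    public static void main(String[] args) {
        TDigest latencies = TDigest.createMergingDigest(100);

        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            doWork();                       // hypothetical ~10ms operation
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // Only the single in-flight request observes a stall; all the
            // requests that "would have" arrived during the stall are never
            // issued, so the digest under-represents the tail.
            latencies.add(elapsedMs);
        }

        System.out.println("p99.9 = " + latencies.quantile(0.999) + " ms");
    }

    private static void doWork() {
        try {
            Thread.sleep(10);               // stand-in for the measured function
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```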
The mistake here is not the data, per se, but the metric the user perceives themselves to be tracking. The user probably wants to understand the shape of the distribution for a request that arrives in the system at an arbitrary point in time, but instead they are measuring the distribution for a series of requests issued serially from a connection (or several; the problem plagues all fixed-concurrency workloads measured this way). The image below clarifies the point (https://user-images.githubusercontent.com/1839234/57586153-a4903300-74bf-11e9-9492-ba442695a6c3.png).
Correcting for "coordinated omission"
Fortunately, the technique used in the HDR Histogram to correct for coordinated omission can be directly applied to the t-digest. The basic idea is that the user picks some value that they "expect" an operation to take; for every recorded data point that exceeds that expected interval, the code adds extra values to the histogram, starting at the recorded value minus the expected interval and stepping down by the expected interval each time, until the value drops below the expected interval. This same approach can be applied either while recording data into the histogram (addWhileCorrectingForCoordinatedOmission) or after the fact (copyCorrectedForCoordinatedOmission).
When t-digest counts are assumed to be one per request, an almost identical technique could be used. Of course, this transformation makes the counts somewhat meaningless. The t-digest can probably help users pick a good "expected value", but I'll leave that for later.
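A hedged sketch of what such a copy-and-correct helper might look like for a t-digest (this is not an existing API; it just transliterates the HdrHistogram rule onto centroids, treating each centroid's mean as a stand-in for its samples):

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

public class CoordinatedOmission {
    /**
     * Returns a new digest with HdrHistogram-style correction applied:
     * for every centroid whose mean exceeds expectedInterval, synthetic
     * values mean - expectedInterval, mean - 2*expectedInterval, ... are
     * added (each with the centroid's weight) until they fall below
     * expectedInterval. Approximate, since centroids summarize samples.
     */
    public static TDigest copyCorrected(TDigest source, double expectedInterval) {
        TDigest corrected = TDigest.createMergingDigest(100);
        for (Centroid c : source.centroids()) {
            corrected.add(c.mean(), c.count());
            for (double v = c.mean() - expectedInterval;
                 v >= expectedInterval;
                 v -= expectedInterval) {
                corrected.add(v, c.count());
            }
        }
        corrected.compress();
        return corrected;
    }
}
```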
Correcting for Coordinated Omission
The straightforward action item I propose is to provide a mechanism similar to org.HdrHistogram.AbstractHistogram.copyCorrectedForCoordinatedOmission (https://hdrhistogram.github.io/HdrHistogram/JavaDoc/org/HdrHistogram/AbstractHistogram.html#copyCorrectedForCoordinatedOmission-long-) for t-digests. Sure, one could already implement the logic to do the correction at addition time if they knew the expected interval, but that would require up-front configuration, and avoiding up-front configuration is, for me at least, a major selling point of the t-digest.
The technique used by the HdrHistogram is far from perfect for monitoring use cases. I suspect that this is a reason that, even though the HdrHistogram gets wide use, people don't exploit its coordinated omission correction features as often as one might expect.
One unfortunate property of the correction approach is that it takes time on the order of the number of correction points that need to be added to the data structure. This could be a very large number in the case of a large stall. Any operation that costs more than O(centroids) in a t-digest could be problematic for online monitoring use cases. Is there a way to add the weight due to the correction points more efficiently? (One possible approach is sketched after these questions.)
When the HdrHistogram does its correction, it adds the values in order of diminishing weight; are there biasing problems due to the sequential order of these additions?
Can we come up with a technique to add the weight over the interval that only costs O(centroids which need to be updated) and decreases the bias introduced along the way?
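One possible answer to the first question above, sketched here purely as an illustration and not a worked-out proposal: instead of adding roughly v/Δ individual correction points for a large value v, approximate the evenly spaced run of points by a small, fixed number of weighted points at the midpoints of equal slices, so the cost per corrected value is O(1) regardless of how long the stall was.

```java
import com.tdunning.math.stats.TDigest;

public class FastCorrection {
    /**
     * Adds the coordinated-omission correction weight for one large value v in
     * O(slices) time instead of O(v / d). The exact correction would add points
     * v - d, v - 2d, ... while they stay >= d; here that evenly spaced run is
     * approximated by `slices` weighted points, each placed at the midpoint of
     * its slice and carrying the slice's total count.
     */
    public static void addCorrectedApprox(TDigest digest, double v, double d,
                                          int weight, int slices) {
        long n = (long) Math.floor(v / d) - 1;   // number of exact correction points
        if (n <= 0) {
            return;                              // nothing to correct
        }
        if (n <= slices) {
            for (long j = 1; j <= n; j++) {      // few points: add them exactly
                digest.add(v - j * d, weight);
            }
            return;
        }
        long perSlice = n / slices;              // exact points represented per slice
        for (int s = 0; s < slices; s++) {
            long firstJ = s * perSlice + 1;      // index of largest value in slice
            long lastJ = (s + 1) * perSlice;     // index of smallest value in slice
            double mid = v - d * (firstJ + lastJ) / 2.0;
            digest.add(mid, (int) (weight * perSlice));
        }
        // The n % slices leftover points are simply dropped in this sketch.
    }
}
```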
Correcting t-digests with changing weights
The point I want to discuss is how to think about this correction factor when weights are not 1 per request but rather might be decaying, as in #127 or something like it. I'm curious whether I can spark interest in thinking about how to correct decaying t-digests appropriately for coordinated omission.
As an additional curiosity, I wonder whether, by recording additional state in each centroid to correctly perform on-line decaying (as presented in Exponential weighted averages with irregular sampling, http://tdunning.blogspot.com/2011/03/exponential-weighted-averages-with.html), we could then detect changes in throughput due to changes in latency without changes in concurrency. By observing from that additional state that the middle of the distribution contains older data while the tail centroids contain newer data, and assuming that the values stored represent durations, could it be possible to approximate whether and how to perform a correction for coordinated omission?
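To make that last idea a bit more concrete, here is a purely hypothetical sketch (the per-centroid timestamps it assumes are not something the t-digest currently maintains): given each centroid's weighted-average last-update time, ordered by ascending centroid mean, flag possible coordinated omission when the tail centroids are systematically newer than the mid-quantile centroids.

```java
import java.util.Arrays;

public class StalenessCheck {
    /**
     * Hypothetical heuristic: updateTimes holds a per-centroid weighted-average
     * update time (epoch ms), ordered by ascending centroid mean. Reports
     * whether the top tail looks markedly "fresher" than the middle of the
     * distribution, which would hint that recent samples were slow and that a
     * coordinated-omission correction may be warranted.
     */
    public static boolean tailLooksFresher(long[] updateTimes,
                                           double tailFraction, long thresholdMs) {
        int n = updateTimes.length;
        if (n < 8) {
            return false;                       // too few centroids to judge
        }
        int tailStart = (int) Math.floor(n * (1.0 - tailFraction));
        int midLo = n / 4;
        int midHi = 3 * n / 4;

        long midMedian = median(Arrays.copyOfRange(updateTimes, midLo, midHi));
        long tailMedian = median(Arrays.copyOfRange(updateTimes, tailStart, n));

        // Tail centroids updated much more recently than the middle ones.
        return tailMedian - midMedian > thresholdMs;
    }

    private static long median(long[] xs) {
        long[] sorted = xs.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }
}
```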