
Decay TDigest #198

Open
rnataf opened this issue Aug 31, 2022 · 4 comments
rnataf commented Aug 31, 2022

Hi, I am interested in a time-decayed version of TDigest. I saw there was a discussion about it in issue #127. Has there been any progress on this problem?

tdunning (Owner) commented Sep 1, 2022 via email

rnataf (Author) commented Sep 1, 2022

Hi Ted, thank you for your answer.
I want to make sure I understand you correctly:

  • My first idea was, at each time period, to divide the weight of all the centroids by some factor. That way, the more time passes, the less influence old data has. If I understand you correctly, this may be a bad approach, since we are then no longer guaranteed that the centroids in the tails contain/represent a single element, and it is not clear what accuracy we would get. Conversely, we could instead multiply the weight of a new entry (or simulate inserting the same entry several times) in order to give more importance to new data. That way, it would still hold that the centroids at the tails contain a single entry.
  • What you're suggesting, though, is to use several instances of TDigest, one per defined time period, and then to merge them. Is that correct? If so, I don't see how that would simulate a time decay. Can you enlighten me a little more, please?
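To make the first idea concrete, here is a toy sketch (not the t-digest library; all names are made up for illustration) of decaying every centroid's weight at each period boundary:

```python
# Toy illustration of per-period weight decay. Each centroid is a
# [mean, weight] pair; at every period boundary all weights shrink,
# so older data loses influence.

class DecayedCentroids:
    def __init__(self, decay=0.5):
        self.decay = decay
        self.centroids = []  # list of [mean, weight]

    def new_period(self):
        # Age all existing data: old points matter less from now on.
        for c in self.centroids:
            c[1] *= self.decay

    def add(self, x, weight=1.0):
        self.centroids.append([x, weight])

    def total_weight(self):
        return sum(w for _, w in self.centroids)

d = DecayedCentroids(decay=0.5)
d.add(10.0)
d.new_period()   # the weight of the 10.0 sample becomes 0.5
d.add(20.0)
# Total weight is now 1.5 for two samples: after decay, a weight of 1
# no longer means "exactly one sample", which is the tail-accuracy
# concern raised in this thread.
```

The point of the sketch is only that decayed weights become fractional, so weight stops being a reliable sample count.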

tdunning (Owner) commented Sep 2, 2022

Yes, you have the first point correct. Suppose a centroid has one sample and its weight decays to 1/2; then another sample is added and the weight decays by a factor of 2/3. We now have a weight of 1, but it represents two samples. Normally, we would use a singleton centroid specially for interpolation purposes to gain accuracy, but we can't do that any more.

One fix for this would be to keep separate counters for the number of samples and the total weight of a centroid. But once a centroid has at least one sample, it can never decay to the point where it can be merged if it is in the tail, and thus you force other samples to be merged.
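A minimal sketch of that fix, using the decay sequence from the example above (this is illustrative only, not the library's data structure):

```python
from dataclasses import dataclass

@dataclass
class Centroid:
    mean: float
    weight: float  # decayed weight, used for quantile arithmetic
    count: int     # raw number of samples; never decays

    def decay(self, factor):
        self.weight *= factor

c = Centroid(mean=10.0, weight=1.0, count=1)
c.decay(0.5)        # one sample decays to weight 0.5
c.weight += 1.0     # a second sample arrives (same mean, for simplicity)
c.count += 1
c.decay(2 / 3)      # weight is back to 1.0, but count == 2

# weight == 1.0 no longer implies a singleton, so the centroid cannot
# be used for tail interpolation; and since count >= 1 forever, it can
# never decay away once it sits in a tail.
```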

Regarding your second point: the most common reason to want exponential decay of a digest is to estimate the distribution over a particular (recent) time period. With digests over short time ranges, you don't have to estimate ... you can just combine the pertinent digests and you have exactly what you want.

If you want some sort of weighting over a time period, you can do that too. When you want to do the quantile or CDF operation, you translate it into composite operations on each set of digests that share the same weight.
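One way to read this (a hedged sketch, using a sorted sample list as a stand-in for a real per-period digest; `cdf`, `combined_cdf`, and `combined_quantile` are made-up names): the combined CDF is a weighted mixture of the per-period CDFs, and a quantile is found by inverting it.

```python
import bisect

# Toy stand-in for one period's digest: its sorted samples.
# A real implementation would call each period's t-digest cdf() instead.
def cdf(samples, x):
    return bisect.bisect_right(samples, x) / len(samples)

def combined_cdf(period_samples, period_weights, x):
    # Weighted mixture of the per-period distributions.
    total = sum(period_weights)
    return sum(w * cdf(s, x)
               for s, w in zip(period_samples, period_weights)) / total

def combined_quantile(period_samples, period_weights, q, lo, hi, iters=50):
    # Invert the combined CDF by bisection on [lo, hi].
    for _ in range(iters):
        mid = (lo + hi) / 2
        if combined_cdf(period_samples, period_weights, mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

old = sorted([1.0, 2.0, 3.0, 4.0])
new = sorted([10.0, 20.0, 30.0, 40.0])
# Weight the recent period 3x more than the old one.
q50 = combined_quantile([old, new], [1.0, 3.0], 0.5, 0.0, 50.0)
```

With the recent period weighted 3:1, the combined median lands among the recent samples (at 20.0 here) rather than between the two periods.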

rnataf (Author) commented Sep 4, 2022

Regarding your last point: I understand that what you suggest is to have one digest per time period. Then I can decrease the weights of "old" digests once I am past them. Let's say I'm now in time period 2 and I have decreased the weights of the digests for time periods 0 and 1. When I want to compute a quantile, I should merge the digests of periods 0, 1, and 2 and then compute the quantile on the merged result. Did I understand you correctly?
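That workflow can be sketched as follows (again with sorted sample lists standing in for per-period digests, and an assumed exponential weight per period of age; `weighted_quantile` is a made-up helper, not library API):

```python
# Periods 0, 1, 2; period 2 is current. Older periods are
# down-weighted exponentially before merging.
decay = 0.5
periods = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [decay ** (len(periods) - 1 - i) for i in range(len(periods))]
# weights == [0.25, 0.5, 1.0]: the current period dominates.

# "Merge": every sample carries its period's weight.
merged = sorted((x, w) for p, w in zip(periods, weights) for x in p)

def weighted_quantile(pairs, q):
    # Walk the sorted (value, weight) pairs until the cumulative
    # weight reaches the q-fraction of the total.
    total = sum(w for _, w in pairs)
    running = 0.0
    for x, w in pairs:
        running += w
        if running >= q * total:
            return x
    return pairs[-1][0]

median = weighted_quantile(merged, 0.5)
# The down-weighting pulls the median toward the current period's data.
```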
