-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-order terms in division to avoid periodic decimal and rounding issues in Centroid#add, fixing out-of-order centroids #209
base: main
Are you sure you want to change the base?
Conversation
…al and rounding issues in Centroid#add, fixing out-of-order centroids
Not sure if the Windows errors are legit or just an issue with Windows GH actions. |
Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example. The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended. The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root. In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit. |
👍
Good to know!!
I believe that could explain why the other implementations I've seen so far appear to be significantly simpler than both AVL & Merge digests here. Looks to me like FaceBook's C++ impl, Arrow, CamDavidsonPilon/tdigest, the other one for Postgres, etc., have adapted and also simplified the code. But any way, happy to help with the AVL if there are easy issues, otherwise I will probably start playing with the merge t-digest now 🙂 Let me know if you'd like me to close this PR, or leave it open (maybe as draft?) for some future work. Thanks @tdunning ! |
We need to keep this open if there isn't another issue for this.
Regarding simplicity, a big part of the complexity in the Java
implementation comes from the fact that I wanted a vector of structs which
is very inefficient due to boxing. Thus, the struct with multiple vectors.
This shift required a special sort implementation and makes all of the
loops less obvious.
The Julia implementation is the opposite. Julia emits efficient code for an
array of structs and the internal sort handles that vector cleanly.
This same would apply to C++ if you use the standard library. For python,
you could do something similar and depending on a built in sort would
actually be pretty important for performance.
…On Tue, May 9, 2023 at 8:43 AM Bruno P. Kinoshita ***@***.***> wrote:
Your change might help, but there is a fundamental problem in
AVLTreeDigest that order is not absolutely guaranteed due to an entire
class of problems of which the one that you point out is only an example.
👍
The problem is far worse when you have several centroids with identical
means. In such a case, any change to the mean should result in potentially
large changes to the order in the tree, but the code doesn't handle this
well. This (and my inability to find enough time to dig in on the problem)
is one of the two major reasons that MergingDigest is strongly recommended.
Good to know!!
The other reason is that AVLTreeDigest hasn't had the serious attention to
detail regarding interpolation special cases that MergingDigest has had.
Again, scarcity of time is at the root
In any case, AVLTreeDigest requires dynamic allocation and there is no
chance of a zero serialization reader or writer. These are secondary
issues, but still weigh in a bit.
I believe that could explain why the other implementations I've seen so
far appear to be significantly simpler than both AVL & Merge digests here.
Looks to me like FaceBook's C++ impl, Arrow, CamDavidsonPilon/tdigest, the
other one for Postgres, etc., have adapted and also simplified the code.
But any way, happy to help with the AVL if there are easy issues,
otherwise I will probably start playing with the merge t-digest now 🙂
Let me know if you'd like me to close this PR, or leave it open (maybe as
draft?) for some future work.
Thanks @tdunning <https://github.com/tdunning> !
—
Reply to this email directly, view it on GitHub
<#209 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6SKBE64EMG55TIYXWLXFJQZJANCNFSM6AAAAAAXZ75SW4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Closes #169
Hi @tdunning 👋
I thought this issue was interesting and would allow me to understand more about the t-digest. Tweaking the values used in the tests, I found that changing the compression to
5
(in thefactory()
method) and using this would always reproduce the bug.Debugging a little more, I realized the issue was happening when the
count[]
array was updated, and the value of a centroid became3
(search the line withcount = 3
, note how the value is out of order).A little more debugging and I think I found what was happening. Apparently the
Centroid#add
method could end up having to calculate thiscentroid += w * (x - centroid) / count;
, which would not always return the expected value (see the third/sixth/eleventh lines)I believe it's due to multiplying by
3
to then divide by3
not being computed in a stable way. I shuffled the terms in the expression, hopefully without changing what it was doing, and I think at least that error should now be fixed? I also tried to add a simple test to reproduce it.Fixing this should produce centroids with the correct values and order, thus fixing #169
Cheers
Bruno