bug: probabilities not integrating to 1. #292
@mathematicalmichael I know this is really old, but I know exactly what is happening here. When no input samples map under the forward map to a cell in the output probability sample set, the probability of that cell is not distributed to the parameter space. If the probabilities sum to something less than one, it really indicates that you are not sampling (or emulating) enough in the input space for the given output probability sample set. In this case, where the dimension of the output space is 4, changing the number of cells per dimension from 1 to 2 changes the relative volume of each cell from 1.0 to (0.5)^4 = 0.0625. In the case above, one of those 16 cells is not getting mapped to, so its probability is not being distributed. We should probably keep the behavior as it is, but raise a warning that you are undersampling whenever the probabilities don't sum to 1.
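The arithmetic behind this explanation can be checked directly (a minimal sketch; the 4-D output space, 2 cells per dimension, and a uniform observed density are the assumptions taken from the example above):

```python
# Relative volume of one cell when a 4-D output space is split
# into 2 cells per dimension: each cell covers (1/2)^4 of the space.
dim = 4
cells_per_dim = 2
total_cells = cells_per_dim ** dim          # 16 cells
cell_volume = (1.0 / cells_per_dim) ** dim  # 0.0625

# If exactly one cell receives no samples under the forward map, its
# probability (equal to its relative volume under a uniform observed
# density) is never distributed to the parameter space:
undistributed = 1.0 * cell_volume
print(total_cells)           # 16
print(cell_volume)           # 0.0625
print(1.0 - undistributed)   # 0.9375 -- the sum reported in this issue
```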
This also shows that with …
@smattis thanks for the explanation. Dimensionality certainly can hit us, eh? =) Perhaps adding a …
This is basically a violation of a discrete/set-valued form of the "predictability assumption." It is roughly equivalent to the expected value of r not being near 1 in the sample-based density approach. We should probably phrase the warning so that it suggests the user may want to use a different solution approach (e.g., the sampling-based approach instead of the set-valued approach) to avoid this issue.
@eecsu any thoughts on the phrasing of what I put up? Keep all or part of it? Or replace it entirely with a suggestion to use the sampling approach (once that gets merged)? I think it's helpful to explain some of the context behind why we raise a warning. We can discuss in this thread.
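One possible shape for such a check (a sketch only, not BET's API: the function name and message wording are placeholders, and the attribute layout follows the `_input_sample_set._probabilities` convention used later in this thread):

```python
import warnings

import numpy as np


def warn_if_undersampled(probabilities, tol=1e-8):
    """Warn when discrete probabilities do not sum to 1, which indicates
    some output cells received no samples (undersampled input space)."""
    total = float(np.sum(probabilities))
    if abs(total - 1.0) > tol:
        warnings.warn(
            "Probabilities sum to %g, not 1: some output cells received no "
            "samples. Consider more input samples/emulation, or the "
            "sample-based density approach." % total
        )
    return total


# Example: one of 16 uniform cells went unmapped -- warns, returns 0.9375.
warn_if_undersampled(np.full(15, 1.0 / 16))
```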
I think I have now confirmed that this is a bug (after wrestling with this issue all day):
using more than 1 cell per dimension to discretize observed densities can yield input probabilities that do not integrate to 1.
What they integrate to is the proportion of bins in the output space that had samples land in them, relative to the total number of bins in the output space.
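That proportionality can be re-created in a toy setting (plain NumPy, not BET itself; the 1-D spaces, bin count, and forward map are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: output space [0, 1] discretized into 4 equal bins, each
# carrying probability 1/4 under a uniform "observed" density.
n_bins = 4
bin_prob = 1.0 / n_bins

# A forward map whose outputs only ever land in the first 3 bins:
outputs = rng.uniform(0.0, 0.75, size=1000)
hit = np.floor(outputs * n_bins).astype(int)

# Distribute each bin's probability evenly over the samples that hit it;
# bins nobody hits contribute nothing (this mimics the reported behavior).
probs = np.zeros(len(outputs))
for b in range(n_bins):
    mask = hit == b
    if mask.any():
        probs[mask] = bin_prob / mask.sum()

hit_bins = len(np.unique(hit))
print(probs.sum())        # ~0.75: the sum of the input probabilities
print(hit_bins / n_bins)  # 0.75: hit bins / total bins -- same proportion
```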
Here's how to recreate the bug easily and see what I mean.

Go to `examples/contaminant/` and change lines 63-65 so that the observed density covers the whole data space and we discretize it with more than 1 cell per dimension. Every input sample will be assigned positive probability, which you can check with

```python
print sum(my_discretization._input_sample_set._probabilities > 0)
print my_discretization._input_sample_set._check_num
```

Now launch IPython, `%cpaste` the first 75 lines of `contaminant.py` (everything up until the plotting), and finally run

```python
print sum(my_discretization._input_sample_set._probabilities)
```

You will get 0.9375 instead of 1 (which is what you get if you run the first 75 lines without the aforementioned change).
In fact, you'll get this same quantity even if you leave `rect_scale=0.25` unchanged and only change `cells_per_dimension=1` to `cells_per_dimension=2` (change it to 3, 4, … and the sum of the probabilities keeps decreasing). I changed `rect_scale` to 1 just so that I knew I would capture the entire data space, but it's obviously not important to the recreation of the bug. (This is a bug, right? Not a feature? Was I wrong to assume `probabilities` is supposed to integrate to 1?)

I proposed a fix in the comments of issue #288 and, with your blessing(s), can go ahead and implement this change (feel free to assign it to me) and submit a pull request. It's not an elegant fix, so I want to make sure there are no objections before going forward with it. I'm fairly sure (but could use confirmation) that this change would need to apply to the other functions in `calculateP` as well.

Alternatively, we can let the probabilities not integrate to 1 (perhaps this was intentional?), say that the result is the probability measure up to a constant of proportionality, and let the user rescale it themselves with something like

```python
my_discretization._input_sample_set._probabilities /= sum(my_discretization._input_sample_set._probabilities)
```

Thoughts? This issue consumed most of my workday today. I hope I actually caught a bug and am not just mistaken in my assumption.