
bug: probabilities not integrating to 1. #292

Open
mpilosov opened this issue Jun 15, 2017 · 6 comments

Comments

@mpilosov
Contributor

mpilosov commented Jun 15, 2017

I think I have now confirmed that this is a bug (after wrestling with it all day):

Using more than 1 cell per dimension to discretize observed densities can yield input probabilities that do not integrate to 1.

Instead, they integrate to the proportion of bins in the output space that had samples land in them, out of the total number of bins in the output space.


Here's how to reproduce the bug easily and see what I mean:
Go to examples/contaminant/

Change lines 63-65 to

simpleFunP.regular_partition_uniform_distribution_rectangle_scaled(
    data_set=my_discretization,
    Q_ref=Q_ref,
    rect_scale=1.0,
    cells_per_dimension=2)

So that the observed density covers the whole data space and we discretize it with more than 1 cell per dimension. Every input sample will be assigned positive probability; you can check this with print(sum(my_discretization._input_sample_set._probabilities > 0)) and print(my_discretization._input_sample_set._check_num).

Now, launch IPython and %cpaste the first 75 lines of contaminant.py (everything up until the plotting).
Finally, run
print(sum(my_discretization._input_sample_set._probabilities))
and you will get 0.9375 instead of 1 (which is what you get if you run the first 75 lines without the aforementioned change).

In fact, you'll get this same quantity even if you leave rect_scale=0.25 unchanged and only change cells_per_dimension=1 to cells_per_dimension=2 (change it to 3, 4, … and the sum of the probabilities keeps decreasing).

I changed rect_scale=1 just so that I knew I would capture the entire data space, but that is not important for reproducing the bug. (This is a bug, right, and not a feature? Was I wrong to assume the probabilities are supposed to integrate to 1?)


I proposed a fix in the comments of issue #288 and, with your blessing(s), can go ahead and implement this change (feel free to assign it to me) and submit a pull request.

It's not an elegant fix, so I want to make sure there are no objections first before going forward with it. I'm fairly sure (but could use confirmation) this change would need to apply to the other functions in calculateP as well.

Alternatively, we can let the probabilities not integrate to 1 (perhaps this was intentional?) and just say that it's the probability measure up to a constant of proportionality, and let the user scale it themselves with something like
my_discretization._input_sample_set._probabilities /= sum(my_discretization._input_sample_set._probabilities)

Thoughts? This issue consumed most of my workday today. I hope I actually caught a bug and am not just mistaken in my assumption.
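For readers without the contaminant example handy, here is a self-contained toy reproduction of the mechanism. Everything in it (the 1-D model, sample values, and cell count) is made up for illustration and uses plain NumPy rather than BET; it just mimics putting uniform probability on output cells and pushing it back onto the samples:

```python
import numpy as np

# Hypothetical toy model (not from BET): 1-D input mapped through q(x) = x**2.
# The sample values are chosen deliberately so no output exceeds 0.49.
samples = np.array([0.1, 0.2, 0.4, 0.6, 0.7])
outputs = samples**2

# Discretize the output space [0, 1] into 4 cells and put uniform
# probability 1/4 on each cell, mimicking cells_per_dimension > 1.
edges = np.linspace(0.0, 1.0, 5)
cell_prob = 0.25

# Assign each input sample the probability of the output cell it lands in,
# split evenly among the samples in that cell (set-valued inversion sketch).
cell_idx = np.clip(np.digitize(outputs, edges) - 1, 0, 3)
probs = np.zeros_like(samples)
for c in range(4):
    in_cell = cell_idx == c
    n = in_cell.sum()
    if n > 0:
        probs[in_cell] = cell_prob / n

# The two cells above 0.5 receive no samples, so their 0.25 of probability
# each is silently dropped and the probabilities sum to about 0.5, not 1.
print(probs.sum())
```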

@mathematicalmichael
Collaborator

@smattis @eecsu this appears to still be a thing.
Can we please address it so I can close the issue?

@smattis
Contributor

smattis commented May 20, 2019

@mathematicalmichael I know this is really old, but I know exactly what is happening here. When no input samples map under the forward map to a cell in the output probability sample set, the probability of that cell is not distributed to the parameter space. If the probabilities sum to something less than one, it really indicates that you are not sampling (or emulating) enough in the input space for the given output probability sample set. In this case, where the dimension of the output space is 4, changing the number of cells per dimension from 1 to 2 changes the relative volume of the cells from 1.0 to (0.5)^4 = 0.0625. In the case above, one of those 16 cells is not getting mapped to, so its probability is not being distributed.
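The arithmetic behind those numbers can be checked in a few lines (a quick sketch, nothing BET-specific):

```python
# Output dimension 4 with cells_per_dimension = 2 gives 2**4 = 16 cells,
# each with relative volume (0.5)**4 = 0.0625 and, for a uniform observed
# density, probability 1/16. One unmapped cell drops 1/16 of the mass,
# leaving 15/16 = 0.9375 -- exactly the sum reported in the issue.
dim = 4
cells_per_dim = 2
n_cells = cells_per_dim**dim
cell_volume = (1.0 / cells_per_dim)**dim
remaining_mass = 1.0 - 1.0 / n_cells
print(n_cells, cell_volume, remaining_mass)
```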

We should probably keep it how it is, but raise a warning that you are undersampling if the probabilities don't sum to 1.
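A minimal sketch of such a check (the helper name and its placement are hypothetical, not current BET API; it could be called at the end of the calculateP routines):

```python
import warnings

import numpy as np


def warn_if_undersampled(probabilities, tol=1e-8):
    """Hypothetical helper: warn when probability mass has been lost."""
    total = float(np.sum(probabilities))
    if total < 1.0 - tol:
        warnings.warn(
            "Probabilities sum to {:g} < 1; the input space is likely "
            "undersampled for the given output probability sample set. "
            "Consider adding samples or reducing "
            "cells_per_dimension.".format(total)
        )
    return total
```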

@smattis
Contributor

smattis commented May 20, 2019

This also shows that with simpleFunP.regular_partition_uniform_distribution_rectangle_scaled, you probably shouldn't use cells_per_dimension > 1 if the dimension of the output space is bigger than 2 or 3.

@mathematicalmichael
Collaborator

@smattis thanks for the explanation. Dimensionality certainly can hit us, eh? =)
I think raising a warning would be a great idea. I came across a few instances in my new branches where similar nuances called for warnings and it really helped me later on when I was using the code to run some experiments. I think WARN is appropriate here?

Perhaps additionally adding a LOG or WARN to address cells_per_dimension > 1, something like "Volume of cells specified may require additional samples. Be aware that probabilities may not sum to unity." and then reinforced with "Probabilities do not integrate to unity. This is likely due to under-sampling in high dimension. Consider adding samples using merge."
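That two-stage flow could be sketched as follows (the function name, logger name, and placement are hypothetical; only the two messages come from the proposal above):

```python
import logging
import warnings

log = logging.getLogger("bet.calculateP")


def check_partition(probabilities, cells_per_dimension, tol=1e-8):
    """Hypothetical wrapper sketching the proposed LOG + WARN flow."""
    if cells_per_dimension > 1:
        # Early heads-up when the requested partition is fine-grained.
        log.warning(
            "Volume of cells specified may require additional samples. "
            "Be aware that probabilities may not sum to unity."
        )
    total = float(sum(probabilities))
    if total < 1.0 - tol:
        # Reinforced afterwards, when mass was actually lost.
        warnings.warn(
            "Probabilities do not integrate to unity. This is likely due "
            "to under-sampling in high dimension. Consider adding samples."
        )
    return total
```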

@eecsu
Contributor

eecsu commented May 20, 2019

This is basically a violation of a discrete/set-valued form of the "predictability assumption." It is roughly equivalent to the expected value of r not being near 1 in the sample-based density approach. We should probably think of a way to phrase that into the warning, suggesting that the user may want to use a different solution approach (e.g., the sampling-based approach instead of the set-valued approach) to avoid this issue.

@mathematicalmichael
Collaborator

mathematicalmichael commented May 20, 2019

@eecsu any thoughts on the phrasing of what I put up? Keep all/part? Or replace entirely with suggestion to use sampling approach (once that gets merged)? I think it’s helpful to explain some context behind why we raise a warning. Can discuss in this thread.
