Stability of sample weighting with sample density #541
Thanks for looking into that. Random unsolicited thought: could we infer the density using a KDE? So give each data point a weight inverse to its distance from the target? (Using a Gaussian or Gaspari-Cohn kernel.) There is probably a nice package for that.
Yes indeed, we could:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
import matplotlib.pyplot as plt

np.random.seed(42)
n_samples = 200
predictor_1 = np.random.normal(0, 1, n_samples)
predictor_2 = np.random.normal(0, 1, n_samples)

kde = KernelDensity(bandwidth=0.5, kernel="gaussian")
kde.fit(np.vstack([predictor_1, predictor_2]).T)

# evaluate density on the samples
density = np.exp(kde.score_samples(np.vstack([predictor_1, predictor_2]).T))

# plot the samples colored by their estimated density
plt.scatter(predictor_1, predictor_2, c=density, cmap="viridis")
plt.colorbar(label="Density")
plt.xlabel("Predictor 1")
plt.ylabel("Predictor 2")
plt.title("2D KDE Density Estimation")
plt.show()
```

However, it seems to be quite a bit slower:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from sklearn.neighbors import KernelDensity

np.random.seed(42)
n_samples = 200
predictor_1 = np.random.normal(0, 1, n_samples)
predictor_2 = np.random.normal(0, 1, n_samples)

def kde(predictor_1, predictor_2):
    data = np.vstack([predictor_1, predictor_2]).T
    estimator = KernelDensity(bandwidth=0.5, kernel="gaussian")
    estimator.fit(data)
    # evaluate density on the samples
    density = np.exp(estimator.score_samples(data))
    weights = 1 / density
    return weights

def hist(predictor_1, predictor_2):
    # combine into a 2D array
    data = np.vstack([predictor_1, predictor_2]).T
    # define bins (like in the original function, which uses 40)
    n_bins_density = 25
    mn, mx = np.nanmin(data, axis=0), np.nanmax(data, axis=0)
    bins = np.linspace(
        (mn - 0.05 * (mx - mn)),
        (mx + 0.05 * (mx - mn)),
        n_bins_density,
    )
    # create a multidimensional histogram
    gmt_hist, edges = np.histogramdd(sample=data, bins=bins.T)
    # calculate the bin centers for interpolation
    gmt_bins_center = [0.5 * (edge[1:] + edge[:-1]) for edge in edges]
    interp = RegularGridInterpolator(
        points=gmt_bins_center,
        values=gmt_hist,
        method="linear",
        bounds_error=False,
        fill_value=None,
    )
    # calculate weights as the inverse of the interpolated density
    weights = 1 / interp(data)
    return weights

%timeit kde(predictor_1, predictor_2)
# > 1.37 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit hist(predictor_1, predictor_2)
# > 72.7 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
I think that would be ok if we do it only once per gridpoint (or maybe even once per model). One potential issue: the scale of the data may not be the same in each dimension, so we may have to scale the data or the bandwidth. But anyway, factoring this out is a good idea in any case.
I think once per model should be enough because the predictors are the same for each gridpoint.
I was thinking in terms of variance; e.g. if we have one predictor that is temperature in K and one that is precip in mm/s or whatnot, which have very different data ranges. (But maybe that's already in there?)
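As a sketch of that concern (not the actual MESMER code; variable names and numbers are made up for illustration): `scipy.stats.gaussian_kde` derives its bandwidth from the sample covariance, so predictors on very different scales are handled without manual rescaling, whereas a single fixed scalar bandwidth, as in the `sklearn` snippet above, would need the data standardized first.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
# two predictors on very different scales, e.g. temperature in K
# and precipitation in mm/s (hypothetical numbers)
temp = rng.normal(288.0, 10.0, 200)
precip = rng.normal(3e-5, 1e-5, 200)

# gaussian_kde expects shape (n_dims, n_samples); its bandwidth is
# scaled by the sample covariance, i.e. per dimension
data = np.vstack([temp, precip])
kde = gaussian_kde(data)

density = kde(data)        # density evaluated at the samples
weights = 1.0 / density    # inverse-density weights
```

With a fixed-bandwidth estimator instead, one would first do `(data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)`.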
Hm, I mean in theory this is what we want, right @yquilcaille? Higher weights for "underrepresented" samples. I would think that decreasing the bin number might do what you want, but it will be hard to find a one-size-fits-all solution, I guess.
True, but could it be an artifact from the widening of the bins beyond the range of the data? (See `mesmer/mesmer/mesmer_x/train_l_distrib_mesmerx.py`, lines 643 to 647 at `2b5da4a`.)
Yeah, you might try with just `mn` and `mx` @l-pierini, let us know how that goes
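A quick way to try that suggestion (a sketch re-creating the histogram recipe from above, not the actual MESMER code): build the bins from the exact data range instead of widening it by 5%. One caveat worth checking: the data minima/maxima then lie outside the outermost bin centers, so those points are linearly extrapolated (`fill_value=None`), and the extrapolated density can become very small near the corners.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

rng = np.random.default_rng(42)
data = rng.normal(0, 1, (200, 2))

n_bins = 25
mn, mx = np.nanmin(data, axis=0), np.nanmax(data, axis=0)
# bins spanning exactly the data range -- no 5% widening beyond it
bins = np.linspace(mn, mx, n_bins)

counts, edges = np.histogramdd(sample=data, bins=bins.T)
centers = [0.5 * (e[1:] + e[:-1]) for e in edges]
interp = RegularGridInterpolator(
    points=centers,
    values=counts,
    method="linear",
    bounds_error=False,
    fill_value=None,  # linearly extrapolate outside the centers
)
density = interp(data)
```

Inspecting `density` at the extreme samples would show whether dropping the widening trades one edge artifact for another.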
I took some time to look into @yquilcaille's `distrib_cov._get_weights_nll` function. The purpose of the function is to estimate the density of predictor samples in the "predictor space", i.e. giving samples that lie in well-sampled regions less weight while giving samples in regions that are not well sampled more weight. I'll illustrate with an example: if we have two predictors, we can plot them against each other and make a 2D histogram, counting the data points in each bin. This already gives some kind of density estimate, showing how many samples fall into each bin (= region) of the sample space. To get a continuous density surface we apply a `RegularGridInterpolator` and evaluate it at the actual sample points. In this way, sample points that fall into the same bin do not necessarily get the same weight but one that corresponds better to the actual density. Below is a visualization of this example.

The upper left panel shows the two predictors plotted against each other with their respective weight in color. The 3D histogram shows the binned data, and the blue elevated surface shows the interpolated density. Now, we see that some points have rather counterintuitive weights: see the point near (1.8, 1.9) that has a very high weight (so supposedly lies in a region of little density) even though it arguably lies in a region of higher density compared to the sample near (0.4, 3.98). Why is that?
I provide here a script for an interactive 3D plot in Python with plotly (it can be run independently from the code above).
When you take a closer look:
[figure "3Dweights2": close-up of the interpolated weight surface]
You see that the red sample in the middle of quite a few other samples falls into a region where the interpolation approaches zero, likely because this sample is at the edge of its bin. So the weights of the samples are probably quite sensitive to the choice of the bins. Here is a plot of the same example as in the first graph, just with 25 instead of the default 40 bins:
Here the weights look more intuitive.
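That sensitivity can be made concrete with a small experiment (a sketch re-implementing the histogram-plus-interpolation recipe; `hist_weights` is just an illustrative name): compute the weights once with 25 and once with 40 bins and compare them per sample.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def hist_weights(data, n_bins):
    # histogram-based inverse-density weights, widened bins as above
    mn, mx = np.nanmin(data, axis=0), np.nanmax(data, axis=0)
    bins = np.linspace(mn - 0.05 * (mx - mn), mx + 0.05 * (mx - mn), n_bins)
    counts, edges = np.histogramdd(sample=data, bins=bins.T)
    centers = [0.5 * (e[1:] + e[:-1]) for e in edges]
    interp = RegularGridInterpolator(
        points=centers, values=counts, method="linear",
        bounds_error=False, fill_value=None,
    )
    return 1.0 / interp(data)

rng = np.random.default_rng(42)
data = rng.normal(0, 1, (500, 2))

w25 = hist_weights(data, 25)
w40 = hist_weights(data, 40)
# how strongly do individual weights depend on the bin choice?
corr = np.corrcoef(w25, w40)[0, 1]
```

A correlation between `w25` and `w40` well below 1 would confirm that individual weights move around substantially with the bin choice.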
Another issue I want to raise is that grouping the data into rectangular bins can lead to unintuitive weights too; see an example of linearly dependent predictors (which we don't expect, but still):
[figure: weights for two linearly dependent predictors]
All in all, I think for now we should 1) factor this function out of the `__init__` of the distribution class and give the user more control over the choice of the bins, and 2) put a disclaimer into the function description that the weights might be sensitive to the bin choice and add a plotting example so the user can easily inspect the weights like I did above. However, for more than 2 predictors it would get quite hard to make a nice visualization...