Knn and nystrom #143
Conversation
This reverts commit d541aeb.
Fine! The KNN code is now 99% complete, but there remains quite a lot of work to be done on the Nystroem side. We'll talk about it tomorrow: see you soon :-)
pykeops/common/nystrom_generic.py
GenericLazyTensor = TypeVar("GenericLazyTensor")


class GenericNystrom:
Following the scikit-learn convention, we should probably call everything "Nystroem" instead of "Nystrom" (from classes to file names).
yaniel
pykeops/common/nystrom_generic.py
n_components = how many samples to select from data.
kernel = type of kernel to use. Current options = {rbf: Gaussian, exp: exponential}.
sigma = exponential constant for the RBF and exponential kernels.
eps = size for square bins in block-sparse preprocessing.
k_means = number of centroids for KMeans algorithm in block-sparse preprocessing.
n_iter = number of iterations for KMeans.
dtype = type of data: np.float32 or np.float64.
inv_eps = additive invertibility constant for matrix decomposition.
verbose = set True to print details.
random_state = to set a random seed for the random sampling of the samples. To be used when reproducibility is needed.
Please note that we follow the Google docstring convention, which is documented here.
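For reference, the parameter list above could be rewritten in the Google docstring style along these lines (a sketch only; the names and descriptions are taken from the quoted snippet, the types are illustrative):

```python
class GenericNystrom:
    """Generic Nystroem approximation (docstring sketch in Google style).

    Args:
        n_components (int): Number of samples to select from the data.
        kernel (str): Kernel to use, one of {"rbf": Gaussian, "exp": exponential}.
        sigma (float): Exponential constant for the RBF and exponential kernels.
        eps (float): Size of the square bins in block-sparse preprocessing.
        k_means (int): Number of centroids for the KMeans algorithm in
            block-sparse preprocessing.
        n_iter (int): Number of iterations for KMeans.
        dtype (type): Type of the data, np.float32 or np.float64.
        inv_eps (float): Additive invertibility constant for the matrix
            decomposition.
        verbose (bool): Set to True to print details.
        random_state (int): Seed for the random sampling of the samples,
            to be used when reproducibility is needed.
    """
```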
pykeops/common/nystrom_generic.py
self.tools = None
self.LazyTensor = None

self.device = "cuda" if pykeops.config.gpu_available else "cpu"
Shouldn't we use the device that is originally used by the input data? I don't know how `self.device = "cuda"` would handle multi-GPU configurations.
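One device-agnostic option could look like this (a sketch; the `infer_device` helper is hypothetical and not part of the PR):

```python
def infer_device(x, default="cpu"):
    # Hypothetical helper: follow the device that the input data already
    # lives on. Torch tensors expose a .device attribute; NumPy arrays do
    # not, so we fall back to the given default.
    return getattr(x, "device", default)
```

`self.device = infer_device(x)` would then keep the computation on the tensor's own GPU, which also behaves more predictably on multi-GPU machines than a hard-coded `"cuda"`.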
yaniel
pykeops/common/nystrom_generic.py
self.verbose = verbose
self.random_state = random_state
self.tools = None
self.LazyTensor = None
As detailed in the previous review, I think that using `self.LazyTensor` as an attribute name is a bit dangerous. You could use e.g. `self.lazy_tensor` instead?
pykeops/common/nystrom_generic.py
# Set default sigma
# if self.sigma is None and self.kernel == 'rbf':
if self.sigma is None:
    self.sigma = np.sqrt(x.shape[1])
Is this the scikit-learn default? I would have expected a set value, or something like 10% of the diameter of the data.
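The "10% of the diameter of the data" heuristic might be sketched as follows (the `default_sigma` name and the 0.1 fraction are illustrative, not the PR's code):

```python
import numpy as np


def default_sigma(x, fraction=0.1):
    # Hypothetical default bandwidth: a fraction of the data's diameter,
    # estimated here by the diagonal of the axis-aligned bounding box.
    diameter = np.linalg.norm(x.max(axis=0) - x.min(axis=0))
    return fraction * diameter
```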
done
pykeops/torch/nystrom/nystrom.py
self.K_nq = K_nq  # dim: number of samples x num_components
self.K_nq.backend = "GPU_2D"
self.normalization = normalization
As with NumPy, "GPU_2D" should only be applied on the "transposed" term.
yaniel
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
For the sake of consistency, it would be better to use the names `x_i`, `y_j`, `D_ij`, `indices` and add the expected array shapes in the comments.
done
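As a plain-NumPy illustration of the suggested naming and shape comments (a sketch only; the actual code uses pykeops LazyTensors on the GPU):

```python
import numpy as np


def knn_bruteforce(x, y, k):
    # Brute-force KNN with the suggested naming convention:
    x_i = x[None, :, :]                        # (1, N, D) source points
    y_j = y[:, None, :]                        # (M, 1, D) query points
    D_ij = ((x_i - y_j) ** 2).sum(-1)          # (M, N) squared distances
    indices = np.argsort(D_ij, axis=1)[:, :k]  # (M, k) nearest neighbours
    return indices
```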
def accuracy(indices_test, indices_truth):
    """
    Compares the test and ground truth indices (rows = KNN for each point in dataset)
    Returns accuracy: proportion of correct nearest neighbours
    """
    N, k = indices_test.shape

    # Calculate number of correct nearest neighbours
    accuracy = 0
    for i in range(k):
        accuracy += torch.sum(indices_test == indices_truth).float() / N
        indices_truth = torch.roll(
            indices_truth, 1, -1
        )  # Create a rolling window (index positions may not match)
    accuracy = float(accuracy / k)  # percentage accuracy

    return accuracy
Since we use this function in several tutorials, shouldn't we factor it somewhere in the utils module?
done for torch utils
and for numpy utils
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
Since this code is used above, shouldn't we factor it in a function to avoid copy-pastes?
done
########################################################################
# NNDescent search using clusters and Manhattan distance
# Second experiment with N=$10^6$ points in dimension D=3, with 5 nearest neighbors and Manhattan distance.
Couldn't we factor the timings/display code above in a function, and call it again with a different metric function instead of copy-pasting it below?
done
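Such a factored timing helper might look like this (hypothetical sketch; the function name and signature are illustrative, not what the PR commits):

```python
import time


def benchmark_knn(knn_fn, x, y, k, label=""):
    # Hypothetical timing helper: run one KNN routine, report its
    # wall-clock time, and return the result, so the same code can be
    # reused with different metric functions (Euclidean, Manhattan, ...).
    start = time.perf_counter()
    indices = knn_fn(x, y, k)
    elapsed = time.perf_counter() - start
    if label:
        print(f"{label}: {elapsed:.3f} s")
    return indices, elapsed
```

The Euclidean and Manhattan experiments would then each be a single call, e.g. `benchmark_knn(knn_manhattan, x, y, 5, label="Manhattan")`.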
added accuracy to torch tools
shifted accuracy to torch.utils
factorized timing function
First request