
Knn and nystrom #143

Open · huddyyeo wants to merge 116 commits into main
Conversation

huddyyeo (Author):

First request

@jeanfeydy self-requested a review on March 2, 2021 at 17:35
@jeanfeydy self-assigned this on March 2, 2021
@jeanfeydy added the "enhancement" (New feature or request) label on March 2, 2021
@jeanfeydy self-requested a review on April 22, 2021 at 15:50
@jeanfeydy (Contributor) left a comment:

Fine! The KNN code is now 99% complete, but there remains quite a lot of work to be done on the Nystroem side. We'll talk about it tomorrow: see you soon :-)

from typing import TypeVar

GenericLazyTensor = TypeVar("GenericLazyTensor")


class GenericNystrom:
@jeanfeydy (Contributor):

Following the scikit-learn convention, we should probably call everything "Nystroem" instead of "Nystrom" (from classes to file names).

Reply:

yaniel

Comment on lines 31 to 43
n_components = how many samples to select from data.
kernel = type of kernel to use. Current options = {rbf: Gaussian,
    exp: exponential}.
sigma = exponential constant for the RBF and exponential kernels.
eps = size for square bins in block-sparse preprocessing.
k_means = number of centroids for KMeans algorithm in block-sparse
    preprocessing.
n_iter = number of iterations for KMeans.
dtype = type of data: np.float32 or np.float64.
inv_eps = additive invertibility constant for matrix decomposition.
verbose = set True to print details.
random_state = to set a random seed for the random sampling of the samples.
    To be used when reproducibility is needed.
@jeanfeydy (Contributor):

Please note that we follow the Google docstring convention, which is documented here.
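
For illustration, a minimal sketch of what this docstring could look like in Google style (parameter names and descriptions are taken from the snippet above):

class GenericNystrom:
    """Nystroem approximation of a kernel matrix.

    Args:
        n_components (int): Number of samples to select from the data.
        kernel (str): Kernel type, one of {"rbf": Gaussian, "exp": exponential}.
        sigma (float): Bandwidth constant for the RBF and exponential kernels.
        eps (float): Size of the square bins in block-sparse preprocessing.
        k_means (int): Number of centroids for the KMeans algorithm in
            block-sparse preprocessing.
        n_iter (int): Number of iterations for KMeans.
        dtype (type): Data type, np.float32 or np.float64.
        inv_eps (float): Additive invertibility constant for the matrix
            decomposition.
        verbose (bool): If True, print details.
        random_state (int, optional): Seed for the random sampling of the
            samples, to be used when reproducibility is needed.
    """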

self.tools = None
self.LazyTensor = None

self.device = "cuda" if pykeops.config.gpu_available else "cpu"
@jeanfeydy (Contributor):

Shouldn't we use the device that is used originally by the input data? I don't know how self.device = "cuda" would handle multi-GPU configurations.

Reply:

yaniel
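
A minimal sketch of the device inference the review suggests, assuming torch tensor inputs (the helper name is hypothetical):

import torch

def _infer_device(x):
    # Use the device of the input data rather than a global "cuda" default,
    # so that multi-GPU setups keep the computation on the tensor's own
    # device (e.g. cuda:1 for a tensor that lives on the second GPU).
    if isinstance(x, torch.Tensor):
        return x.device
    return torch.device("cpu")  # NumPy arrays live on the CPU

fit() could then set self.device = _infer_device(x) instead of a hard-coded default.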

self.verbose = verbose
self.random_state = random_state
self.tools = None
self.LazyTensor = None
@jeanfeydy (Contributor):

As detailed in the previous review, I think that using self.LazyTensor as an attribute name is a bit dangerous. You could use e.g. self.lazy_tensor instead?

# Set default sigma
# if self.sigma is None and self.kernel == 'rbf':
if self.sigma is None:
    self.sigma = np.sqrt(x.shape[1])
@jeanfeydy (Contributor):

Is this the scikit-learn default? I would have expected a set value, or something like 10% of the diameter of the data.

Reply:

done
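
A sketch of the "fraction of the diameter" default suggested above (the 10% factor is the reviewer's example, not an established constant):

import numpy as np

def default_sigma(x, fraction=0.1):
    # Estimate the diameter of the point cloud from its bounding box,
    # then use a fixed fraction of it as the kernel bandwidth.
    diameter = np.linalg.norm(x.max(axis=0) - x.min(axis=0))
    return fraction * diameter

For reference, scikit-learn's Nystroem defaults to gamma = 1 / n_features for the RBF kernel, which corresponds to sigma = sqrt(n_features / 2) under the K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) convention.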

Comment on lines 105 to 107
self.K_nq = K_nq # dim: number of samples x num_components
self.K_nq.backend = "GPU_2D"
self.normalization = normalization
@jeanfeydy (Contributor):

As with NumPy, "GPU_2D" should only be applied on the "transposed" term.

Reply:

yaniel

Comment on lines 57 to 60
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
@jeanfeydy (Contributor):

For the sake of consistency, it would be better to use the names x_i, y_j, D_ij, indices and add the expected array shapes in the comments.

Reply:

done
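
A sketch of the convention the review asks for, with the expected shapes in the comments (here, indices[i] holds the k nearest rows of y for each x[i]):

x_i = LazyTensor(x[:, None, :].to(device))  # (N, 1, D) "i" variable
y_j = LazyTensor(y[None, :, :].to(device))  # (1, M, D) "j" variable
D_ij = ((x_i - y_j) ** 2).sum(-1)           # (N, M) symbolic squared distances
indices = D_ij.argKmin(K=k, dim=1).long()   # (N, k) nearest-neighbor indices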

Comment on lines 66 to 82
def accuracy(indices_test, indices_truth):
    """
    Compares the test and ground truth indices (rows = KNN for each point in the dataset).
    Returns accuracy: the proportion of correct nearest neighbours.
    """
    N, k = indices_test.shape

    # Count the number of correct nearest neighbours
    accuracy = 0
    for _ in range(k):
        accuracy += torch.sum(indices_test == indices_truth).float() / N
        # Roll the truth indices, since index positions may not match
        indices_truth = torch.roll(indices_truth, 1, -1)
    accuracy = float(accuracy / k)  # fraction of correct neighbours

    return accuracy
@jeanfeydy (Contributor):

Since we use this function in several tutorials, shouldn't we factor it somewhere in the utils module?

Reply:

done for torch utils

@huddyyeo (Author):

and for numpy utils

Comment on lines 98 to 101
x_LT = LazyTensor(x.unsqueeze(0).to(device))
y_LT = LazyTensor(y.unsqueeze(1).to(device))
d = ((x_LT - y_LT) ** 2).sum(-1)
true_nn = d.argKmin(K=k, dim=1).long()
@jeanfeydy (Contributor):

Since this code is used above, shouldn't we factor it into a function to avoid copy-pastes?

Reply:

done
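
A minimal sketch of such a factored helper (the name ground_truth_knn is hypothetical), assuming PyTorch tensors and the pykeops.torch frontend:

from pykeops.torch import LazyTensor

def ground_truth_knn(x, y, k, device="cuda"):
    # Brute-force k-nearest neighbors among the rows of x, for each row
    # of y; returns an (M, k) tensor of indices into x.
    x_j = LazyTensor(x.unsqueeze(0).to(device))  # (1, N, D)
    y_i = LazyTensor(y.unsqueeze(1).to(device))  # (M, 1, D)
    D_ij = ((y_i - x_j) ** 2).sum(-1)            # (M, N) symbolic distances
    return D_ij.argKmin(K=k, dim=1).long()

Both copy-pasted blocks would then reduce to true_nn = ground_truth_knn(x, y, k, device).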

Comment on lines 114 to 116
########################################################################
# NNDescent search using clusters and Manhattan distance
# Second experiment with N=$10^6$ points in dimension D=3, with 5 nearest neighbors and the Manhattan distance.
@jeanfeydy (Contributor):

Couldn't we factor the timings/display code above into a function, and call it again with a different metric function instead of copy-pasting it below?

Reply:

done
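
A sketch of what the factored benchmark could look like (the function and argument names are hypothetical; accuracy is the helper factored into the utils module above):

import time

def benchmark_knn(search_fn, x, y, k, metric, ground_truth):
    # Time one KNN search routine and report its recall
    # against the brute-force ground truth.
    start = time.time()
    indices = search_fn(x, y, k=k, metric=metric)
    elapsed = time.time() - start
    acc = accuracy(indices, ground_truth)
    print(f"{metric} metric: {elapsed:.3f}s, accuracy {100 * acc:.1f}%")
    return indices

The same routine can then be called once per metric (e.g. "euclidean" and "manhattan") instead of duplicating the timing/display code.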
