NearestNeighbors.jl
is a package written in Julia to perform high performance nearest neighbor searches in
arbitrarily high dimensions.
The abstract tree type that the trees in this package are a subtype of is called a NNTree
. A NNTree
is created by the function:
NNTree(data, metric; leafsize, reorder)
data
: The data, i.e., the points to build up the tree from. It can either be- a matrix of size
nd × np
with the points to insert in the tree wherend
is the dimensionality of the points andnp
is the number of points - a vector of vectors with fixed dimensionality,
nd
, which must be part of the type. Specifically,data
should be aVector{V}
, whereV
is itself a subtype of anAbstractVector
and such thateltype(V)
andlength(V)
are defined. (For example, with 3D points,V = SVector{3, Float64}
works becauseeltype(V) = Float64
andlength(V) = 3
are defined inV
.)
- a matrix of size
metric
: The metric to use, defaults toEuclidean
. This is one of theMetric
types defined in theDistances.jl
packages. It is possible to define your own metrics by simply creating new types that are subtypes ofMetric
.leafsize
(keyword argument): Determines at what number of points to stop splitting the tree further. There is a trade-off between traversing the tree and having to evaluate the metric function for increasing number of points.reorder
(keyword argument): While building the tree this will put points close in distance close in memory since this helps with cache locality. In this case, a copy of the original data will be made so that the original data is left unmodified. This can have a significant impact on performance and is by default set totrue
.
There are currently three types of trees available:
BruteTree
: Not actually a tree. It linearly searches all points in a brute force fashion. Works with anyMetric
.KDTree
: In a kd tree the points are recursively split into groups using hyper-planes. Therefore aKDTree
only works with axis aligned metrics which are:Euclidean
,Chebyshev
,Minkowski
andCityblock
.BallTree
: Points are recursively split into groups bounded by hyper-spheres. Works with anyMetric
.
All trees in NearestNeighbors.jl
are static which means that points can not be added or removed from an already created tree.
Here are a few examples of creating trees:
using NearestNeighbors
data = rand(3, 10^4)
# Create trees
kdtree = KDTree(data; leafsize = 10)
balltree = BallTree(data, Minkowski(3.5); reorder = false)
brutetree = BruteTree(data)
A kNN search finds the k
nearest neighbors to given point(s).
This is done with the method:
knn(tree, points, k, sortres = false, skip = always_false) -> idxs, dists
tree
: The tree instancepoints
: A vector or matrix of points to find thek
nearest neighbors to. Ifpoints
is a vector of numbers then this represents a single point, ifpoints
is a matrix then thek
nearest neighbors to each point (column) will be computed.points
can also be a vector of other vectors where each element in the outer vector is considered a point.sortres
(optional): Determines if the results should be sorted before returning. In this case the results will be sorted in order of increasing distance to the point.skip
(optional): A predicate to determine if a given point should be skipped, for example if iterating over points and a point has already been visited.
It is generally better for performance to query once with a large number of points than to query multiple times with one point per query.
As a convenience, if you only want the closest nearest neighbor, you can call nn
instead for a cleaner result:
nn(tree, points, skip = always_false) -> idxs, dists
Some examples:
using NearestNeighbors
data = rand(3, 10^4)
k = 3
point = rand(3)
kdtree = KDTree(data)
idxs, dists = knn(kdtree, point, k, true)
idxs
# 3-element Array{Int64,1}:
# 4683
# 6119
# 3278
dists
# 3-element Array{Float64,1}:
# 0.039032201026256215
# 0.04134193711411951
# 0.042974090446474184
# Multiple points
points = rand(3, 4);
idxs, dists = knn(kdtree, points, k, true);
idxs
# 4-element Array{Array{Int64,1},1}:
# [3330, 4072, 2696]
# [1825, 7799, 8358]
# [3497, 2169, 3737]
# [1845, 9796, 2908]
# dists
# 4-element Array{Array{Float64,1},1}:
# [0.0298932, 0.0327349, 0.0365979]
# [0.0348751, 0.0498355, 0.0506802]
# [0.0318547, 0.037291, 0.0421208]
# [0.03321, 0.0360935, 0.0411951]
# Static vectors
v = @SVector[0.5, 0.3, 0.2];
idxs, dists = knn(kdtree, v, k, true);
idxs
# 3-element Array{Int64,1}:
# 842
# 3075
# 3046
dists
# 3-element Array{Float64,1}:
# 0.04178677766255837
# 0.04556078331418939
# 0.049967238112417205
A range search finds all neighbors within the range r
of given point(s).
This is done with the method:
inrange(tree, points, r, sortres = false) -> idxs
Note that for performance reasons the distances are not returned. The arguments to inrange
are the same as for knn
except that sortres
just sorts the returned index vector.
An example:
using NearestNeighbors
data = rand(3, 10^4)
r = 0.05
point = rand(3)
balltree = BallTree(data)
idxs = inrange(balltree, point, r, true)
# 4-element Array{Int64,1}:
# 317
# 983
# 4577
# 8675
By default, the trees store a copy of the data
provided during construction. This is problematic in case you want to work on data sets which are larger than available memory, thus wanting to mmap
the data or want to store the data in a different place, outside the tree.
DataFreeTree
can be used to strip a constructed tree of its data field and re-link it with that data at a later stage. An example of using a large on-disk data set looks like this:
using Mmap
ndim = 2; ndata = 10_000_000_000
data = Mmap.mmap(datafilename, Matrix{Float32}, (ndim, ndata))
data[:] = rand(Float32, ndim, ndata) # create example data
dftree = DataFreeTree(KDTree, data)
dftree
now only stores the indexing data structures. It can be passed around, saved and reloaded independently of data
.
To perform look-ups, dftree
is re-linked to the underlying data:
tree = injectdata(dftree, data) # yields a KDTree
knn(tree, data[:,1], 3) # perform operations as usual
Kristoffer Carlsson - @KristofferC - [email protected]