Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ Improvements - API enhancement and increase testing #85

Merged
merged 8 commits into from
Sep 10, 2024

Conversation

stephen29xie
Copy link
Contributor

@stephen29xie stephen29xie commented Aug 26, 2024

Changes Made

C++

  • Add overloaded Index::addItems and Index::query methods that accept 2D vector of floats (std::vector<std::vector<float>>) instead of an NDArray.
    • When the Java and Python bindings add items or query, they convert the Python and Java objects to a 2-D NDArray. The NDArray is abstracted. We should have an option to abstract that in the C++ interface too to maintain ease-of-use and intuitiveness, especially If Voyager is used as a standalone C++ library (cpp wrappers are using default namespace #52).
    • This is used in the unit test when we add items or query the index
  • Add C++ tests
    • A C++ test framework was added recently in C++ improvements #63 but there is only some basic Index property checking
    • These additional tests generate random vectors and add them to an index. Each of these vectors is then used as a target vector in a query. We expect to see that the ANN distance is ~0 (allowing for some precision error based on storage type), and that the ANN is itself.
    • We technically still one have only one test. We don't have support for test parameterization yet. As a workaround we use nested loops to enumerate and test different combinations of parameters.
    • By nature of HNSW, the test is nondeterministic. Even with considerations for lower precision data storage types, it is possible that a test run will fail because we are testing that the true NN is equal to the ANN.

Testing

Added C++ tests

Related Issues

In these tests, we query with k = 1 (single nearest neighbour). If you change it or add an additional test with k = num_vectors_in_index, you can sporadically reproduce this issue: #38

Checklist

  • My code follows the code style of this project.
  • I have added and/or updated appropriate documentation (if applicable).
  • All new and existing tests pass locally with these changes.
  • I have run static code analysis (if available) and resolved any issues.
  • I have considered backward compatibility (if applicable).
  • I have confirmed that this PR does not introduce any security vulnerabilities.

Additional Comments

@stephen29xie stephen29xie changed the title [WIP] C++ Improvements C++ Improvements - API enhancement and increase testing Aug 29, 2024
Comment on lines +7 to +8
# Add compiler flags
target_compile_options(VoyagerTests PRIVATE -g)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-g flag builds executable with debugging symbols to use with a debugger

@stephen29xie stephen29xie marked this pull request as ready for review August 29, 2024 16:22
Comment on lines 321 to 325
// flatten the 2d array into the NDArray's underlying 1D vector
std::vector<float> flatArray;
for (const auto &vector : vectors) {
flatArray.insert(flatArray.end(), vector.begin(), vector.end());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more memory-efficient way of doing this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not memory-efficient, but time-efficient, yes: we should pre-allocate the space in flatArray by calculating numVectors * dimensions, then doing std::memcpy into that buffer for each vector.

What we have here will resize (and potentially reallocate) the vector on each .insert, which makes this O(n2) as .insert is O(n) rather than O(1).

float precisionTolerance) {
// create test data and ids
std::vector<std::vector<float>> inputData =
randomVectors(numVectors, numDimensions);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From meeting:

add conditional statement
if storageType = float32, then randomVectors
if storageType = float8 or e4m3, then randomQuantizedVectors

Copy link
Contributor

@markkohdev markkohdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Left some comments but nice work on this so far :)

/**
* Convert a 2D vector of float to NDArray<float, 2>
*/
NDArray<float, 2> vectorsToNDArray(std::vector<std::vector<float>> vectors) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an explicit unit test for this


#include "array_utils.h"

NDArray<float, 2> randomQuantizedVectorsNDArray(int numVectors,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove these functions and instead just use the randomQuantizedVectors and randomVectors methods

int dimensions = numVectors > 0 ? vectors[0].size() : 0;
std::array<int, 2> shape = {numVectors, dimensions};

// flatten the 2d array into the NDArray's underlying 1D vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than iterating over the outer and inner vectors, we should try to utilize the underlying data access that std::vector provides. Be careful with this though as there may be caveats around vector memory allocation boundaries and actual vector item counts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still iterating over each vector because I added a check to validate that each vector size is identical. But now I am preallocating the space I need in the flattened array to avoid resizing, and using std::memcpy to do the copying

Copy link
Contributor

@markkohdev markkohdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for fixing this up! Looks good to me
🤠 🐴 🏇

@stephen29xie stephen29xie merged commit 88cfc46 into main Sep 10, 2024
57 checks passed
@stephen29xie stephen29xie deleted the stephenx/cpp-improvements branch September 10, 2024 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants