-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Add sparse index support to knowhere #199
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zhengbuqian The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
… added related unit test Signed-off-by: Buqian Zheng <[email protected]>
5f21faa
to
26a6b8e
Compare
Signed-off-by: Buqian Zheng <[email protected]>
include/knowhere/utils.h
Outdated
} | ||
table_t | ||
pop() { | ||
std::pop_heap(pool_.begin(), pool_.begin() + size_--); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please
std::pop_heap(pool_.begin(), pool_.begin() + size_);
size_ -= 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dist += it->second * data[k]; | ||
} | ||
} | ||
if (dist > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code does not distinguish two very different situations:
dist=0
because no item was found inquery
dist=0
because the match is exact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only IP is supported thus dist=0
doesn't mean exact match?
src/common/comp/brute_force.cc
Outdated
int result_size = heap.size(); | ||
for (int64_t j = result_size - 1; j >= 0; --j) { | ||
cur_labels[j] = heap.top().id; | ||
cur_distances[j] = -heap.top().distance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
working with negative distances twice... I'd rather change MinMaxHeap
implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
std::sort(q_vec.begin(), q_vec.end(), | ||
[](const auto& lhs, const auto& rhs) { return std::abs(lhs.second) > std::abs(rhs.second); }); | ||
while (q_vec.size() && q_vec[0].second * drop_ratio_search > q_vec.back().second) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (q_vec.size() && ...)
-> while (!q_vec.empty() && ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
private: | ||
[[nodiscard]] float | ||
dot_product(std::unordered_map<IndicesT, T> q_map, table_t u) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dot_product(const std::unordered_map<IndicesT, T>& q_map, table_t u) const {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
|
||
void | ||
refine_and_collect(std::unordered_map<IndicesT, T>& q_map, MinMaxHeap<T> inaccurate, size_t k, float* distances, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void refine_and_collect(const std::unordered_map<IndicesT, T>& q_map, MinMaxHeap<T>& inaccurate, size_t k, float* distances, label_t* labels) const
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if (refine_factor == 1) { | ||
collect_result(heap, distances, labels); | ||
} else { | ||
std::unordered_map<IndicesT, T> q_map; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the size of the q_map
is known beforehand (it is len
). So, please construct q_map
with the provided number of buckets (I'd try 4 * len
as a start), so
std::unordered_map<IndicesT, T> q_map(4 * len);
The number of buckets impacts the performance of the following dot_product()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or I can tweak the performance as soon as this commit lands, there are many things to play with there :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I am using 4*len as a start and leaving a TODO. Focusing on the functionality for this PR.
|
||
template <typename HeapType> | ||
void | ||
collect_result(HeapType heap, float* distances, label_t* labels) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HeapType& heap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
} | ||
Cursor(const Cursor& rhs) = delete; | ||
Cursor(Cursor&& rhs) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't Cursor(Cursor&& rhs) noexcept = default;
work?
Same question for Cursor& operator=(Cursor&& rhs) noexcept
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am changing lut_
to be a const ref member, move constructor, move assignment, copy assignment will now be deleted, and I am still explicitly deleting the copy constructor as we don't currently have a usage.
swap(lhs.q_value_, rhs.q_value_); | ||
// all cursors share the same bitset | ||
} | ||
const std::vector<Neighbor<T>>* lut_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const std::vector<Neighbor<T>>* lut_ = nullptr
;
Also, using vector<>*
leads to double pointer dereferencing, unless the compiler optimizes this away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I am removing the swap method and changing lut_
to const std::vector<Neighbor<T>>&
.
size_t num_vec_ = 0; | ||
float max_score_ = 0.0f; | ||
float q_value_ = 0.0f; | ||
const BitsetView* bitset_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BitsetView is a lightweight structure (a wrapper around the pointer, basically) that should be passed by value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
…to const ref; update multiple InvertedIndex member functions to accept reference type for container types; addressed other review comments Signed-off-by: Buqian Zheng <[email protected]>
the functionality seems to be fine, but what about the unit tests running time, is it expected? |
|
||
private: | ||
[[nodiscard]] float | ||
dot_product(const std::unordered_map<IndicesT, T>& q_map, table_t u) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::unordered_map seems with a bad performance as I tested. It's better to use two arrays indices
and data
…compute refined distance. Signed-off-by: Buqian Zheng <[email protected]>
I added some large unit tsets so ut took quite long, I have removed them. Will revert the change in ut yaml before merge. |
Signed-off-by: Buqian Zheng <[email protected]>
/run-e2e |
run-e2e |
/lgtm overall |
/run-e2e |
1 similar comment
/run-e2e |
/hold |
Use #341 instead |
/kind feature
#193
This PR adds the basic framework of supporting sparse index in knowhere, with 2 index types: inverted index and wand.
In this PR only Search is supported.
Support for growing segment is sub-optimal, we naively double the underlying data structure each time we used up the current space.
Added basic unit tests: