Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add benchmark for tfidf #239

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

GoWind
Copy link
Contributor

@GoWind GoWind commented Nov 23, 2024

Add a vanilla Rust program for calculating tfidf scores and calculating the top 10 similar documents

We use a cosine similarity for calculating relevancy of documents for a given query.

@ashvardanian , thanks for your time !
Wanted to check with you on the approach to see if it makes sense to you as a valid benchmark

Use the leipzig1m and the § XL Sum datasetas corpuses. I assumes that each 10000 lines in theleipzig1m` dataset is a single document and then calculate the tf-idf scores for each (term, document) in the corpus.

  1. Given a query, calculate the tf-idf score of a given query based on the same corpus (and assume the query to be a separate document)

  2. The next step is to calculate the cosine score. I couldn't fine a good source for where I can calculate the score as a vector for a query or a document (Claude gave me a fn that I could possibly use)

  3. Once we have a cosine similarity score, sort and fetch top 10.

For benchmarking SImSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark it against the vanilla implementation ?
And for the query, in memchr vs stringzilla you basically pick a random set of tokens and then benchmark searching from left and right using memchr and stringzilla . I was thinking of doing something similar by picking terms at random from the corpus and constructing random queries to benchmark.

Add a benchmark for calculating tfidf scores and calculating the top 10 similar
documents

We use a cosine similarity for calculating relevancy of documents for a given query
@ashvardanian
Copy link
Owner

Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TFIDF implementation?

@GoWind
Copy link
Contributor Author

GoWind commented Nov 23, 2024

@ashvardanian , not yet, I am trying to figure out what could be the ideal benchmark (and also learning how to use TF-IDF)

In scikitlearn, the TfIdfVectorizer creates an 2d array for the documents where each row is the document, and there is a column for each "token" in the corpus and
array[row][column] = frequency of the token in the document

Similarly, we can tokenize the query and run intersect between the query (as a row vector) and each document in our corpus to get the intersection size.
the intersection would be if the frequency of every unique token in the query matches the frequency of the term in the document (something like using simsimd_intersect ). Is that what you had in mind using sparse kernels ?

@ashvardanian
Copy link
Owner

@GoWind, the intersect function may not be the only one you need. Also look into: spdot_counts and spdot_weights 🤗

@GoWind
Copy link
Contributor Author

GoWind commented Nov 23, 2024

will do. Thanks for the pointers ! Also reading through the scikit implementation to see how I can possibly do this :)

@GovindarajanNagarajan-TomTom
Copy link
Contributor

Hi @ashvardanian , making progress on the tfidf based similarity calculator.
I noticed some discrepancy when calculating the cosine similarity for vec of f64s via the Rust bindings
The plain rust cosine calculations match the values I get both from numpy and from the simsimd bindings via Python
For the implementation I tried to compute a vector of tfidf_values per query and then compute a cosine similarity between the query and each document , based on the answers on the SO question

Here is how i prepared the script (added a few hacks to test quickly, will update the scripts to a be [[bench]] in the subsequent commits)

head -n 10000 leipzig1m.txt > leipzig10000.txt
cargo run --bin tfidf leipzig10000.txt

I took the first 10k lines and batched them into 10 document of 1k lines each.
The query (hardcoded) in the script is transformed into a vector representation and I compute the cosine similarities

Similarity for document via simsimd 0: Some(0.5928754241421308)
Similarity for document via plain cosine similarity 0: Some(0.40712457585786876)
Similarity for document via simsimd 1: Some(0.5993249839897541)
Similarity for document via plain cosine similarity 1: Some(0.4006750160102468)
Similarity for document via simsimd 2: Some(0.5914559242162761)
Similarity for document via plain cosine similarity 2: Some(0.408544075783724)
Similarity for document via simsimd 3: Some(0.5998267820476098)
Similarity for document via plain cosine similarity 3: Some(0.40017321795239075)
Similarity for document via simsimd 4: Some(0.5906444006555799)
Similarity for document via plain cosine similarity 4: Some(0.40935559934442023)
Similarity for document via simsimd 5: Some(0.5902192553116478)
Similarity for document via plain cosine similarity 5: Some(0.4097807446883521)
Similarity for document via simsimd 6: Some(0.5943923707602529)
Similarity for document via plain cosine similarity 6: Some(0.4056076292397477)
Similarity for document via simsimd 7: Some(0.6028015678055032)
Similarity for document via plain cosine similarity 7: Some(0.3971984321944968)
Similarity for document via simsimd 8: Some(0.5957380843868555)
Similarity for document via plain cosine similarity 8: Some(0.4042619156131435)
Similarity for document via simsimd 9: Some(0.5913356879962984)
Similarity for document via plain cosine similarity 9: Some(0.4086643120037022)

Not sure if I am doing something wrong, but could there be some sort of discrepancy here ?

@ashvardanian
Copy link
Owner

Seems like one is x and the other is 1-x. One is a similarity score and the other is a distance.

@GoWind
Copy link
Contributor Author

GoWind commented Nov 27, 2024

Ah, I see. now it makes sense


SIMSIMD_INTERNAL simsimd_distance_t _simsimd_cos_normalize_f64_neon(simsimd_f64_t ab, simsimd_f64_t a2,
                                                                    simsimd_f64_t b2) {
    if (a2 == 0 && b2 == 0) return 0;
    if (ab == 0) return 1;
    simsimd_f64_t squares_arr[2] = {a2, b2};
    float64x2_t squares = vld1q_f64(squares_arr);
    ......
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    vst1q_f64(squares_arr, rsqrts);
    simsimd_distance_t result = 1 - ab * squares_arr[0] * squares_arr[1];
    return result > 0 ? result : 0;
}

@GoWind
Copy link
Contributor Author

GoWind commented Dec 8, 2024

I was able to get a benchmark of running the cosine search across a database against a query and the SimSIMD version run upto 5x faster than the plain Rust version.

I am still not sure how to use spdot here instead of a cosine similarity. Can you give me some pointers on how I can employ spdotand benchmark it (now I have most of the foundation for a TF-IDF implementation, reckon it should be easier)

Benchmarks done on a M2 Pro with 32GB of Ram

For the dataset I used the first 10k lines from the Leipzig dataset head -n 10000 leipzig1m.txt > leipzig10000.txt

warning: `simsimd` (bench "tfidf") generated 1 warning (run `cargo fix --bench "tfidf"` to apply 1 suggestion)
    Finished bench [optimized] target(s) in 0.08s
     Running scripts/bench_tfidf.rs (target/release/deps/tfidf-601c525378540ae8)
Gnuplot not found, using plotters backend
Benchmarking TF-IDF Similarity/SimSIMD Cosine Similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.7s or enable flat sampling.
TF-IDF Similarity/SimSIMD Cosine Similarity
                        time:   [120.97 ms 121.69 ms 122.57 ms]
                        change: [-10.060% -6.1234% -2.7896%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking TF-IDF Similarity/Rust plain Cosine similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.4s.
TF-IDF Similarity/Rust plain Cosine similarity
                        time:   [534.30 ms 535.29 ms 536.42 ms]
                        change: [-8.7229% -5.3702% -2.4358%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

@ashvardanian
Copy link
Owner

Thanks for sharing these benchmarks—great to see SimSIMD achieving a 5x speedup over the Rust implementation! To use spdot, you can directly compute the weighted dot product of TF-IDF vectors without normalizing them, as required for cosine similarity. This is both faster and aligns naturally with the sparse representation of TF-IDF.

For benchmarking, precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values) and replace the cosine similarity calculation with spdot. This avoids unnecessary operations like normalization and takes full advantage of sparsity. Let me know how the updated results compare—excited to see the numbers!

@GoWind
Copy link
Contributor Author

GoWind commented Dec 8, 2024

you can directly compute the weighted dot product of TF-IDF vectors without normalizing them, as required for cosine similarity. This is both faster and aligns naturally with the sparse representation of TF-IDF.
precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values)

Got it. thanks for the pointers. I will setup the benchmarks and share numbers again. This does look promising !

ashvardanian added a commit to ashvardanian/stringzilla-benchmarks-rs that referenced this pull request Dec 8, 2024
@ashvardanian
Copy link
Owner

Hey @GoWind! I've added a placeholder file for this benchmark on the main-dev branch of this repo. Let's move it there 🤗

Thanks!

@GoWind
Copy link
Contributor Author

GoWind commented Dec 9, 2024

Sure, will move the benchmark code to the repo linked !

@GoWind
Copy link
Contributor Author

GoWind commented Dec 10, 2024

Hi @ashvardanian , I noticed that there isn't a NEON implementation for simsimd_spdot_weights_u16 (or simsimd_spdot_counts_u16 either). I think it might make sense to add it as part of the benchmark as I am running this on a Mac (and other ARM devices might benefit anyway)

Also, I wrote the tfidf benchmark using f64 and realized that there isn't a spdot version for u16 with f32 or f64 weights either. Does it make sense to add them as well ?

Also , there seems to be simsimd native bf16 type and a half bf16 types and both atleast at the type level are not compatible. Do you think it might be safe to use the half::bf16 types and then cast them to simsimd bf16s ?

@GoWind
Copy link
Contributor Author

GoWind commented Jan 2, 2025

Hi @ashvardanian , a Happy New Year,

I didn't have a lot of time to work on this, but could squeeze out some time during the holidays to finish up some stuff.
I wrote a f32 neon implementation for spdot. I noticed that there seems to be only implementation for sve2 for bf16. My algorithm doesn't seems to be slower than the serial one and I was wondering if you could provide any insights for me on maybe speeding it up ?

The vectors I am comparing are about 20-50 elements big. I will try to do one more round with a larger number of elements to see if the algorithm gets faster beyond a certain size.

Would you also be interested in a f16 implementation on NEON ? If there are any improvements I can make to the f32 version, I think I might be able to port them to the f16 / bf16 version as well.

All measurements done on an M2 PRO with 32GB of RAM.

Screenshot 2025-01-02 at 16 37 16

The SIMD version was reporting the number of matched items between the Sparse vectors
incorrectly. Fix the bug and remove unnecessary variables from the functions

Add a bin file with custom loop for calculating benchmarks. This is faster to run
and does averaging of timing runs for various vector sizes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants