Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace mx by scaled #1139

Merged
merged 11 commits into from
Oct 8, 2020
Merged

Replace mx by scaled #1139

merged 11 commits into from
Oct 8, 2020

Conversation

xmnlab
Copy link
Contributor

@xmnlab xmnlab commented Aug 2, 2020

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

Fixes #1134

@codecov
Copy link

codecov bot commented Aug 2, 2020

Codecov Report

Merging #1139 into latest will increase coverage by 9.10%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1139      +/-   ##
==========================================
+ Coverage   84.13%   93.24%   +9.10%     
==========================================
  Files          99       78      -21     
  Lines        9218     5922    -3296     
==========================================
- Hits         7756     5522    -2234     
+ Misses       1462      400    -1062     
Flag Coverage Δ
#rusttests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sourmash/minhash.py 96.98% <100.00%> (-0.30%) ⬇️
src/core/src/sketch/nodegraph.rs
src/core/src/index/bigsi.rs
src/core/src/ffi/mod.rs
src/core/src/lib.rs
src/core/src/index/sbt/mod.rs
src/core/src/signature.rs
src/core/src/index/mod.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/sketch/ukhs.rs
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7a8e5ac...0a858fa. Read the comment docs.

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 4, 2020

@ctb @luizirber it is still raising some errors. I will work on that again today.
is this change what you had in mind? Am I going in the right direction? thanks!

@ctb
Copy link
Contributor

ctb commented Aug 4, 2020

on a quick skim, this looks like the right direction to me!

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 4, 2020

thanks @ctb! I will try to have it ready for review today!

@ctb
Copy link
Contributor

ctb commented Aug 4, 2020 via email

@ctb
Copy link
Contributor

ctb commented Aug 4, 2020

hi @xmnlab note that we just changed the default main branch in this repository to latest - at some point before merge, please update your base to match (click Edit, change base to latest). Thanks!

@xmnlab xmnlab changed the base branch from master to latest August 4, 2020 18:22
@luizirber
Copy link
Member

Nice PR, @xmnlab!

But I think we need to be more explicit in the consequences of changing max_hash to scaled in the functions. My main concern is that current code using max_hash will continue "working" with scaled (since they are the same type), but the meaning of the parameter changed completely!

We can use this opportunity to bring it more in line with the changes happening on the Python side in #1128.
To make it break explicitly and force other crates using sourmash to do the changes, maybe define new as

pub fn new(
  scaled: u64,
  ksize: u32,
  hash_function: HashFunctions,
  track_abundance: bool,
  seed: u64,
  num: Option<u32>,
) -> KmerMinHash

and similarly in the FFI

#[no_mangle]
pub unsafe extern "C" fn kmerminhash_new(
    scaled: u64,
    k: u32,
    prot: bool,
    dayhoff: bool,
    hp: bool,
    track_abundance: bool,
    seed: u64,
    num: u32
) -> *mut SourmashKmerMinHash {

(let @ctb chime in before implementing these changes, @xmnlab =])

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 6, 2020

thanks @luizirber! I updated my branch on top of the "latest" branch. about your suggestions, sounds good for me, I am waiting for a green light from @ctb :)

@ctb
Copy link
Contributor

ctb commented Aug 6, 2020 via email

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 6, 2020

@ctb I will work on that! thanks! :)

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 8, 2020

not sure if I am doing anything wrong, or if it is relevant, but rust scaled_for_max_hash and python _get_scaled_for_max_hash return a different value for number 1

python result -> 18446744073709551616
rust result   -> 18446744073709551615 

if I change the python operation to use // instead of / it seems it returns the same rust result:

int(round(get_minhash_max_hash() // 1, 0))
18446744073709551615 

@luizirber
Copy link
Member

not sure if I am doing anything wrong, or if it is relevant, but rust scaled_for_max_hash and python _get_scaled_for_max_hash return a different value for number 1

python result -> 18446744073709551616
rust result   -> 18446744073709551615 

Good catch! That is definitely a bug, but we never triggered it because setting max_hash=1 is not a viable use case.

if I change the python operation to use // instead of / it seems it returns the same rust result:

int(round(get_minhash_max_hash() // 1, 0))
18446744073709551615 

Instead of fixing the Python code, I think we should fix the Rust code (to guarantee it has the same behavior as any existing code using sourmash).

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 8, 2020

Instead of fixing the Python code, I think we should fix the Rust code (to guarantee it has the same behavior as any existing code using sourmash).

this is the current rust implementation:

pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
    match max_hash {
        0 => 0,
        _ => u64::max_value() / max_hash,
    }
}

maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense?

@luizirber
Copy link
Member

maybe I am wrong, but as far as I understood, the rust result is the u64 max value and the python result is equal to the rust max value +1, so the rust call will break (integer overflow). does It make sense?

You're correct, it was a trick question! (not really, I didn't notice that, and you analyzed correctly =])

In this case, let's keep scaled_for_max_hash in Rust as is.

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 8, 2020 via email

@luizirber
Copy link
Member

Sounds pretty good! Do you want I change it here in this PR? Or should I fix that in a new one?

Here is good

@luizirber
Copy link
Member

so, I just hit this in another PR, and.. I think the Rust function is broken =]

(it wasn't used in any place, and there are no tests, so...)

I tried checking the scaled for 184467440737095520, and Python gives 100 while Rust says 99...

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 9, 2020

do you have any idea how to fix that? is there any scientific paper that describes this function?

@luizirber
Copy link
Member

luizirber commented Aug 9, 2020

do you have any idea how to fix that?

Before thinking on how to fix it, this is the perfect example for using property-based testing (because we have two functions that applied in sequence should return the initial value). In Python we have hypothesis tests, and in Rust there is proptest.

In tests/test__minhash_hypothesis.py:

 @given(st.integers(min_value=1, max_value=2**64 - 1))                                                                                                                                        
 def test_scaled_max_hash_conversion(scaled):                                                                                                                                                 
     max_hash = _get_max_hash_for_scaled(scaled)                                                                                                                                              
     new_scaled = _get_scaled_for_max_hash(max_hash)                                                                                                                                          
                                                                                                                                                                                              
     assert scaled == new_scaled  

in src/core/tests/minhash.rs:

proptest! {
#[test]
fn scaled_and_max_hash(scaled in u64::ANY) {
    let max_hash = max_hash_for_scaled(scaled).unwrap();
    let new_scaled = scaled_for_max_hash(max_hash);

    assert_eq!(scaled, new_scaled);
}
}

is there any scientific paper that describes this function?

Ouch. Not yet =]

The main issue is that the definition of the function doesn't worry enough about numerical precision...
max_hash and scaled are the inverse of each other
max_hash = max_value / scaled
scaled = max_value / max_hash
where max_value is (2 ** 64) - 1 for 64-bit integers.

But, that division is not well defined (but rounding floats is definitely not the solution! Probably proper floor and ceil operations). I'll think a bit about it and comment again later.

@luizirber
Copy link
Member

OMG what a gigantic rabbit hole.

But, that division is not well defined (but rounding floats is definitely not the solution! Probably proper floor and ceil operations). I'll think a bit about it and comment again later.

The main design constrains:

  • We need to remain compatible with already existing scaled MinHash sketches
  • Realistic scaled values are in the range (1, 100000)

From that:

  • The tests I suggested point very quickly that it is hard to invert max_hash/scaled for scaled values outside the range (like 42537073439)
  • Fixing for values outside the range might mess values inside the range...
  • Hypothesis tests in Python with the realistic range don't fail with the current Python impl.

So I guess we should be bug-compatible in Rust too.

Thoughs, @ctb?

@ctb
Copy link
Contributor

ctb commented Aug 9, 2020

My hot takes --

  • bug compatibility is good, for now!
  • is there a way to ensure that the round trip continues to work, as in scaled == get_scaled_for_max_hash(get_max_hash_for_scaled(scaled)), and each function returns integer values?
  • some tests (especially ones that discover absurdities) would be most welcome

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 9, 2020

thanks @ctb and @luizirber for all the explanations and suggestions.

but I am not too sure how to move forward now ...

@luizirber
Copy link
Member

* is there a way to ensure that the round trip continues to work, as in `scaled == get_scaled_for_max_hash(get_max_hash_for_scaled(scaled))`, and each function returns integer values?

* some tests (especially ones that discover absurdities) would be most welcome

That's what the property tests are doing (it's implemented as you described in the first item, and hypothesis/proptest generate input data like the second one).

In the Hypothesis case we can be more explicit and provide some examples for the values we tend to use:

@given(st.integers(min_value=1, max_value=100000))
@example(1)
@example(10)
@example(100)
@example(1000)
@example(2000)
@example(10000)
def test_scaled_max_hash_conversion(scaled):
    max_hash = _get_max_hash_for_scaled(scaled)
    new_scaled = _get_scaled_for_max_hash(max_hash)
    assert scaled == new_scaled

thanks @ctb and @luizirber for all the explanations and suggestions.

but I am not too sure how to move forward now ...

How about:

  • Add the hypothesis and proptest tests
  • fix the Rust scaled_for_max_hash and max_hash_to_scaled to match the Python impl (using float numbers, and rounding)
  • PROFIT! =P

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 10, 2020

great! I have a good picture now to start! thanks!

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 10, 2020

fix the Rust scaled_for_max_hash and max_hash_to_scaled to match the Python impl (using float numbers, and rounding)

it seems it loses precision

>>> println!("using f64: {}", u64::max_value() as f64);
>>> println!("using u64: {}", u64::max_value());

using f64: 18446744073709552000
using u64: 18446744073709551615

and casting to u64, it will truncate to the max of u64, so it will never be the same as python (for this example the result would be 18446744073709551616):

>>> println!("using ceil: {}", ((u64::max_value() as f64) / 1.0f64).ceil() as u64);
>>> println!("using u64:  {}", u64::max_value());

using ceil: 18446744073709551615
using u64:  18446744073709551615

any idea about how we should handle this problem?

@luizirber
Copy link
Member

luizirber commented Aug 10, 2020

it seems it loses precision

>>> println!("using f64: {}", u64::max_value() as f64);
>>> println!("using u64: {}", u64::max_value());

using f64: 18446744073709552000
using u64: 18446744073709551615

yikes!

(is thix evcxr? Cool!)

and casting to u64, it will truncate to the max of u64, so it will never be the same as python (for this example the result would be 18446744073709551616):

>>> println!("using ceil: {}", ((u64::max_value() as f64) / 1.0f64).ceil() as u64);
>>> println!("using u64:  {}", u64::max_value());

using ceil: 18446744073709551615
using u64:  18446744073709551615

the max value for 64 bits is 2**64 -1, which is 18446744073709551615, so the first version (using ceil) is correct. I think you can use that for implementing the conversion?

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 10, 2020

(is thix evcxr? Cool!)

yes! that is very helpful!

I think you can use that for implementing the conversion?

but the value returned by the current python implementation is still different 18446744073709551616 ... to have the same result with the python implementation I would need to do int(round(get_minhash_max_hash() // 1, 0))

not sure if I am missing something

@luizirber
Copy link
Member

but the value returned by the current python implementation is still different 18446744073709551616 ... to have the same result with the python implementation I would need to do int(round(get_minhash_max_hash() // 1, 0))

_get_max_hash_for_scaled(1) is 18446744073709551615 here, I see 18446744073709551616 when doing _get_scaled_for_max_hash(1).

Since it is out of the realistic scaled values range (I mean, a Scaled MinHash with scaled=18446744073709551616 or max_hash=1 will literally have only (0, 1) as possible values!), I don't think it is a big issue to leave as is (returning 18446744073709551615).

(And even tho we said "bug-compatible", that is a bug we want to fix, because 18446744073709551616 is out of the range of possible 64-bit integers... It is easier to see this in hexadecimal:

>>> "{:X}".format(18446744073709551616)
'10000000000000000'

>>> "{:X}".format(18446744073709551615)
'FFFFFFFFFFFFFFFF'

)

@xmnlab
Copy link
Contributor Author

xmnlab commented Aug 17, 2020

sorry .. I didn't have time too much last week. my vacations start on Monday so I will be back here tomorrow!! thanks for the patience :)

@@ -622,12 +624,12 @@ impl KmerMinHash {
self.check_compatible(other)?;

let mut combined_mh = KmerMinHash::new(
self.num,
scaled_for_max_hash(self.max_hash),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using

Suggested change
scaled_for_max_hash(self.max_hash),
self.scaled,

here? self.scaled is still using max_hash in its impl, but we can fix that later (and avoid a mention to self.max_hash here =])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luizirber sounds good. could you add more details about that pls? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently we have thousands of warnings in CI (example) triggered because we are still using .max_hash internally. Eventually (for 4.0) we want to avoid using/documenting/suggesting .max_hash and focus on .scaled instead, including constructors (and hence why we deprecated .max_hash in 3.5 and generate all the warnings)

But that's all in the Python side, and this PR is for changing the Rust side. So, you don't need to worry about it if you don't want to, but I thought it was relevant to keep in mind =]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see .. thanks for the explanation :)

@luizirber
Copy link
Member

@luizirber should I also change KmerMinHashBTree signature?

Yes, please! KmerMinHash and KmerMinHashBTree have the same API but don't have a trait to connect them, so for now we need to change in both places... #1055 will fix it, but is not a priority.

@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 5, 2020

hey everyone, sorry for the delay here. I will try to finish this as soon as possible.

I have some issue with 2 tests, but not sure yet why ...

running 19 tests
test invalid_dna ... ok
test check_errors ... ok
test dayhoff ... ok
test hp ... ok
test load_save_minhash_sketches ... ok
test load_save_minhash_sketches_abund ... FAILED
test max_for_scaled ... ok
test merge ... ok
test merge_empty_scaled ... ok
test load_save_minhash_dna ... ok
test load_save_minhash_dayhoff ... ok
test load_save_minhash_hp ... ok
test similarity ... ok
test similarity_2 ... ok
test similarity_3 ... ok
test throws_error ... ok
test prop_merge ... FAILED
test oracle_mins_scaled ... ok
test oracle_mins ... ok

failures:

---- load_save_minhash_sketches_abund stdout ----
thread 'load_save_minhash_sketches_abund' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:499:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- prop_merge stdout ----
proptest: FileFailurePersistence::SourceParallel set, but failed to find lib.rs or main.rs
thread 'prop_merge' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:324:16
...
thread 'prop_merge' panicked at 'called `Result::unwrap()` on an `Err` value: MismatchScaled', src/core/tests/minhash.rs:324:16
thread 'prop_merge' panicked at 'Test failed: called `Result::unwrap()` on an `Err` value: MismatchScaled; minimal failing input: seq1 = "GGGTCC", seq2 = "CTTTGA"
        successes: 0
        local rejects: 0
        global rejects: 0
', src/core/tests/minhash.rs:299:1


failures:
    load_save_minhash_sketches_abund
    prop_merge

test result: FAILED. 17 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out

error: test failed, to rerun pass '--test minhash'

any tip about this?

@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 5, 2020

it seems the problem is the new function max_hash_for_scaled

In:

pub fn max_hash_for_scaled(scaled: u64) -> u64 {
    match scaled {
        0 => 0,
        1 => u64::max_value(),
        _ => (u64::max_value() as f64 / scaled as f64) as u64,
    }
}

pub fn scaled_for_max_hash(max_hash: u64) -> u64 {
    match max_hash {
        0 => 0,
        _ => u64::max_value() / max_hash,
    }
}

let scaled = 10;
let max_hash = max_hash_for_scaled(scaled);

println!("max_hash (from scaled): {}", max_hash);
println!("scaled: {}", scaled);
println!("scaled (from max_hash): {}", scaled_for_max_hash(max_hash));

Output:

max_hash (from scaled): 1844674407370955264
scaled: 10
scaled (from max_hash): 9

@luizirber any recommendation about this?

@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 5, 2020

I used the same approach for scaled_for_max_hash (converting to float) and it seems it worked, at least for the tests.

I tested locally the conversion between scaled_for_max_hash and max_hash_for_scaled and it seems it won't work in some cases (as the conversion to integer lost information). not sure if it is relevant for this PR.

let me know if I should take another direction :) thanks!

@xmnlab xmnlab changed the title [WIP] Replace mx by scaled Replace mx by scaled Sep 7, 2020
@xmnlab xmnlab marked this pull request as ready for review September 7, 2020 01:22
@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 7, 2020

@ctb @luizirber finally it is ready for review. thanks for the patience!

@ctb
Copy link
Contributor

ctb commented Sep 9, 2020

hi @xmnlab thank you! I am not competent to review the rust code changes in detail, but everything looks good to me on the Python side (which you basically didn't change - yay :). @luizirber is working to a deadline so may take a few more days before he gets to review this, but let me be the first to say "good job!"

@xmnlab
Copy link
Contributor Author

xmnlab commented Sep 9, 2020

Thank you so much for the feedback @ctb <3
I am still learning Rust .. but I will be very happy to help if there is any other issue for beginner :)

@xmnlab
Copy link
Contributor Author

xmnlab commented Oct 8, 2020

hey everyone! just a friendly reminder about this PR :) let me know if I should address anything else here :) thanks!

Comment on lines +245 to +248
lib.HASH_FUNCTIONS_MURMUR64_DAYHOFF if dayhoff else
lib.HASH_FUNCTIONS_MURMUR64_HP if hp else
lib.HASH_FUNCTIONS_MURMUR64_PROTEIN if is_protein else
lib.HASH_FUNCTIONS_MURMUR64_DNA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense that this works, but I've never seen it being used this way =]

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks @xmnlab!

@luizirber luizirber merged commit 5e8b66a into sourmash-bio:latest Oct 8, 2020
@xmnlab xmnlab deleted the use-scaled-instead-of-max_hash branch October 8, 2020 23:13
@xmnlab
Copy link
Contributor Author

xmnlab commented Oct 8, 2020

thanks @ctb and @luizirber for all the patience and attention :)

@ctb
Copy link
Contributor

ctb commented Oct 9, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

update MinHash constructor API in Rust to use scaled instead of max_hash
3 participants