Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastmultigather returns matches only when --scaled is equal to scaled of the query #398

Open
AnneliektH opened this issue Jul 25, 2024 · 4 comments · May be fixed by #488
Open

fastmultigather returns matches only when --scaled is equal to scaled of the query #398

AnneliektH opened this issue Jul 25, 2024 · 4 comments · May be fixed by #488

Comments

@AnneliektH
Copy link

Hey, I am having some issues with fastmultigather and the scale of signatures:

Everything is in: /group/ctbrowngrp2/scratch/annie/2023-sourmash-viruses

I created protein sketches with the custom_sketch.py, that @ctb wrote.
Sorry I cannot find your original repo, but here is the script

It is one sketch per all proteins for 1 organism, in this case a virus.

I created sketches at 3 different ksizes (7,10,12), all at scaled 2. I then concatenated all 3 into one zip file: /group/ctbrowngrp2/scratch/annie/2023-sourmash-viruses/results/signatures/240703_RefSeq.proteins.zip

        python ../workflow/scripts/custom-sketch.py \
        {input.faa} --ksize {wildcards.ksize} --scaled 2 \
        -o {output.sig}

When I query this zipfile against the ICTV database using fastmultigather, I only get matches when the scaled is 2. I also want to use scaled=10 and scaled=100. I thought i could use lower scaled querys and that it automatically scales to the scale I am asking for.

        sourmash scripts fastmultigather \
        ../results/signatures/240703_RefSeq.proteins.zip
         /home/ntpierce/2023-spillover/output.vmr/vmr_MSL38_v1.protein.zip \
        -k {wildcards.ksize} --scaled {wildcards.scaled} \
        -m protein -c {threads} -t 0 

Only works when --scaled 2. If 10 or 100, output is:

No matches to 'NC_020858.2'
No matches to 'NC_048199.1'
No matches to 'NC_003387.1'

For all signatures within the zip

I then created protein sketches with custom_sketch.py at scaled=10 and scaled=100, and used those as queries. If I do it that way, I do get matches from fastmultigather.
Is this something that is not yet enabled for fastmultigather?

@luizirber luizirber transferred this issue from sourmash-bio/branchwater Jul 28, 2024
@ctb
Copy link
Collaborator

ctb commented Aug 13, 2024

ok, dug into this a little bit.

For a sketch against itself, with the original sketch at a scaled of 1000, you can't downsample like so:

sourmash scripts fastmultigather 47.sig.zip 47.sig.zip --scaled=10000

In addition, but slightly less? annoying, automatic downsampling doesn't work:

# create a 10k sig
sourmash sig downsample --scaled=10000 47.sig.zip -o 47.10k.sig.zip

# this now fails:
sourmash scripts fastmultigather 47.10k.sig.zip 47.sig.zip

# ...unless you specify -s 10000
sourmash scripts fastmultigather 47.10k.sig.zip 47.sig.zip -s 10000

so I think I can reproduce the problem.

@ctb
Copy link
Collaborator

ctb commented Aug 13, 2024

I think I've tracked down the problem -

this code in the branchwater plugin,

let selected = coll.select(selection)?;

calls this code in sourmash core,

https://github.com/sourmash-bio/sourmash/blob/bc2297050b6db10b144916700b4550ec50b26a8f/src/core/src/manifest.rs#L223-L228

but naively it seems to me like

row.scaled <= scaled as u64

should be doing the right thing? Here scaled is the query scaled and row.scaled is the database scaled, so in the above example it should be returning the right thing?

@ctb
Copy link
Collaborator

ctb commented Aug 18, 2024

It looks like I was wrong - adding some debugging prints, row.scaled is 10_000, scaled is 1000. So the condition fails.

@ctb
Copy link
Collaborator

ctb commented Oct 26, 2024

so confusing 😆 . Turns out the issue is that the query collection cannot be loaded with scaled=1000, which is set as the default. Possible solution is to set no default, and determine based on first sketch; other possible solutions exist :).

Fixing over in #488.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants