Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making PROFILE take in ONT data #199

Open
wants to merge 31 commits into
base: dev
Choose a base branch
from
Open

Conversation

simonleandergrimm
Copy link
Collaborator

This PR makes PROFILE take in ONT data. To achieve this, I added a second samtools process, and ribo identification using minimap2. Again, @harmonbhasin I'm using my own ribo reference for minimap2. Let me know if I should add the creation of said reference to the index workflow.

Copy link
Collaborator

@harmonbhasin harmonbhasin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I would add the creation of the human reference to the index workflow. Once you do that I can approve this.

ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, bbduk_suffix, !single_end)
if (params.ont) {
ribo_path = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reference that needs to be created in the index workflow and referenced in the index

process {
'''
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

process {
"""
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

} else {
ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, ribo_suffix, !single_end)
}
// Run taxonomic profiling separately on ribo and non-ribo reads
tax_ribo_ch = TAXONOMY_RIBO(ribo_ch.match, kraken_db_ch, "D", bracken_threshold, single_end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I expect Kraken to do a bad job on generating a taxonomic profile here

But I still think this PR is in the right direction: if we want to do better taxonomic profiling for ONT data that should be a separately-prioritized followup

}

// Return aligned and unaligned reads separately as FASTQs
process SAMTOOLS_SEPARATE {
Copy link
Contributor

@willbradshaw willbradshaw Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This process is a superset of SAMTOOLS_FILTER. You don't need both.

(I would implement this as two calls to a general SAMTOOLS_FASTQ process with different input parameters, followed by a channel merge.)

ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, bbduk_suffix, !single_end)
if (params.ont) {
ribo_path = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
mapped_ch = MINIMAP2_RIBO(reads_ch, ribo_path, ribo_suffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking from ignorance here, but why should BBDuk not work on ONT data? It's calculating a fraction of matching k-mers so read length per se shouldn't be fatal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonleandergrimm I'd still like this to be addressed before we move forward here, I remain uneasy about switching to an alignment-based approach without having a good understanding of how this might affect the results.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willbradshaw Can you say more about what your worry is? Mostly that minimap2 is more resource-intensive? I think the files are small enough that this doesn't matter that much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On priors I'd assume minimap2 to be better than bbduk (I'd assume the latter needs a bunch of playing around to find the fraction of matching k-mers that gives equivalent performance to minimap2).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I'd be pretty keen to get the ONT version of the pipeline up and running fast and do optimization later (especially because this part of the pipeline shouldn't affect the Illumina runs).

'''
}
}
then {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented in another PR, I'd prefer a test of the actual content, not just that there is content.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed.

@simonleandergrimm simonleandergrimm changed the base branch from simon-human-read-filtering to dev February 14, 2025 19:47
@simonleandergrimm
Copy link
Collaborator Author

simonleandergrimm commented Feb 17, 2025

@willbradshaw Before creating the new index, tests here will fail. It would still be good to get your feedback in the meantime.

tax_ribo_ch = TAXONOMY_RIBO(ribo_ch.match, kraken_db_ch, "D", bracken_threshold, single_end)
tax_noribo_ch = TAXONOMY_NORIBO(ribo_ch.nomatch, kraken_db_ch, "D", bracken_threshold, single_end)
tax_ribo_ch = TAXONOMY_RIBO(
ont ? ribo_ch.reads_mapped : ribo_ch.match,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not particularly happy with this, but I don't want to call the output of minimap2 "match" and "no_match"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is terrible, but given that the two branches of the if statement produce quite different things (k-mer matches vs alignments) it might be better to give them different names in the if statement, then define some new variables (maybe ribo_in and noribo_in) to hold them. Then the calls to TAXONOMY can take ribo_in and noribo_in as their first input.

Also, minor, but I'd prefer if we went back to having multiple arguments per line, I'm not a fan of these very long one-arg-per-line function calls.

tax_ribo_ch = TAXONOMY_RIBO(ribo_ch.match, kraken_db_ch, "D", bracken_threshold, single_end)
tax_noribo_ch = TAXONOMY_NORIBO(ribo_ch.nomatch, kraken_db_ch, "D", bracken_threshold, single_end)
tax_ribo_ch = TAXONOMY_RIBO(
ont ? ribo_ch.reads_mapped : ribo_ch.match,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is terrible, but given that the two branches of the if statement produce quite different things (k-mer matches vs alignments) it might be better to give them different names in the if statement, then define some new variables (maybe ribo_in and noribo_in) to hold them. Then the calls to TAXONOMY can take ribo_in and noribo_in as their first input.

Also, minor, but I'd prefer if we went back to having multiple arguments per line, I'm not a fan of these very long one-arg-per-line function calls.

ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, bbduk_suffix, !single_end)
if (params.ont) {
ribo_path = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
mapped_ch = MINIMAP2_RIBO(reads_ch, ribo_path, ribo_suffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonleandergrimm I'd still like this to be addressed before we move forward here, I remain uneasy about switching to an alignment-based approach without having a good understanding of how this might affect the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants