Adding human read filtering to subsetTrim #198

Closed

Conversation

@simonleandergrimm (Collaborator) commented Feb 12, 2025

This PR adds a human read filtering step, which we need for Zephyr analysis. As part of this, I'm adding minimap2 and a samtools filtering step as processes which are used in subsetTrim.

At the moment I'm still using a minimap2 reference that lives in my own bucket. @harmonbhasin let me know if I should fix this in this PR; I can also fix it in a future PR. I'm hesitant to touch the reference workflow, as rerunning it takes a lot of time.

@harmonbhasin (Collaborator) left a comment

LGTM. Before merging, I would also add the creation of the human reference to the index workflow.
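For concreteness, such an index-workflow process could look roughly like the sketch below; the process name, input channel, label, and the map-ont preset are assumptions for illustration, not the repository's actual code:

// Hypothetical sketch of an index-workflow process that builds the minimap2
// human index from a genome FASTA. The output filename follows the CHM13v2.0
// reference used elsewhere in this PR.
process MINIMAP2_HUMAN_INDEX {
    label "minimap2"

    input:
        path(genome_fasta)

    output:
        path("chm13v2.0.mmi"), emit: index

    script:
        """
        # Build an ONT-preset minimap2 index from the human genome
        minimap2 -x map-ont -d chm13v2.0.mmi ${genome_fasta}
        """
}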

@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

Member

If this only applies to ONT, better to include that in the config name, so no one is surprised later when it doesn't do anything on Illumina data.
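For example (a hypothetical rename, not necessarily the name ultimately chosen):

// Human filtering (ONT runs only for now)
ont_human_read_filtering = false // Whether to filter human reads; currently has no effect on Illumina data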

Contributor

My preference would probably be to keep this (and honestly ont too) out of run.config until the main run workflow can actually run ONT data; right now they're just confusing for the user. We have separate config files for configuring experimental runs.

@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {
         cleaned_ch = FILTLONG(inter_ch)
+        if (human_read_filtering) {
+            minimap2_human_index = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

I don't think we should be checking references to personal buckets into mgs-workflow. This should pull from the configured index.
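A minimal sketch of what that could look like, assuming the index workflow publishes the .mmi under the configured reference directory (params.ref_dir and the subdirectory layout are assumptions):

// Hypothetical: resolve the human index from the reference directory
// produced by the index workflow instead of a hardcoded bucket path.
minimap2_human_index = "${params.ref_dir}/results/minimap2-human-index/chm13v2.0.mmi"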

Collaborator Author

Adding minimap2 indices to the index workflow now.

Contributor

Definitely shouldn't be hardcoding any path into the workflow like this, personal bucket or otherwise.

process {
    '''
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

ditto

process {
    """
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

ditto

@simonleandergrimm (Collaborator Author)

I've now added minimap2 index generation in index.nf. Once that PR goes through, I'll update this PR and #199 to use the appropriate index locations.

@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

Contributor

Why should this only be applicable to ONT? Illumina data generated from Zephyr would also need it.

Collaborator Author

Well, it shouldn't be, but the code that does the human read filtering can currently only take ONT data.


@jeffkaufman (Member) commented Feb 14, 2025

"Currently only functional on ONT data"

@@ -0,0 +1,22 @@
+// Detection and removal of contaminant reads, using indices created for ONT cDNA data

Contributor

This process doesn't do any read removal; indeed it doesn't return reads at all.

@@ -0,0 +1,21 @@
+// Return reads that did not align to reference as FASTQ (streamed version)

Contributor

I'm not convinced this should be a separate process from the minimap process. See the new BOWTIE2 process for an example of how to chain an aligner and samtools together to get separate mapped and unmapped reads; this also makes testing easier since you can directly compare the input and output FASTQs.

If you definitely want a standalone samtools fastq process, I would make it more general than this, probably call it SAMTOOLS_FASTQ and allow the user to pass in arbitrary argument strings rather than hardcoding -n -f 4.
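As a rough illustration of the chained pattern being suggested (this is not the repository's BOWTIE2 process; the process name, label, output names, and the map-ont preset are assumptions):

// Hypothetical sketch: align ONT reads against the human index and emit
// mapped and unmapped reads as separate FASTQs from a single process, so
// input and output FASTQs can be compared directly in tests.
process MINIMAP2_AND_FILTER {
    label "minimap2_samtools"  // assumed container label

    input:
        tuple val(sample), path(reads)
        path(index)

    output:
        tuple val(sample), path("${sample}_nohuman.fastq.gz"), emit: unmapped
        tuple val(sample), path("${sample}_human.fastq.gz"), emit: mapped
        tuple val(sample), path("${sample}.sam.gz"), emit: sam

    script:
        """
        minimap2 -ax map-ont ${index} ${reads} > ${sample}.sam
        # -f 4 keeps unmapped reads; -F 0x904 keeps primary mapped reads;
        # -n avoids appending /1 or /2 to read names
        samtools fastq -n -f 4 ${sample}.sam | gzip -c > ${sample}_nohuman.fastq.gz
        samtools fastq -n -F 0x904 ${sample}.sam | gzip -c > ${sample}_human.fastq.gz
        gzip ${sample}.sam
        """
}

If a standalone process is kept instead, taking the samtools arguments as a val input (e.g. "-n -f 4") would avoid hardcoding human-filtering flags into a process named after a general-purpose tool.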

then {
    // Should run without failures
    assert process.success
    // Both @SQ headers and alignments should be present

Contributor

I'd ideally like a test for something a bit more substantive than just "is it empty". (I don't know enough about your process to know what that more substantive test should be, though)
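For illustration only, one possibility along those lines, assuming the process emits a SAM file as the second element of its output tuple and that the test input FASTQ is uncompressed (the channel name and input path are placeholders):

then {
    assert process.success
    // Hypothetical: in SAM output mode, minimap2 reports every input read
    // whether or not it mapped, so compare read IDs rather than only
    // checking that the file is non-empty.
    def samIds = new File(process.out.sam[0][1]).readLines()
            .findAll { !it.startsWith("@") }
            .collect { it.split("\t")[0] } as Set
    def fastqLines = new File("test_reads.fastq").readLines()  // placeholder path
    def fastqIds = (0..<fastqLines.size()).step(4)
            .collect { fastqLines[it].tokenize()[0].substring(1) } as Set
    assert samIds == fastqIds
}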

// Should run without failures
assert process.success

// Output FASTQ ids should be identical to unmapped read ids in input SAM

Contributor

This is more like it

@@ -56,7 +56,8 @@ workflow RUN {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
         params.adapters, params.single_end,
-        params.ont, params.random_seed)
+        params.ont, params.random_seed,
+        params.human_read_filtering)

Contributor

Order here doesn't match order in SUBSET_TRIM definition
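For reference, a positional call like this has to match the order of the workflow's take block exactly; the actual order in the subsetTrim definition isn't shown in this thread, so the following is purely illustrative:

// Hypothetical take block: whatever the real order is, both RUN and
// RUN_DEV_SE must pass their arguments in exactly this sequence.
workflow SUBSET_TRIM {
    take:
        samplesheet_ch
        n_reads
        adapters
        single_end
        ont
        random_seed
        human_read_filtering
    main:
        subset_ch = samplesheet_ch  // placeholder for the real subsetting/trimming logic
    emit:
        subset = subset_ch
}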

@@ -38,7 +38,8 @@ workflow RUN_DEV_SE {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
         params.adapters, params.single_end,
-        params.random_seed, params.ont)
+        params.ont, params.random_seed,
+        params.human_read_filtering)

Contributor

Ditto

@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {

Contributor

This change currently seems a bit pointless to me. As-is, all the workflow downstream of this point does with human reads is count them (classify with Kraken, then count with Bracken, then return count tables). It doesn't save them anywhere. This change basically does the same thing in a different, uglier way (we can count the human reads as the number of reads lost).

I assume the reasoning behind this is something to do with privacy, but since we aren't analysing the human reads beyond classifying them (which we need to do anyway to decide which ones to discard) I'm not sure it's getting you anything.

Collaborator Author

Maybe good to briefly chat about this point; I don't fully understand what you're referring to.

@simonleandergrimm changed the base branch from dev to simon-minimap-indices, February 14, 2025 15:10

@simonleandergrimm (Collaborator Author)

Closing this PR after a conversation with @willbradshaw; I'll port some of these changes to #199 and a future extraviral reads PR.
