Adding human read filtering to subsetTrim #198
Conversation
LGTM. Before merging, I would also add the creation of the human reference to the index workflow.
@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.
My preference would probably be to keep this (and honestly ont too) out of run.config until the main run workflow can actually run ONT data; right now they're just confusing for the user. We have separate config files for configuring experimental runs.
@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {
         cleaned_ch = FILTLONG(inter_ch)
+        if (human_read_filtering) {
+            minimap2_human_index = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
I don't think we should be checking references to personal buckets into mgs-workflow. This should pull from the configured index.
Adding minimap2 indices to the index workflow now.
Definitely shouldn't be hardcoding any path into the workflow like this, personal bucket or otherwise.
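A sketch of what pulling from the configured index could look like (the param name ref_dir, the index sub-path, and the process name MINIMAP2_HUMAN are assumptions about the repo layout, not its actual conventions):

```groovy
// Sketch only: derive the index location from the configured reference
// directory rather than hardcoding a personal bucket path.
// params.ref_dir and the sub-path below are assumed names.
if (human_read_filtering) {
    minimap2_human_index =
        "${params.ref_dir}/results/minimap2-human-index/chm13v2.0.mmi"
    human_ch = MINIMAP2_HUMAN(cleaned_ch, minimap2_human_index)
}
```

This keeps the workflow code free of environment-specific paths and lets each deployment point at its own index output.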
process {
    '''
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
ditto
process {
    """
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
ditto
I've now added minimap2 index generation in index.nf. Once that PR goes through, I'll update this PR and #199 to use the appropriate index locations.
@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.
Why should this only be applicable to ONT? Illumina data generated from Zephyr would also need it.
Well it shouldn't, but the code that does human read filtering is currently only able to take in ONT data.
"Currently only functional on ONT data"
@@ -0,0 +1,22 @@
+// Detection and removal of contaminant reads, using indices created for ONT cDNA data
This process doesn't do any read removal; indeed it doesn't return reads at all.
@@ -0,0 +1,21 @@
+// Return reads that did not align to reference as FASTQ (streamed version)
I'm not convinced this should be a separate process from the minimap process. See the new BOWTIE2 process for an example of how to chain an aligner and samtools together to get separate mapped and unmapped reads; this also makes testing easier since you can directly compare the input and output FASTQs.
If you definitely want a standalone samtools fastq process, I would make it more general than this: probably call it SAMTOOLS_FASTQ and allow the user to pass in arbitrary argument strings rather than hardcoding -n -f 4.
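A generalised process along those lines might look roughly like this (the label, channel shapes, and output naming are assumptions, not the repo's actual conventions):

```groovy
// Sketch only: a general SAMTOOLS_FASTQ process that takes an arbitrary
// argument string instead of hardcoding -n -f 4. All names are assumed.
process SAMTOOLS_FASTQ {
    label "samtools"
    input:
        tuple val(sample), path(sam)
        val(args)   // e.g. "-n -f 4" to extract unmapped reads
    output:
        tuple val(sample), path("${sample}_out.fastq.gz"), emit: output
    shell:
        '''
        samtools fastq !{args} !{sam} | gzip -c > !{sample}_out.fastq.gz
        '''
}
```

Passing the argument string through a val input keeps the process reusable for other flag combinations (mapped reads, read-pair splitting, etc.) without touching the process body.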
then {
    // Should run without failures
    assert process.success
    // Both @SQ headers and alignments should be present
I'd ideally like a test for something a bit more substantive than just "is it empty". (I don't know enough about your process to know what that more substantive test should be, though)
    // Should run without failures
    assert process.success

    // Output FASTQ ids should be identical to unmapped read ids in input SAM
This is more like it
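For reference, the id comparison that test describes can be sketched in plain Groovy roughly like this (variable names like samLines and fastqLines are assumptions; the real test would read them from the process output):

```groovy
// Sketch only: collect QNAMEs of unmapped reads (SAM FLAG bit 0x4 set)
// and compare them to the read ids in the output FASTQ.
def unmappedIds = samLines.findAll { line ->
    !line.startsWith("@") && (line.tokenize("\t")[1].toInteger() & 4)
}.collect { it.tokenize("\t")[0] } as Set

// Every 4th line of a FASTQ is a header: strip '@' and any comment suffix
def fastqIds = [] as Set
fastqLines.eachWithIndex { line, i ->
    if (i % 4 == 0) fastqIds << line.substring(1).tokenize()[0]
}

assert fastqIds == unmappedIds
```

Comparing the two id sets directly catches both dropped unmapped reads and leaked mapped reads, which an emptiness check would miss.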
@@ -56,7 +56,8 @@ workflow RUN {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
                 params.adapters, params.single_end,
-                params.ont, params.random_seed)
+                params.ont, params.random_seed,
+                params.human_read_filtering)
Order here doesn't match order in SUBSET_TRIM definition
@@ -38,7 +38,8 @@ workflow RUN_DEV_SE {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
                 params.adapters, params.single_end,
-                params.random_seed, params.ont)
+                params.ont, params.random_seed,
+                params.human_read_filtering)
Ditto
@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {
This change currently seems a bit pointless to me. As-is, all the workflow downstream of this point does with human reads is count them (classify with Kraken, then count with Bracken, then return count tables). It doesn't save them anywhere. This change basically does the same thing in a different, uglier way (we can count the human reads as the number of reads lost).
I assume the reasoning behind this is something to do with privacy, but since we aren't analysing the human reads beyond classifying them (which we need to do anyway to decide which ones to discard) I'm not sure it's getting you anything.
Maybe good to briefly chat about this point; I don't fully understand what you're referring to.
Closing this PR after conversation with @willbradshaw; will adopt some of the changes in #199 and a future extraviral reads PR.
This PR adds a human read filtering step, which we need for Zephyr analysis. As part of this, I'm adding minimap2 and a samtools filtering step as processes used in subsetTrim.
At the moment I'm still using a minimap2 reference that is in my own bucket. @harmonbhasin, let me know if I should already fix this in this PR; I can also fix it in a future PR. I'm hesitant to touch the reference workflow, as rerunning it takes a lot of time.