Adding human read filtering to subsetTrim #198

Closed

Conversation

@simonleandergrimm (Collaborator) commented Feb 12, 2025

This PR adds a human read filtering step, which we need for Zephyr analysis. As part of this, I'm adding minimap2 and a samtools filtering step as processes which are used in subsetTrim.

At the moment I'm still using a minimap2 reference that lives in my own bucket. @harmonbhasin let me know if I should fix this in this PR; I can also fix it in a future PR. I'm hesitant to touch the reference workflow, as rerunning it takes a lot of time.

@harmonbhasin (Collaborator) left a comment

LGTM. Before merging, I would also add the creation of the human reference to the index workflow.
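For concreteness, such an index-workflow process could look roughly like the sketch below; the process name, input channel, label, and the map-ont preset are assumptions for illustration, not the repository's actual code:

// Hypothetical sketch of an index-workflow process that builds the minimap2
// human index from a genome FASTA. The output filename follows the CHM13v2.0
// reference used elsewhere in this PR.
process MINIMAP2_HUMAN_INDEX {
    label "minimap2"

    input:
        path(genome_fasta)

    output:
        path("chm13v2.0.mmi"), emit: index

    script:
        """
        # Build an ONT-preset minimap2 index from the human genome
        minimap2 -x map-ont -d chm13v2.0.mmi ${genome_fasta}
        """
}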

@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

Member

If this only applies to ONT, better to include that in the config name, so no one is surprised later when it doesn't do anything on Illumina data.
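For example (a hypothetical rename, not necessarily the name ultimately chosen):

// Human filtering (ONT runs only for now)
ont_human_read_filtering = false // Whether to filter human reads; currently has no effect on Illumina data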

Contributor

My preference would probably be to keep this (and honestly ont too) out of run.config until the main run workflow can actually run ONT data; right now they're just confusing for the user. We have separate config files for configuring experimental runs.

@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {
         cleaned_ch = FILTLONG(inter_ch)
+        if (human_read_filtering) {
+            minimap2_human_index = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

I don't think we should be checking references to personal buckets into mgs-workflow. This should pull from the configured index.
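A minimal sketch of what that could look like, assuming the index workflow publishes the .mmi under the configured reference directory (params.ref_dir and the subdirectory layout are assumptions):

// Hypothetical: resolve the human index from the reference directory
// produced by the index workflow instead of a hardcoded bucket path.
minimap2_human_index = "${params.ref_dir}/results/minimap2-human-index/chm13v2.0.mmi"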

Collaborator Author

Adding minimap2 indices to the index workflow now.

Contributor

Definitely shouldn't be hardcoding any path into the workflow like this, personal bucket or otherwise.

process {
    '''
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

ditto

process {
    """
    input[0] = LOAD_SAMPLESHEET.out.samplesheet
    input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"

Member

ditto

@simonleandergrimm (Collaborator Author)

I've now added minimap2 index generation in index.nf. Once that PR goes through, I'll update this PR and #199 to use the appropriate index locations.

@@ -8,6 +8,9 @@ params {
     // Sequencing platform
     ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

+    // Human filtering
+    human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

Contributor

Why should this only be applicable to ONT? Illumina data generated from Zephyr would also need it.

Collaborator Author

Well, it shouldn't be, but the code that does the human read filtering can currently only take ONT data.


@jeffkaufman (Member) commented Feb 14, 2025

"Currently only functional on ONT data"

@@ -0,0 +1,22 @@
+// Detection and removal of contaminant reads, using indices created for ONT cDNA data

Contributor

This process doesn't do any read removal; indeed it doesn't return reads at all.

@@ -0,0 +1,21 @@
+// Return reads that did not align to reference as FASTQ (streamed version)

Contributor

I'm not convinced this should be a separate process from the minimap process. See the new BOWTIE2 process for an example of how to chain an aligner and samtools together to get separate mapped and unmapped reads; this also makes testing easier since you can directly compare the input and output FASTQs.

If you definitely want a standalone samtools fastq process, I would make it more general than this, probably call it SAMTOOLS_FASTQ and allow the user to pass in arbitrary argument strings rather than hardcoding -n -f 4.
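As a rough illustration of the chained pattern being suggested (this is not the repository's BOWTIE2 process; the process name, label, output names, and the map-ont preset are assumptions):

// Hypothetical sketch: align ONT reads against the human index and emit
// mapped and unmapped reads as separate FASTQs from a single process, so
// input and output FASTQs can be compared directly in tests.
process MINIMAP2_AND_FILTER {
    label "minimap2_samtools"  // assumed container label

    input:
        tuple val(sample), path(reads)
        path(index)

    output:
        tuple val(sample), path("${sample}_nohuman.fastq.gz"), emit: unmapped
        tuple val(sample), path("${sample}_human.fastq.gz"), emit: mapped
        tuple val(sample), path("${sample}.sam.gz"), emit: sam

    script:
        """
        minimap2 -ax map-ont ${index} ${reads} > ${sample}.sam
        # -f 4 keeps unmapped reads; -F 0x904 keeps primary mapped reads;
        # -n avoids appending /1 or /2 to read names
        samtools fastq -n -f 4 ${sample}.sam | gzip -c > ${sample}_nohuman.fastq.gz
        samtools fastq -n -F 0x904 ${sample}.sam | gzip -c > ${sample}_human.fastq.gz
        gzip ${sample}.sam
        """
}

If a standalone process is kept instead, taking the samtools arguments as a val input (e.g. "-n -f 4") would avoid hardcoding human-filtering flags into a process named after a general-purpose tool.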

then {
    // Should run without failures
    assert process.success
    // Both @SQ headers and alignments should be present

Contributor

I'd ideally like a test for something a bit more substantive than just "is it empty". (I don't know enough about your process to know what that more substantive test should be, though)
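For illustration only, one possibility along those lines, assuming the process emits a SAM file as the second element of its output tuple and that the test input FASTQ is uncompressed (the channel name and input path are placeholders):

then {
    assert process.success
    // Hypothetical: in SAM output mode, minimap2 reports every input read
    // whether or not it mapped, so compare read IDs rather than only
    // checking that the file is non-empty.
    def samIds = new File(process.out.sam[0][1]).readLines()
            .findAll { !it.startsWith("@") }
            .collect { it.split("\t")[0] } as Set
    def fastqLines = new File("test_reads.fastq").readLines()  // placeholder path
    def fastqIds = (0..<fastqLines.size()).step(4)
            .collect { fastqLines[it].tokenize()[0].substring(1) } as Set
    assert samIds == fastqIds
}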

// Should run without failures
assert process.success

// Output FASTQ ids should be identical to unmapped read ids in input SAM

Contributor

This is more like it

@@ -56,7 +56,8 @@ workflow RUN {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
         params.adapters, params.single_end,
-        params.ont, params.random_seed)
+        params.ont, params.random_seed,
+        params.human_read_filtering)

Contributor

Order here doesn't match order in SUBSET_TRIM definition
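For reference, a positional call like this has to match the order of the workflow's take block exactly; the actual order in the subsetTrim definition isn't shown in this thread, so the following is purely illustrative:

// Hypothetical take block: whatever the real order is, both RUN and
// RUN_DEV_SE must pass their arguments in exactly this sequence.
workflow SUBSET_TRIM {
    take:
        samplesheet_ch
        n_reads
        adapters
        single_end
        ont
        random_seed
        human_read_filtering
    main:
        subset_ch = samplesheet_ch  // placeholder for the real subsetting/trimming logic
    emit:
        subset = subset_ch
}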

@@ -38,7 +38,8 @@ workflow RUN_DEV_SE {
     // Subset reads to target number, and trim adapters
     SUBSET_TRIM(samplesheet_ch, params.n_reads_profile,
         params.adapters, params.single_end,
-        params.random_seed, params.ont)
+        params.ont, params.random_seed,
+        params.human_read_filtering)

Contributor

Ditto

@@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
     }
     if (ont) {

Contributor

This change currently seems a bit pointless to me. As-is, all the workflow downstream of this point does with human reads is count them (classify with Kraken, then count with Bracken, then return count tables). It doesn't save them anywhere. This change basically does the same thing in a different, uglier way (we can count the human reads as the number of reads lost).

I assume the reasoning behind this is something to do with privacy, but since we aren't analysing the human reads beyond classifying them (which we need to do anyway to decide which ones to discard) I'm not sure it's getting you anything.

Collaborator Author

Maybe good to briefly chat about this point; I don't fully understand what you're referring to.

@simonleandergrimm changed the base branch from dev to simon-minimap-indices, February 14, 2025 15:10

@simonleandergrimm (Collaborator Author)

Closing this PR after a conversation with @willbradshaw; I'll port some of these changes to #199 and a future extraviral reads PR.
