Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making PROFILE take in ONT data #199

Open
wants to merge 31 commits into
base: dev
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
bb278f3
Adding a manual flag for human read filtering.
simonleandergrimm Feb 8, 2025
0aab9dd
Adding stand-alone containers for minimap2 and samtools.
simonleandergrimm Feb 8, 2025
79cff17
Adding code for filtering out human reads on ont.
simonleandergrimm Feb 8, 2025
1ae7470
Added optional human read filtering to subsetTrim
simonleandergrimm Feb 8, 2025
9ff3db1
Added human_read_filtering param to subset_trim
simonleandergrimm Feb 8, 2025
2989cde
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 11, 2025
d8c6790
added comment to human read filtering.
simonleandergrimm Feb 11, 2025
acdafe9
dropping second human read filtering.
simonleandergrimm Feb 11, 2025
49af743
added dislcaimer re ONT
simonleandergrimm Feb 11, 2025
ca671f5
Added a test for minimap2
simonleandergrimm Feb 12, 2025
edde7be
Added a work in progress test for samtools. Doesn't yet work properly.
simonleandergrimm Feb 12, 2025
5950650
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 12, 2025
9707939
Added samtools and profile edits to make handling ONT data in PROFILE…
simonleandergrimm Feb 12, 2025
7652019
fixed human read filtering flag in run.config
simonleandergrimm Feb 12, 2025
a730568
Added proper tests and streaming.
simonleandergrimm Feb 12, 2025
3924195
added resource specification to samtools
simonleandergrimm Feb 12, 2025
259caec
adding comments to main.nf.test samtools
simonleandergrimm Feb 12, 2025
5b44d38
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 12, 2025
3808e9c
Merge branch 'simon-human-read-filtering' into simon-ont-profile-v2
simonleandergrimm Feb 12, 2025
347cb04
added streaming to samtools.
simonleandergrimm Feb 12, 2025
cb3afbb
Merge branch 'simon-human-read-filtering' into simon-ont-profile-v2
simonleandergrimm Feb 12, 2025
5bd7854
fixed testing for samtools
simonleandergrimm Feb 12, 2025
1e9f933
adding ribo index testing.
simonleandergrimm Feb 12, 2025
2730c11
Reverting the edits that were introduced through merging human_read_f…
simonleandergrimm Feb 14, 2025
683bd95
WIP change save, still need to fix reference generation.
simonleandergrimm Feb 14, 2025
6714709
Created custom container for samtools and minimap2
simonleandergrimm Feb 14, 2025
666b0bb
fixing minimap2 reference name.
simonleandergrimm Feb 15, 2025
af8eff0
fixed subset input params.
simonleandergrimm Feb 16, 2025
6eb0ba2
amended profile to use output of new minmap2 process.
simonleandergrimm Feb 17, 2025
64d649f
dropped samtools process and tests
simonleandergrimm Feb 17, 2025
085d8bd
resetting style of subset_trim input variable order.
simonleandergrimm Feb 17, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions configs/containers.config
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,12 @@ process {
withLabel: python {
container = "community.wave.seqera.io/library/python:3.13.1--d00663700fcc8bcf"
}
withLabel: minimap2 {
container = "staphb/minimap2:2.28"
}
withLabel: samtools {
container = "staphb/samtools:1.21"
}
withLabel: coreutils_file {
container = "community.wave.seqera.io/library/coreutils_file:ccfe471e6d036f54"
// Build with Seqera Containers
Expand Down
3 changes: 3 additions & 0 deletions configs/run.config
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,9 @@ params {
// Sequencing platform
ont = <TRUE OR FALSE BASED ON SEQUENCING PLATFORM> // Whether the sequencing is ONT (true) or Illumina (false)

// Human filtering
human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

// Directories
base_dir = <PATH TO YOUR DIRECTORY> // Parent for working and output directories (can be S3)
ref_dir = <PATH TO YOUR ORGS REFERENCE DIRECTORY> // Reference/index directory (generated by index workflow)
Expand Down
3 changes: 3 additions & 0 deletions configs/run_dev_se.config
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,9 @@ params {
// Sequencing platform
ont = false // Whether the sequencing is ONT (true) or Illumina (false)

// Human filtering
human_read_filtering = false // Whether to filter human reads

// Directories
base_dir = "s3://nao-mgs-simon/test_single_read" // Parent for working and output directories (can be S3)
ref_dir = "s3://nao-mgs-wb/index/20241209/output" // Reference/index directory (generated by index workflow)
Expand Down
22 changes: 22 additions & 0 deletions modules/local/minimap2/main.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
// Detection and removal of contaminant reads, using indices created for ONT cDNA data
process MINIMAP2_ONT {
label "large"
label "minimap2"
input:
tuple val(sample), path(reads)
path(contaminant_ref)
val(suffix)
output:
tuple val(sample), path("${sample}_${suffix}_minimap2.sam"), emit: sam
tuple val(sample), path("${sample}_in.fastq.gz"), emit: input
shell:
'''
# Define input
o=!{sample}_!{suffix}_minimap2.sam
ref=!{contaminant_ref}
# Run minimap2
zcat !{reads} | minimap2 -a ${ref} /dev/fd/0 > ${o}
# Link input to output for testing
ln -s !{reads} !{sample}_in.fastq.gz
'''
}
46 changes: 46 additions & 0 deletions modules/local/samtools/main.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
// Return reads that did not align to reference as FASTQ (streamed version)
process SAMTOOLS_FILTER {
label "samtools"
label "small"
input:
tuple val(sample), path(sam)
val(suffix)
output:
tuple val(sample), path("${sample}_${suffix}.fastq.gz"), emit: reads
tuple val(sample), path("${sample}_in.sam"), emit: input
shell:
'''
# Define output
out=!{sample}_!{suffix}.fastq.gz
var="fastq -n -f 4"
# Execute samtools
cat !{sam} | samtools ${var} - | gzip > ${out}
# Link input to output for testing
ln -s !{sam} !{sample}_in.sam
'''
}

// Return aligned and unaligned reads separately as FASTQs
process SAMTOOLS_SEPARATE {
Copy link
Contributor

@willbradshaw willbradshaw Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This process is a superset of SAMTOOLS_FILTER. You don't need both.

(I would implement this as two calls to a general SAMTOOLS_FASTQ process with different input parameters, followed by a channel merge.)

label "samtools"
input:
tuple val(sample), path(sam)
val(suffix)
output:
tuple val(sample), path("${sample}_${suffix}_samtools_match.fastq.gz"), emit: match
tuple val(sample), path("${sample}_${suffix}_samtools_nomatch.fastq.gz"), emit: nomatch
tuple val(sample), path("${sample}_in.sam"), emit: input
shell:
'''
# Define output
om=!{sample}_!{suffix}_samtools_match.fastq.gz
on=!{sample}_!{suffix}_samtools_nomatch.fastq.gz
om_var="fastq -n -F 4"
on_var="fastq -n -f 4"
# Execute samtools
cat !{sam} | samtools ${om_var} - | gzip > ${om}
cat !{sam} | samtools ${on_var} - | gzip > ${on}
# Link input to output for testing
ln -s !{sam} !{sample}_in.sam
'''
}
1 change: 1 addition & 0 deletions nf-test.config
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ config {
load "[email protected]"
load "[email protected]"
load "[email protected]"
load "[email protected]"
}

}
17 changes: 14 additions & 3 deletions subworkflows/local/profile/main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,11 @@ include { ADD_FIXED_COLUMN as ADD_BRACKEN_NORIBO } from "../../../modules/local/
include { CONCATENATE_TSVS as CONCATENATE_KRAKEN } from "../../../modules/local/concatenateTsvs"
include { CONCATENATE_TSVS as CONCATENATE_BRACKEN } from "../../../modules/local/concatenateTsvs"

if (params.ont) {
include { MINIMAP2_ONT as MINIMAP2_RIBO } from "../../../modules/local/minimap2"
include { SAMTOOLS_SEPARATE } from "../../../modules/local/samtools"
}

/****************
| MAIN WORKFLOW |
****************/
Expand All @@ -27,13 +32,19 @@ workflow PROFILE {
ref_dir
min_kmer_fraction
k
bbduk_suffix
ribo_suffix
bracken_threshold
single_end
main:
// Separate ribosomal reads
ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, bbduk_suffix, !single_end)
if (params.ont) {
ribo_path = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another reference that needs to be created in the index workflow and referenced in the index

mapped_ch = MINIMAP2_RIBO(reads_ch, ribo_path, ribo_suffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking from ignorance here, but why should BBDuk not work on ONT data? It's calculating a fraction of matching k-mers so read length per se shouldn't be fatal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonleandergrimm I'd still like this to be addressed before we move forward here, I remain uneasy about switching to an alignment-based approach without having a good understanding of how this might affect the results.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willbradshaw Can you say more about what your worry is? Mostly that minimap2 is more resource-intensive? I think the files are small enough that this doesn't matter that much.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On priors I'd assume minimap2 to be better than bbduk (I'd assume the latter needs a bunch of playing around to find the fraction of matching k-mers that gives equivalent performance to minimap2).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I'd be pretty keen to get the ONT version of the pipeline up and running fast and do optimization later (especially because this part of the pipeline shouldn't affect the Illumina runs).

ribo_ch = SAMTOOLS_SEPARATE(mapped_ch, ribo_suffix)
} else {
ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, ribo_suffix, !single_end)
}
// Run taxonomic profiling separately on ribo and non-ribo reads
tax_ribo_ch = TAXONOMY_RIBO(ribo_ch.match, kraken_db_ch, "D", bracken_threshold, single_end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I expect Kraken to do a bad job on generating a taxonomic profile here

But I still think this PR is in the right direction: if we want to do better taxonomic profiling for ONT data that should be a separately-prioritized followup

tax_noribo_ch = TAXONOMY_NORIBO(ribo_ch.nomatch, kraken_db_ch, "D", bracken_threshold, single_end)
Expand Down
9 changes: 9 additions & 0 deletions subworkflows/local/subsetTrim/main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,11 @@ include { SUBSET_READS_SINGLE_TARGET as SUBSET_SINGLE } from "../../../modules/l
include { SUBSET_READS_PAIRED_TARGET as SUBSET_PAIRED } from "../../../modules/local/subsetReads"
include { FASTP } from "../../../modules/local/fastp"
include { FILTLONG } from "../../../modules/local/filtlong"
include { MINIMAP2_ONT as MINIMAP2_HUMAN } from "../../../modules/local/minimap2"
include { SAMTOOLS_FILTER } from "../../../modules/local/samtools"
include { INTERLEAVE_FASTQ } from "../../../modules/local/interleaveFastq"


/***********
| WORKFLOW |
***********/
Expand All @@ -19,6 +22,7 @@ workflow SUBSET_TRIM {
adapter_path
single_end
ont
human_read_filtering
random_seed
main:
if (single_end) {
Expand All @@ -30,6 +34,11 @@ workflow SUBSET_TRIM {
}
if (ont) {
cleaned_ch = FILTLONG(inter_ch)
if (human_read_filtering) {
minimap2_human_index = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
minimap2_ch = MINIMAP2_HUMAN(cleaned_ch, minimap2_human_index, "human")
cleaned_ch = SAMTOOLS_FILTER(minimap2_ch.sam, "no-human")
}
} else {
cleaned_ch = FASTP(inter_ch, adapter_path, !single_end)
}
Expand Down
69 changes: 69 additions & 0 deletions tests/modules/local/minimap2/main.nf.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
nextflow_process {

name "Test process MINIMAP2_ONT"
script "modules/local/minimap2/main.nf"
process "MINIMAP2_ONT"
config "tests/run_dev_se.config"
tag "module"
tag "minimap2"

setup {
run("LOAD_SAMPLESHEET") {
script "subworkflows/local/loadSampleSheet/main.nf"
process {
"""
input[0] = "${projectDir}/test-data/ont-samplesheet.csv"
input[1] = true
"""
}
}
}

test("When run against human index, should run without failures and return properly formatted SAM") {
tag "expect_success"
tag "single_end"
when {
params {}
process {
'''
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
input[2] = "human"
'''
}
}
then {
// Should run without failures
assert process.success
// Both @SQ headers and alignments should be present
def nHeaders = ["bash", "-c", "cat " + process.out.sam[0][1] + " | grep -c '^@SQ'"].execute().text.trim() as Integer
def nAlignments = ["bash", "-c", "cat " + process.out.sam[0][1] + " | grep -v '^@' | wc -l"].execute().text.trim() as Integer
assert nHeaders > 0
assert nAlignments > 0
}
}

test("When run against ribo index, should run without failures and return properly formatted SAM") {
tag "expect_success"
tag "single_end"
when {
params {}
process {
'''
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-ribo-index/ribo-ref-concat-unique.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

input[2] = "ribo"
'''
}
}
then {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented in another PR, I'd prefer a test of the actual content, not just that there is content.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now fixed.

// Should run without failures
assert process.success
// Both @SQ headers and alignments should be present
def nHeaders = ["bash", "-c", "cat " + process.out.sam[0][1] + " | grep -c '^@SQ'"].execute().text.trim() as Integer
def nAlignments = ["bash", "-c", "cat " + process.out.sam[0][1] + " | grep -v '^@' | wc -l"].execute().text.trim() as Integer
assert nHeaders > 0
assert nAlignments > 0
}
}
}
61 changes: 61 additions & 0 deletions tests/modules/local/samtools/filter.nf.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
nextflow_process {

name "Test process SAMTOOLS_FILTER"
script "modules/local/samtools/main.nf"
process "SAMTOOLS_FILTER"
config "tests/run_dev_se.config"
tag "module"
tag "samtools"

setup {
run("LOAD_SAMPLESHEET") {
script "subworkflows/local/loadSampleSheet/main.nf"
process {
"""
input[0] = "${projectDir}/test-data/ont-samplesheet.csv"
input[1] = true
"""
}
}
run("MINIMAP2_ONT") {
script "modules/local/minimap2/main.nf"
process {
"""
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
input[2] = "human"
"""
}
}
}

test("When run on SAM file, should only return unaligned reads") {
tag "expect_success"
tag "single_end"
when {
params {}
process {
'''
input[0] = MINIMAP2_ONT.out.sam
input[1] = "no-human"
'''
}
}
then {
// Should run without failures
assert process.success

// Output FASTQ ids should be identical to unmapped read ids in input SAM
def fastq_out = path(process.out.reads[0][1]).fastq
def read_ids_out = fastq_out.readNames.toSet()

def samlines = sam(process.out.input[0][1]).getSamLines()
def unmapped_read_ids = samlines
.findAll { line -> line.split('\t')[1] == '4' } // Only keep lines where flag = 4
.collect { line -> line.split('\t')[0] } // Get read IDs
.toSet() // Convert to Set to remove duplicates

assert unmapped_read_ids == read_ids_out
}
}
}
71 changes: 71 additions & 0 deletions tests/modules/local/samtools/separate.nf.test
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
nextflow_process {

name "Test process SAMTOOLS_SEPARATE"
script "modules/local/samtools/main.nf"
process "SAMTOOLS_SEPARATE"
config "tests/run_dev_se.config"
tag "module"
tag "samtools"

setup {
run("LOAD_SAMPLESHEET") {
script "subworkflows/local/loadSampleSheet/main.nf"
process {
"""
input[0] = "${projectDir}/test-data/ont-samplesheet.csv"
input[1] = true
"""
}
}
run("MINIMAP2_ONT") {
script "modules/local/minimap2/main.nf"
process {
"""
input[0] = LOAD_SAMPLESHEET.out.samplesheet
input[1] = "s3://nao-mgs-simon/ont-indices/2024-12-14/minimap2-human-index/chm13v2.0.mmi"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

input[2] = "human"
"""
}
}
}

test("When run on a SAM file, should return aligned and unaligned reads separately") {
tag "expect_success"
tag "single_end"
when {
params {}
process {
'''
input[0] = MINIMAP2_ONT.out.sam
input[1] = "no-human"
'''
}
}
then {
// Should run without failures
assert process.success

// The two FASTQ output files should contain read ids of mapped and unmapped reads, respectively
def fastq_match = path(process.out.match[0][1]).fastq
def fastq_nomatch = path(process.out.nomatch[0][1]).fastq
def read_ids_match = fastq_match.readNames.toSet()
def read_ids_nomatch = fastq_nomatch.readNames.toSet()

def samlines = sam(process.out.input[0][1]).getSamLines()
def unmapped_read_ids = samlines
.findAll { line -> line.split('\t')[1] == '4' } // Only keep lines where flag = 4
.collect { line -> line.split('\t')[0] } // Get read IDs
.toSet() // Convert to Set to remove duplicates
def mapped_read_ids = samlines
.findAll { line -> line.split('\t')[1] != '4' } // Only keep lines where flag != 4
.collect { line -> line.split('\t')[0] } // Get read IDs
.toSet()

assert read_ids_nomatch == unmapped_read_ids
assert read_ids_match == mapped_read_ids

// The two FASTQ files should not contain any overlapping read ids
assert read_ids_match.intersect(read_ids_nomatch).size() == 0
}
}
}
3 changes: 3 additions & 0 deletions tests/run.config
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,9 @@ params {
// Sequencing platform
ont = false // Whether the sequencing is ONT (true) or Illumina (false)

// Human filtering
human_read_filtering = false // Whether to filter human reads. Only applicable to ONT.

// Directories
base_dir = "./" // Parent for working and output directories (can be S3)
ref_dir = "s3://nao-testing/index/20250130/output/" // Reference/index directory (generated by index workflow)
Expand Down
Loading