Making PROFILE take in ONT data #199

Open
wants to merge 31 commits into base: dev

Commits (31)
bb278f3
Adding a manual flag for human read filtering.
simonleandergrimm Feb 8, 2025
0aab9dd
Adding stand-alone containers for minimap2 and samtools.
simonleandergrimm Feb 8, 2025
79cff17
Adding code for filtering out human reads on ont.
simonleandergrimm Feb 8, 2025
1ae7470
Added optional human read filtering to subsetTrim
simonleandergrimm Feb 8, 2025
9ff3db1
Added human_read_filtering param to subset_trim
simonleandergrimm Feb 8, 2025
2989cde
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 11, 2025
d8c6790
added comment to human read filtering.
simonleandergrimm Feb 11, 2025
acdafe9
dropping second human read filtering.
simonleandergrimm Feb 11, 2025
49af743
added disclaimer re ONT
simonleandergrimm Feb 11, 2025
ca671f5
Added a test for minimap2
simonleandergrimm Feb 12, 2025
edde7be
Added a work in progress test for samtools. Doesn't yet work properly.
simonleandergrimm Feb 12, 2025
5950650
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 12, 2025
9707939
Added samtools and profile edits to make handling ONT data in PROFILE…
simonleandergrimm Feb 12, 2025
7652019
fixed human read filtering flag in run.config
simonleandergrimm Feb 12, 2025
a730568
Added proper tests and streaming.
simonleandergrimm Feb 12, 2025
3924195
added resource specification to samtools
simonleandergrimm Feb 12, 2025
259caec
adding comments to main.nf.test samtools
simonleandergrimm Feb 12, 2025
5b44d38
Merge branch 'dev' into simon-human-read-filtering
simonleandergrimm Feb 12, 2025
3808e9c
Merge branch 'simon-human-read-filtering' into simon-ont-profile-v2
simonleandergrimm Feb 12, 2025
347cb04
added streaming to samtools.
simonleandergrimm Feb 12, 2025
cb3afbb
Merge branch 'simon-human-read-filtering' into simon-ont-profile-v2
simonleandergrimm Feb 12, 2025
5bd7854
fixed testing for samtools
simonleandergrimm Feb 12, 2025
1e9f933
adding ribo index testing.
simonleandergrimm Feb 12, 2025
2730c11
Reverting the edits that were introduced through merging human_read_f…
simonleandergrimm Feb 14, 2025
683bd95
WIP change save, still need to fix reference generation.
simonleandergrimm Feb 14, 2025
6714709
Created custom container for samtools and minimap2
simonleandergrimm Feb 14, 2025
666b0bb
fixing minimap2 reference name.
simonleandergrimm Feb 15, 2025
af8eff0
fixed subset input params.
simonleandergrimm Feb 16, 2025
6eb0ba2
amended profile to use output of new minimap2 process.
simonleandergrimm Feb 17, 2025
64d649f
dropped samtools process and tests
simonleandergrimm Feb 17, 2025
085d8bd
resetting style of subset_trim input variable order.
simonleandergrimm Feb 17, 2025
10 changes: 10 additions & 0 deletions configs/containers.config
@@ -90,6 +90,16 @@ process {
        //- bioconda::samtools=1.21
        //- conda-forge::gzip=1.13
    }
    withLabel: minimap2_samtools {
        container = "community.wave.seqera.io/library/minimap2_samtools:03e1e7cf6ec6695d"
        // Built with Seqera Containers
        //channels:
        //- conda-forge
        //- bioconda
        //dependencies:
        //- bioconda::minimap2=2.28
        //- bioconda::samtools=1.21
    }
    withLabel: bbtools_samtools {
        container = "community.wave.seqera.io/library/bbmap_samtools_gzip:fc8114c072e9de02"
        // Built with Seqera Containers
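For context, any process that declares the matching label is routed into this minimap2/samtools image by the withLabel selector above. A minimal illustrative sketch (the process name and script below are hypothetical, not part of this PR):

// Hypothetical example: declaring the label is all a process needs to do
// to run inside the minimap2_samtools container defined in containers.config.
process CHECK_ALIGNER_VERSIONS {
    label "minimap2_samtools"
    output:
        path("versions.txt")
    shell:
    '''
    minimap2 --version > versions.txt
    samtools --version | head -n 1 >> versions.txt
    '''
}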
41 changes: 41 additions & 0 deletions modules/local/minimap2/main.nf
@@ -0,0 +1,41 @@
// Run minimap2 on a single input FASTQ file and partition reads based on alignment status
process MINIMAP2 {
    label "large"
    label "minimap2_samtools"
    input:
        tuple val(sample), path(reads)
        path(index_dir)
        val(suffix)
        val(remove_sq)
    output:
        tuple val(sample), path("${sample}_${suffix}_minimap2_mapped.sam.gz"), emit: sam
        tuple val(sample), path("${sample}_${suffix}_minimap2_mapped.fastq.gz"), emit: reads_mapped
        tuple val(sample), path("${sample}_${suffix}_minimap2_unmapped.fastq.gz"), emit: reads_unmapped
        tuple val(sample), path("${sample}_${suffix}_minimap2_in.fastq.gz"), emit: input
    shell:
        '''
        set -euo pipefail
        # Prepare inputs
        idx="!{index_dir}/mm2_index.mmi"
        sam="!{sample}_!{suffix}_minimap2_mapped.sam.gz"
        al="!{sample}_!{suffix}_minimap2_mapped.fastq.gz"
        un="!{sample}_!{suffix}_minimap2_unmapped.fastq.gz"

        # Run pipeline
        # Outputs a SAM file for all reads, which is then partitioned based on alignment status
        # - First branch (samtools view -u -f 4 -) filters SAM to unaligned reads and saves FASTQ
        # - Second branch (samtools view -u -F 4 -) filters SAM to aligned reads and saves FASTQ
        # - Third branch (samtools view -h -F 4 -) also filters SAM to aligned reads and saves SAM
        zcat !{reads} \
            | minimap2 -a ${idx} /dev/fd/0 \
            | tee \
                >(samtools view -u -f 4 - \
                    | samtools fastq - | gzip -c > ${un}) \
                >(samtools view -u -F 4 - \
                    | samtools fastq - | gzip -c > ${al}) \
            | samtools view -h -F 4 - \
            !{ remove_sq ? "| grep -v '^@SQ'" : "" } | gzip -c > ${sam}
        # Link input to output for testing
        ln -s !{reads} !{sample}_!{suffix}_minimap2_in.fastq.gz
        '''
}
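To make the module's interface concrete, here is a hedged usage sketch; the workflow name, sample tuple, and index path are illustrative, and only the argument order and emit names come from the process above:

// Illustrative invocation of MINIMAP2 (not part of this PR).
include { MINIMAP2 } from "./modules/local/minimap2"

workflow EXAMPLE {
    // (sample, reads) tuples, as expected by the first input channel
    reads_ch = Channel.of(["sample1", file("sample1.fastq.gz")])
    // The index directory must contain mm2_index.mmi; the suffix tags output names;
    // the final flag controls whether @SQ header lines are stripped from the SAM
    MINIMAP2(reads_ch, file("ref/mm2-ribo-index"), "ribo", false)
    // Named outputs: sam, reads_mapped, reads_unmapped, input
    MINIMAP2.out.reads_unmapped.view { sample, fq -> "${sample}: ${fq}" }
}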
1 change: 1 addition & 0 deletions nf-test.config
@@ -9,6 +9,7 @@ config {
load "[email protected]"
load "[email protected]"
load "[email protected]"
load "[email protected]"
}

}
31 changes: 26 additions & 5 deletions subworkflows/local/profile/main.nf
@@ -15,6 +15,8 @@ include { ADD_FIXED_COLUMN as ADD_KRAKEN_NORIBO } from "../../../modules/local/a
include { ADD_FIXED_COLUMN as ADD_BRACKEN_NORIBO } from "../../../modules/local/addFixedColumn"
include { CONCATENATE_TSVS as CONCATENATE_KRAKEN } from "../../../modules/local/concatenateTsvs"
include { CONCATENATE_TSVS as CONCATENATE_BRACKEN } from "../../../modules/local/concatenateTsvs"
include { MINIMAP2 as MINIMAP2_RIBO } from "../../../modules/local/minimap2"


/****************
| MAIN WORKFLOW |
@@ -27,16 +29,35 @@ workflow PROFILE {
        ref_dir
        min_kmer_fraction
        k
        bbduk_suffix
        ribo_suffix
        bracken_threshold
        single_end
        ont
    main:
        // Separate ribosomal reads
        ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
        ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, bbduk_suffix, !single_end)
        if (ont) {
            ribo_ref = "${projectDir}/results/mm2-ribo-index"
            mapped_ch = MINIMAP2_RIBO(reads_ch, ribo_ref, ribo_suffix, false)
            ribo_ch = mapped_ch
        } else {
            ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
            ribo_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, ribo_suffix, !single_end)
        }
        // Run taxonomic profiling separately on ribo and non-ribo reads
        tax_ribo_ch = TAXONOMY_RIBO(ribo_ch.match, kraken_db_ch, "D", bracken_threshold, single_end)
        tax_noribo_ch = TAXONOMY_NORIBO(ribo_ch.nomatch, kraken_db_ch, "D", bracken_threshold, single_end)
        tax_ribo_ch = TAXONOMY_RIBO(
            ont ? ribo_ch.reads_mapped : ribo_ch.match,
Review comment (Collaborator, PR author):
not particularly happy with this, but I don't want to call the output of minimap2 "match" and "no_match"

Review comment (Contributor):

I don't think this is terrible, but given that the two branches of the if statement produce quite different things (k-mer matches vs alignments) it might be better to give them different names in the if statement, then define some new variables (maybe ribo_in and noribo_in) to hold them. Then the calls to TAXONOMY can take ribo_in and noribo_in as their first input. (See the sketch after this diff.)

Also, minor, but I'd prefer if we went back to having multiple arguments per line; I'm not a fan of these very long one-arg-per-line function calls.

            kraken_db_ch,
            "D",
            bracken_threshold,
            single_end
        )
        tax_noribo_ch = TAXONOMY_NORIBO(
            ont ? ribo_ch.reads_unmapped : ribo_ch.nomatch,
            kraken_db_ch,
            "D",
            bracken_threshold,
            single_end
        )
        // Add ribosomal status to output TSVs
        kr_ribo = ADD_KRAKEN_RIBO(tax_ribo_ch.kraken_reports, "ribosomal", "TRUE", "ribo")
        kr_noribo = ADD_KRAKEN_NORIBO(tax_noribo_ch.kraken_reports, "ribosomal", "FALSE", "noribo")
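As a concrete rendering of the review suggestion above, here is a hedged sketch of the proposed refactor; the names ribo_in and noribo_in come from the reviewer's comment, the rest mirrors the diff, and none of this is code from the PR itself:

// Sketch of the suggested refactor: each branch names its own outputs,
// then both TAXONOMY calls consume the shared ribo_in / noribo_in channels.
if (ont) {
    ribo_ref = "${projectDir}/results/mm2-ribo-index"
    mapped_ch = MINIMAP2_RIBO(reads_ch, ribo_ref, ribo_suffix, false)
    ribo_in = mapped_ch.reads_mapped
    noribo_in = mapped_ch.reads_unmapped
} else {
    ribo_path = "${ref_dir}/results/ribo-ref-concat.fasta.gz"
    bbduk_ch = BBDUK(reads_ch, ribo_path, min_kmer_fraction, k, ribo_suffix, !single_end)
    ribo_in = bbduk_ch.match
    noribo_in = bbduk_ch.nomatch
}
// Multiple arguments per line, as the reviewer prefers
tax_ribo_ch = TAXONOMY_RIBO(ribo_in, kraken_db_ch, "D", bracken_threshold, single_end)
tax_noribo_ch = TAXONOMY_NORIBO(noribo_in, kraken_db_ch, "D", bracken_threshold, single_end)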
71 changes: 71 additions & 0 deletions tests/modules/local/minimap2/main.nf.test
@@ -0,0 +1,71 @@
nextflow_process {
    name "Test process MINIMAP2"
    script "modules/local/minimap2/main.nf"
    process "MINIMAP2"
    config "tests/run_dev_se.config"
    tag "module"
    tag "minimap2"

    setup {
        run("LOAD_SAMPLESHEET") {
            script "subworkflows/local/loadSampleSheet/main.nf"
            process {
                """
                input[0] = "${projectDir}/test-data/ont-samplesheet.csv"
                input[1] = true
                """
            }
        }
    }

    test("Should correctly partition reads based on alignment status") {
        tag "expect_success"
        tag "single_end"
        when {
            params {}
            process {
                '''
                input[0] = LOAD_SAMPLESHEET.out.samplesheet
                input[1] = "${params.ref_dir}/results/mm2-ribo-index"
                input[2] = "test"
                input[3] = false
                '''
            }
        }
        then {
            // Should run without failures
            assert process.success

            // Check all expected output files exist
            assert path(process.out.sam[0][1]).exists()
            assert path(process.out.reads_mapped[0][1]).exists()
            assert path(process.out.reads_unmapped[0][1]).exists()
            assert path(process.out.input[0][1]).exists()

            // Get read IDs from output FASTQ files
            def fastq_mapped = path(process.out.reads_mapped[0][1]).fastq
            def fastq_unmapped = path(process.out.reads_unmapped[0][1]).fastq
            def fastq_read_ids_mapped = fastq_mapped.readNames.toSet()
            def fastq_read_ids_unmapped = fastq_unmapped.readNames.toSet()

            // Get read IDs from SAM file
            def samlines = sam(process.out.sam[0][1]).getSamLines()
            def sam_read_ids_mapped = samlines
                .collect { line -> line.split('\t')[0] } // Get read IDs
                .toSet()

            // Get input read IDs
            def input_reads = path(process.out.input[0][1]).fastq.readNames.toSet()

            // Verify read partitioning is correct
            assert sam_read_ids_mapped == fastq_read_ids_mapped

            // Verify no overlapping reads between mapped and unmapped sets
            assert fastq_read_ids_mapped.intersect(fastq_read_ids_unmapped).size() == 0

            // Verify input FASTQ contains all reads
            assert input_reads == fastq_read_ids_mapped + fastq_read_ids_unmapped
        }
    }
}

2 changes: 1 addition & 1 deletion workflows/run.nf
@@ -63,7 +63,7 @@ workflow RUN {

// Profile ribosomal and non-ribosomal reads of the subset adapter-trimmed reads
PROFILE(SUBSET_TRIM.out.trimmed_subset_reads, kraken_db_path, params.ref_dir, "0.4", "27", "ribo",
params.bracken_threshold, params.single_end)
params.bracken_threshold, params.single_end, params.ont)

// Publish results
params_str = JsonOutput.prettyPrint(JsonOutput.toJson(params))
2 changes: 1 addition & 1 deletion workflows/run_dev_se.nf
@@ -45,7 +45,7 @@ workflow RUN_DEV_SE {

// Profile ribosomal and non-ribosomal reads of the subset adapter-trimmed reads
PROFILE(SUBSET_TRIM.out.trimmed_subset_reads, kraken_db_path, params.ref_dir, "0.4", "27", "ribo",
params.bracken_threshold, params.single_end)
params.bracken_threshold, params.single_end, params.ont)

// Publish results
params_str = JsonOutput.prettyPrint(JsonOutput.toJson(params))
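Both entry workflows now forward params.ont into PROFILE. The flag's declaration is not part of this diff; a plausible sketch of how it would be defined in the pipeline configuration (default value assumed, not taken from the PR):

// Assumed declaration (not shown in this PR): a boolean switch that PROFILE
// uses to pick minimap2-based (ONT) vs BBDuk-based ribosomal read separation.
params {
    ont = false   // set to true when the input reads are Oxford Nanopore data
}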