Skip to content

Commit

Permalink
Truongphikt add module calling from KTest-VN/calling@e461b84
Browse files Browse the repository at this point in the history
  • Loading branch information
actions-user committed Jul 6, 2024
1 parent 6d34875 commit 51a3579
Show file tree
Hide file tree
Showing 21 changed files with 538 additions and 0 deletions.
1 change: 1 addition & 0 deletions modules/ktest/calling/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
bin/__pycache__
167 changes: 167 additions & 0 deletions modules/ktest/calling/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
# Calling module

Calling variants from mark duplicated bam file ([PRS-62](https://ktest-dattn.atlassian.net/browse/PRS-62)).

------------------
# 1. Usages
Input params (view `conf/input.config`):
- `from_mapping_csv` path of sample sheet

Calling params (view `conf/calling.config`):

- `folder_ref`: the path of the folder contains reference genome
- `genome_name`: file name of fasta genome reference laid in folder_ref
- `human_knownsite_vcf`: Known-sites reference for BASE_RECALIBRATOR process in module Calling.


# 2. Channels
## 2.1 Input channels

<table class="tg" style="undefined;table-layout: fixed; width: 721px">
<colgroup>
<col style="width: 142px">
<col style="width: 579px">
<col style="width: 579px">
</colgroup>
<thead>
<tr>
<th class="tg-0pky"><span style="font-weight:bold">Channel</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Value</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Example</span></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-lboi">from_mapping</td>
<td class="tg-lboi">
- <span style="font-style:italic">key</span>: key from higher hierarchical structure (e.g. null, 'arrayA_batch4', ...) <br>
- <span style="font-style:italic">rg_id</span>: Group ID / Sample ID <br>
- <span style="font-style:italic">object</span>: Which object's sequence? e.g. human, shrimp, ...<br>
- <span style="font-style:italic">dedup_bam</span>: Path of bam file, after duplicate<br>
- <span style="font-style:italic">dedup_bai</span>: Path of *.bai file, index file of bam file.<br>
</td>
<td>([val(key), val(rg_id), val(object), path(dedup_bam), path(dedup_bai)])<br/>
</tr>
</tbody>
</table>

## 2.2 Reference channels

<table class="tg" style="undefined;table-layout: fixed; width: 721px">
<colgroup>
<col style="width: 142px">
<col style="width: 579px">
<col style="width: 579px">
</colgroup>
<thead>
<tr>
<th class="tg-0pky"><span style="font-weight:bold">Channel</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Value</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Example</span></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-0lax">calling_reference</td>
<td class="tg-0lax">
- <a href="https://ktest-dattn.atlassian.net/browse/PRS-93">folder_ref</a>: Folder contains reference genome and indexing *.{fa, fa.fai, dict}. <br>
- genome_ref: file name of fasta genome reference laid in folder_ref. (e.g. Homo_sapiens.GRCh38.dna.primary_assembly.fa) <br>
- <a href="https://ktest-dattn.atlassian.net/browse/PRS-96">human_knownsite_vcf</a>: Known-sites reference for BASE_RECALIBRATOR process in module Calling.
</td>
<td>
([folder_ref, genome_ref, human_knownsite_vcf])
</td>

</tr>
</tbody>
</table>

## 2.3. Output channels

<table class="tg" style="undefined;table-layout: fixed; width: 792px">
<colgroup>
<col style="width: 202px">
<col style="width: 590px">
<col style="width: 590px">
</colgroup>
<thead>
<tr>
<th class="tg-0pky"><span style="font-weight:bold">Channel</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Value</span></th>
<th class="tg-0pky"><span style="font-weight:bold">Example</span></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-lboi">split_vcf</td>
<td class="tg-lboi"><span style="font-weight:400;font-style:normal">
- </span><span style="font-style:italic">object</span>: Which object's sequence? e.g. human, shrimp, ...<br>
- <span style="font-style:italic">chr</span>: Order number of chromosome (1→22)
- <span style="font-style:italic">split_vcf</span>: Cohort vcf file after split by chromosome and its index<br>
</td>
<td>[val(object), val(chr), [path(split_vcf), path(split_vcf_tbi)]]<br/></td>
</tr>
</tbody>
</table>

## 2.4. Processes

<table class="tg" style="undefined;table-layout: fixed; width: 721px">
<colgroup>
<col style="width: 142px" />
<col style="width: 579px" />
<col style="width: 579px" />
<col style="width: 579px" />
</colgroup>
<thead>
<tr>
<th class="tg-0pky"><span style="font-weight: 400;">Process</span></th>
<th class="tg-0pky"><span style="font-weight: 400;">Input Channel</span></th>
<th class="tg-0pky"><span style="font-weight: 400;">Output Channel</span></th>
<th class="tg-0pky"><span style="font-weight: 400;">Description</span></th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-lboi">DRAFT_CALLING</td>
<td class="tg-lboi">[val(key), al(object), val(rg_id), path(dedup_bam), path(dedup_bai), path(folder_ref), val(genome_name)]</td>
<td>[val(key), val(object), val(rg_id), path(raw_variants_vcf)</td>
<td>Draft variant calling</td>
</tr>
<tr>
<td class="tg-lboi">DRAFT_JOIN</td>
<td class="tg-lboi">[val(key), val(object), path(raw_variants_vcf), path(folder_ref), val(genome_name)]</td>
<td>[val(key), val(object), path(joint_genotyped_draft{vcf.gz,vcf.gz.idx})]</td>
<td>Join draft variant VCF files into a draft cohort VCF file</td>
</tr>
<tr>
<td class="tg-lboi">BASE_RECALIBRATOR</td>
<td class="tg-lboi">[val(object), val(rg_id), path(dedup_bam), path(dedup_bai), val(joint_genotyped_draft_vcf), val(genome_name), path(human_knownsite_vcf)]</td>
<td>[val(object), val(rg_id), path(recal_data_table)]</td>
<td>Statistic impute and Overwrite original reported quality score in each mismatch with reference</td>
</tr>
<tr>
<td class="tg-lboi">APPLY_BQSR</td>
<td class="tg-lboi">[val(key), val(object), val(sample_id), path(dedup_bam), path(dedup_bai), path(recal_data_table), path(folder_ref), val(genome_name)]</td>
<td>[val(key), val(object), val(sample_id), path(recal_bam)]</td>
<td>Apply base quality score recalibration</td>
</tr>
<tr>
<td class="tg-lboi">CALL_VARIANTS</td>
<td class="tg-lboi">[val(key), val(object), val(rg_id), path(recal_bam)]</td>
<td>[val(key), val(object), val(rg_id), path(recal_vcf)]</td>
<td>Call germline SNPs and indels via local re-assembly of haplotypes</td>
</tr>
<tr>
<td class="tg-lboi">JOINING</td>
<td class="tg-lboi">[val(key), val(object), val(rg_id), path(variants_vcf), path(folder_ref), val(genome_name)</td>
<td>[val(key), val(object), path(joint_genotyped{vcf.gz,vcf.gz.tbi})]</td>
<td>Join variant VCF files into a draft cohort VCF file</td>
</tr>
<tr>
<td class="tg-lboi">SPLIT_CHR</td>
<td class="tg-lboi">[val(key), val(object), path(joint_genotyped), val(chr)</td>
<td>[val(key), val(object), val(chr), path(split_vcf{vcf.gz,vcf.gz.tbi})]</td>
<td>Split the cohort vcf file by chromosome</td>
</tr>
</table>
74 changes: 74 additions & 0 deletions modules/ktest/calling/calling.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
include { DRAFT_CALLING } from "./modules/draft_calling.nf"
include { DRAFT_JOIN } from "./modules/draft_join.nf"
include { BASE_RECALIBRATOR } from "./modules/base_recalibrator.nf"
include { APPLY_BQSR } from "./modules/apply_bqsr.nf"
include { CALL_VARIANTS } from "./modules/call_variants.nf"
include { JOINING } from "./modules/joining.nf"
include { SPLIT_CHR } from "./modules/split_chr.nf"

workflow CALLING{
take:
from_mapping // ([val(key), val(object), val(rg_id), path(dedup_bam), path(dedup_bai)])
calling_reference // ([path(folder_ref), val(genome_ref_name), [human_knownsite_vcf, human_knownsite_vcf_tbi]])

main:
// Split branch human object and others
from_mapping.branch{
human: it[1] == 'human'
return it + [["/dev/null"]] // ([val(key), val(object), val(rg_id), path(dedup_bam), path(dedup_bai), [path(null_file)]])
others: true // ([val(key), val(object), val(rg_id), path(dedup_bam), path(dedup_bai)])
}.set{ human_selector }

//===============IF NOT BEING HUMAN OBJECT==================
DRAFT_CALLING{
human_selector.others // ([val(key), val(object), val(rg_id), path(dedup_bam), path(dedup_bai)])
.combine(calling_reference.map{it[0,1]}) // ([val(key), val(object), val(rg_id), path(dedup_bam), path(dedup_bai), path(folder_ref), val(genome_name)])
}

DRAFT_JOIN{
DRAFT_CALLING.out.raw_variants_vcf // ([val(key), val(object), val(rg_id), path(raw_variants_vcf_gz)])
.groupTuple(by: [0,1]) // ([val(key), val(object), [val(rg_id), ...], [path(raw_variants_vcf_gz), ...]])
.combine(calling_reference.map{it[0,1]}) // ([val(key), val(object), [val(rg_id), ...], [path(raw_variants_vcf_gz),...], path(folder_ref), val(genome_name)])
}


non_human_pkg = from_mapping.combine(
DRAFT_JOIN.out.joint_genotyped_draft,
by: [0,1] // ([val(key), val(object), val(rg_id), path(dedup.bam), path(dedup.bai), [path(cohort_draft_vcf), path(cohort_draft_vcf_idx)]])
)
//==========================================================
BASE_RECALIBRATOR{
human_selector.human
.concat(non_human_pkg) // ([val(key), val(object), val(rg_id), path(dedup.bam), path(dedup.bai), [path(cohort_draft_vcf), path(cohort_draft_vcf_idx)]])
.combine(calling_reference.map{it[0..2]}) // ([val(key), val(object), val(rg_id), path(dedup.bam), path(dedup.bai), [path(cohort_draft_vcf), path(cohort_draft_vcf_idx)], path(folder_ref), val(genome_name), path(human_knownsite_vcf)])

}

APPLY_BQSR(
from_mapping.combine(
BASE_RECALIBRATOR.out.recal_data_table, by:[0,1,2] // ([val(key), val(object), val(rg_id), path(dedup.bam), path(dedup.bai), path(recal_data_table)])
).combine(calling_reference.map{it[0,1]}) // ([val(key), val(object), val(rg_id), path(dedup.bam), path(dedup.bai), path(recal_data_table), path(folder_ref), val(genome_name)])
)

CALL_VARIANTS(
APPLY_BQSR.out.recal_bam // ([val(key), val(object), val(rg_id), path(recal_bam)])
.combine(calling_reference.map{it[0,1]}) // ([val(key), val(object), val(rg_id), path(recal_bam), path(folder_ref), val(genome_name)])
)

JOINING(
CALL_VARIANTS.out.variants_recal_vcf // ([val(key), val(object), val(rg_id), [path(variants_recal_vcf_gz), path(variants_recal_vcf_gz_tbi)]])
.map{it.flatten()} // ([val(key), val(object), val(rg_id), path(variants_recal_vcf_gz), path(variants_recal_vcf_gz_tbi)])
.groupTuple(by: [0,1]) // ([val(key), val(object), [rg_id1, rg_id2, ...], [path(variants_recal_vcf_gz),...)], [path(variants_recal_vcf_gz_tbi), ...]])
.combine(calling_reference.map{it[0,1]}) // ([val(key), val(object), [rg_id1, rg_id2, ...], [path(variants_recal_vcf_gz),...)], [path(variants_recal_vcf_gz_tbi), ...], path(folder_ref), val(genome_name)])

)

// Split by chromosomes
SPLIT_CHR(
JOINING.out.cohort_vcf // ([val(key), val(object), [path(joint_genotyped_vcf_gz), path(joint_genotyped_vcf_gz_tbi)]])
.combine(Channel.of(1..22).map{it.toString()}) // ([val(key), val(object), [path(joint_genotyped_vcf_gz), path(joint_genotyped_vcf_gz_tbi)], val(chr)])
)

emit:
split_vcf = SPLIT_CHR.out.split_vcf // ([val(key), val(object), val(chr), [path(split_vcf), path(split_vcf_tbi)]])
}
3 changes: 3 additions & 0 deletions modules/ktest/calling/conf/base.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
params {
cache_sing_folder = "/home/ktest/pipeline_env/software/truongphi"
}
5 changes: 5 additions & 0 deletions modules/ktest/calling/conf/calling.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
params {
folder_ref = "/home/ktest/pipeline_env/database/Variant_Calling/hg38"
genome_name = "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
human_knownsite_vcf = "/home/ktest2/project/PRS/PRS-54/PRS-96/*.{vcf.gz,vcf.gz.tbi}"
}
3 changes: 3 additions & 0 deletions modules/ktest/calling/conf/input.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
params{
from_mapping_csv = ""
}
19 changes: 19 additions & 0 deletions modules/ktest/calling/conf/ktest_cluster.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
executor{
name = 'slurm'
queueSize = 30
}

process{

errorStrategy = { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
maxRetries = 5

queue = 'prod'
maxForks = 30
}

singularity{
enabled = true
cacheDir = "$params.cache_sing_folder"
runOptions = "--bind /home/"
}
4 changes: 4 additions & 0 deletions modules/ktest/calling/conf/test.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
params {
from_mapping = "/home/ktest/project/truongphi/PRS/PRS-21/PRS-66/from_mapping.tsv"
bam_folder = "/home/ktest/share/Working_folder/TRUONGPHI/test_data/test_calling"
}
3 changes: 3 additions & 0 deletions modules/ktest/calling/conf/test/test_human.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
params {
from_mapping_tsv = "$projectDir/tests/test_sample_human.tsv"
}
6 changes: 6 additions & 0 deletions modules/ktest/calling/conf/test/test_pig.config
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
params {
from_mapping_tsv = "$projectDir/tests/test_sample_pig.tsv"

folder_ref = "/home/ktest2/project/PRS/PRS-62/PRS-199/test_data/pig_reference"
genome_name = "GCA_000003025.6_Sscrofa11.1_genomic.fa"
}
21 changes: 21 additions & 0 deletions modules/ktest/calling/main.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/usr/bin/env nextflow
include { CALLING } from "./calling.nf"

workflow{
//INPUT CHANEL
from_mapping = channel.fromPath("$params.from_mapping_tsv")
.splitCsv(skip: 1, sep: '\t') // ([val(key), val(rg_id), val(object), path(dedup_bam), path(dedup_bai)])

calling_reference = channel.fromPath("${params.folder_ref}")
.combine(channel.of("${params.genome_name}"))
.combine(channel.fromPath("${params.human_knownsite_vcf}").collect().map{[it]}) // ([path(folder_ref), val(genome_ref_name), [human_knownsite_vcf, human_knownsite_vcf_tbi])
//CALLING
CALLING(
from_mapping,
calling_reference
)

//EMIT
//CALLING.split_vcf // ([val(key), val(object), val(chr), [path(split_vcf), path(split_vcf_tbi)]])

}
28 changes: 28 additions & 0 deletions modules/ktest/calling/modules/apply_bqsr.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
process APPLY_BQSR{

tag "$key:$object:$sample_id"

container "phinguyen2000/gatk_tabix:v0.1.0"
memory { 20.GB * task.attempt }
cpus { 5 * task.attempt }

input:
tuple val(key), val(object), val(sample_id), path(dedup_bam), path(dedup_bai), path(recal_data_table), path(folder_ref), val(genome_name)


output:
tuple val(key), val(object), val(sample_id), path("${key_string}_${object}_${sample_id}.recal.bam"), emit: recal_bam

script:
key_string = key ? key.join("-") : key

"""
gatk ApplyBQSR \
-R "$folder_ref/$genome_name" \
-I $dedup_bam \
-bqsr $recal_data_table\
-O "${key_string}_${object}_${sample_id}.recal.bam"
"""

}
32 changes: 32 additions & 0 deletions modules/ktest/calling/modules/base_recalibrator.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
process BASE_RECALIBRATOR{
tag "$key:$object:$rg_id"

container "phinguyen2000/gatk_tabix:v0.1.0"
memory { 20.GB * task.attempt }
cpus { 5 * task.attempt }

input:
tuple val(key), val(object),
val(rg_id), path(dedup_bam),
path(dedup_bai), path(joint_genotyped_draft_vcf),
path(folder_ref), val(genome_name),
path(human_knownsite_vcf)
output:
tuple val(key), val(object), val(rg_id), path("${key_string}_${rg_id}.recal_data.table"), emit: recal_data_table

script:
if (joint_genotyped_draft_vcf.getName() == 'null'){
known_site = "${human_knownsite_vcf[0]}"
}else{
known_site = joint_genotyped_draft_vcf[0]
}
key_string = key ? key.join("-") : key

"""
gatk BaseRecalibrator \
-I $dedup_bam \
-R "$folder_ref/$genome_name" \
--known-sites $known_site \
-O ${key_string}_${rg_id}.recal_data.table
"""
}
Loading

0 comments on commit 51a3579

Please sign in to comment.