
germline joint detect variants workflow #1043

Open: wants to merge 35 commits into base: master

@apaul7 apaul7 commented Jul 15, 2021

This PR adds a joint detect variants subworkflow. It is split into 2 separate subworkflows, joint_detect_snps.cwl and joint_detect_svs.cwl. I tried my best to add informative commit messages, and I've itemized most of the overall changes below:

  • tools/replace_vcf_sample_name.cwl
    updated the output file name to use the full name instead of just the base name with "renamed" in front
  • tools/gather_to_subdirectory.cwl
    added --recursive, --preserve, and --no-clobber to the cp command. This allows timestamps to be preserved and errors to be thrown if files would be overwritten
    also added the valueFrom field, which uses JavaScript to iterate over the input array and add any secondaryFiles to the destination directory
  • tools/gather_to_subdirectory_dirs.cwl (new tool)
    this follows the same format as tools/gather_to_subdirectory.cwl but is used for directories. I tried using the same tool for both use cases but was unable to make Cromwell happy
  • tools/bcftools_view.cwl (new tool)
    This tool is used to split multi-sample VCFs into single-sample VCFs.
  • tools/vt_normalize.cwl (new tool)
    uses vt to normalize a VCF; an alternative to GATK LeftAlignAndTrimVariants found in tools/normalize_variants.cwl
  • subworkflows/joint_genotype.cwl
    added decompose and normalize steps
  • tools/manta_germline.cwl (new tool)
    follows the tools/manta_somatic.cwl format. Adds the stats directory to outputs and removes somatic and tumor-only outputs.
  • tools/genotype_gvcfs.cwl
    added input for minimum confidence threshold for called variants
  • tools/custom_merge_sv_records.cwl (new tool)
    merges copy number variant calls that have the same type and are within x bases of each other
  • tools/cnvnator.cwl
    updated output file names, s/CNV/cnv/
  • tools/annotsv_filter.cwl
    adds ability to filter survivor merged calls (skips the last filtering requirement, since survivor does not pass INFO fields to the merged vcf)
    renamed all_cds to no_cds. This makes it easier to understand that the input removes the coding sequence filter requirement
  • tools/annotsv.cwl
    updated to version 2.3. This is not the latest version; the latest version no longer retains information for individual SV population databases in output files
    renamed inputs for the new version
    added input for annotation directory instead of having them in the docker image
    added unannotated tsv output
  • subworkflows/merge_svs.cwl
    added inputs for population allele frequency, no_cds annotsv filtering, and annotsv annotation directory
    output file name replacement, s/SURVIVOR/survivor/
    added step for survivor merged annotsv filtering
  • subworkflows/gatk_soft_filter.cwl
    added subworkflow that soft-filters with GATK based on the hard-filtering parameters, marking records PASS/FAIL
    https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering#2
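
To illustrate what soft filtering on hard parameters means, here is a minimal Python sketch. It is not the code in subworkflows/gatk_soft_filter.cwl: the thresholds are the SNP values from GATK's published hard-filtering recommendations, the filter names are hypothetical, and the values actually used in this PR may differ. Records are never dropped; only the FILTER value changes.

```python
import operator

# Assumption: GATK's generic SNP hard-filter recommendations. The filter
# names and the thresholds used by the subworkflow in this PR may differ.
SNP_HARD_FILTERS = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]


def soft_filter(info):
    """Return the FILTER value for one record's INFO annotations.

    The record is kept either way; it is only labelled. Annotations that
    are missing from `info` are not evaluated.
    """
    failed = [
        name
        for name, key, fails, threshold in SNP_HARD_FILTERS
        if key in info and fails(info[key], threshold)
    ]
    return ";".join(failed) if failed else "PASS"
```

For example, `soft_filter({"QD": 1.5, "FS": 75.0})` labels the record with both failed filters instead of removing it.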

apaul7 added 25 commits July 15, 2021 08:55
update input name to be clear that the no_cds filter does not run the coding sequences filter
allows filtering of survivor merged annotsv tsv. Also allow control over
the population allele frequency value, still defaults to 0.05.
version 2.3 requires the annotation directory to be passed as an input.
Also capture the unannotated event tsv as an output.
changing output names for consistency
added javascript to pass in any secondary files when staging output
files.
added --recursive to copy everything
added --preserve to keep timestamps (cromwell does not stage files for
this to matter...)
added --no-clobber to error out if files are overwritten
added optional directory input for staging files and a single directory.
This uses VT to normalize a VCF. This is an alternative to GATK4
LeftAlignTrimVariants
This allows vcfs to be split by samples
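
As a rough illustration of per-sample splitting, the Python sketch below only builds the `bcftools view` command lines; it does not run them. The output naming and the helper name `split_commands` are hypothetical, and the arguments actually used in tools/bcftools_view.cwl may differ.

```python
def split_commands(vcf, samples):
    """Build one `bcftools view` command per sample (hypothetical helper).

    -s selects the sample, -O z writes bgzip-compressed VCF, -o names the
    output file. The `<sample>.vcf.gz` naming is an assumption.
    """
    return [
        ["bcftools", "view", "-s", sample, "-O", "z", "-o", f"{sample}.vcf.gz", vcf]
        for sample in samples
    ]
```

Each returned list could be passed to `subprocess.run` on a machine where bcftools is installed.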
This allows manta to be run with multiple samples in a joint-calling fashion
This runs cnvnator in single sample mode over multiple samples. The sample
rename step is required as the sample name in the output vcf can change
from the input. examples:
  input name -> output name
  sample.1 -> sample
  sample.1.2 -> sample
  sample_1 -> sample_1
also stages output vcf name to follow $SAMPLE.cnvnator.vcf.gz format
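
The three examples above are consistent with the name being cut at the first dot. The Python sketch below encodes that reading; it is an inference from the examples, not documented cnvnator behavior, and both function names are hypothetical.

```python
def expected_cnvnator_sample_name(input_name):
    """Guess the sample name found in the cnvnator VCF.

    Assumption: everything from the first '.' onward is dropped, which
    matches the three examples in the commit message.
    """
    return input_name.split(".", 1)[0]


def staged_vcf_name(sample, caller="cnvnator"):
    """File name in the $SAMPLE.cnvnator.vcf.gz format described above."""
    return f"{sample}.{caller}.vcf.gz"
```

Under this reading, the rename step maps `expected_cnvnator_sample_name(name)` back to the original `name` before the file is staged.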
This runs cnvkit in single sample mode for multiple samples.
The sample rename step is required as output sample name in the vcf is
based on the input filename. Currently that is hardcoded to be
`adjusted.tumor`
also stage output file name to follow $SAMPLE.cnvkit.vcf.gz format
This subworkflow runs sv filtering for manta/smoove calls. Final sample
names follow the $SAMPLE-$CALLER format. This allows easy tracking for
the source of calls in output merged vcfs.
This runs the depth filters for events called by cnvkit/cnvnator. Final
sample names follow the $SAMPLE-$CALLER format. This allows easy
tracking for the source of calls in final merged vcfs.
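
A small Python sketch of the $SAMPLE-$CALLER convention used by both filtering subworkflows. The helper names are hypothetical, and splitting on the last hyphen assumes caller names never contain a hyphen.

```python
def tag_sample(sample, caller):
    """Build the $SAMPLE-$CALLER name used in the merged VCF."""
    return f"{sample}-{caller}"


def split_tag(tagged):
    """Recover (sample, caller) from a tagged name.

    Splits on the last hyphen so that sample names containing hyphens
    survive; assumes the caller name itself has no hyphen.
    """
    sample, caller = tagged.rsplit("-", 1)
    return sample, caller
```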
added custom merge sv records. This allows calls to be merged together
if they are of the same type and within a bp window. This does not
remove calls; it just adds a new record in the output vcf.
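
The merge described above can be sketched in Python as follows. This is not the code in tools/custom_merge_sv_records.cwl: the record type is hypothetical, and "within a bp window" is assumed to mean that the gap between the end of one call and the start of the next is at most `window`.

```python
from typing import List, NamedTuple


class SvRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    svtype: str


def merge_nearby(records: List[SvRecord], window: int) -> List[SvRecord]:
    """Return the original records plus one new record per merged group."""
    merged = []
    current = None  # span of the group being built
    members = 0     # number of calls folded into `current`
    # Sorting by (chrom, svtype, start) puts mergeable calls next to each other.
    for rec in sorted(records, key=lambda r: (r.chrom, r.svtype, r.start)):
        same_group = (
            current is not None
            and rec.chrom == current.chrom
            and rec.svtype == current.svtype
            and rec.start - current.end <= window
        )
        if same_group:
            current = current._replace(end=max(current.end, rec.end))
            members += 1
        else:
            if current is not None and members > 1:
                merged.append(current)
            current, members = rec, 1
    if current is not None and members > 1:
        merged.append(current)
    # Nothing is removed: merged spans are appended to the original calls.
    return list(records) + merged
```

Appending rather than replacing mirrors the commit message: the input calls stay in the output, and each merged group contributes one extra record.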
This runs the sv callers in joint mode, merges, annotates, filters, and
stages the results in a directory structure
This generates per sample gvcf files, jointly calls variants with gatk,
annotates, filters, and stages the outputs.
This subworkflow calls the joint detect snps and joint detect svs
subworkflows, outputting the staged results
jasonwalker80 previously approved these changes Sep 15, 2021

@jasonwalker80 jasonwalker80 left a comment

+1, @apaul7 I have not reviewed this in its entirety. I've asked @tmooney to take a look as well. Both of us have reviewed the PR, but not necessarily commit-by-commit or line-by-line. I'm going to give the "looks good to me", but I'd like Tom to look mostly for places where these commits/changes may impact other workflows/pipelines. Then let's merge.

@tmooney tmooney left a comment

I generally think this looks okay. I'll trust that what it does is what you want to have happen.

Part of me sees all the references to SNPs and has flashbacks to past instances where we've been told to go back and change them to SNVs instead, but I'll also trust that this name is the one we want to use 😄

Resolved (outdated) review threads:

  • definitions/subworkflows/gatk_soft_filter.cwl
  • definitions/tools/bcftools_view.cwl (2 threads)
  • definitions/tools/custom_merge_sv_records.cwl (2 threads)
  • definitions/tools/gather_to_sub_directory_dirs.cwl (2 threads)