V-pipe is a workflow designed for the analysis of next generation sequencing (NGS) data from viral pathogens. It produces a number of results in a curated format (e.g., consensus sequences, SNV calls, local/global haplotypes). V-pipe is written using the Snakemake workflow management system.
V-pipe is optimized for Linux and macOS systems. We therefore recommend that Windows users install WSL2 - this is not a full virtual machine but rather a way to run Windows and Linux cooperatively at the same time.
V-pipe takes raw data in FASTQ format as input and, depending on the user-defined configuration, outputs consensus sequences, SNV calls, and local/global haplotypes.
V-pipe expects the input samples to be organized in a two-level hierarchy:
At the first level, input files are grouped by samples (e.g.: patients or biological replicates of an experiment).
At the second level, different datasets belonging to the same sample (e.g., from sample dates) are distinguished.
Inside the second-level directory, the sub-directory `raw_data` holds the sequencing data in FASTQ format (optionally compressed with GZip). Paired-end reads need to be in split files with suffixes `_R1` and `_R2`.
📁samples
├───📁patient1
│ └───📁date1
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁patient2
├───📁date1
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁date2
└───📁raw_data
├───🧬reads_R1.fastq
└───🧬reads_R2.fastq
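The hierarchy above can be created with a few shell commands. In this sketch, `patient1` and `date1` are placeholder names standing in for your own sample and date identifiers:

```shell
# Sketch: create the two-level hierarchy for one sample/date pair.
# "patient1" and "date1" are placeholder names -- use your own identifiers.
mkdir -p samples/patient1/date1/raw_data

# The paired-end FASTQ files must carry the _R1/_R2 suffixes
# (empty placeholder files here, just to illustrate the naming).
touch samples/patient1/date1/raw_data/reads_R1.fastq
touch samples/patient1/date1/raw_data/reads_R2.fastq
```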
In the directory `example_HIV_data` you will find a small test dataset that you can run on your workstation or laptop.
The files have the following structure:
📁samples
├───📁CAP217
│ └───📁4390
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁CAP188
    ├───📁4
│ └───📁raw_data
│ ├───🧬reads_R1.fastq
│ └───🧬reads_R2.fastq
└───📁30
└───📁raw_data
├───🧬reads_R1.fastq
└───🧬reads_R2.fastq
V-pipe uses the Bioconda bioinformatics software repository for all its pipeline components. The pipeline itself is implemented using Snakemake.
For advanced users: if you are fluent with these tools, you can:
- directly download and install Bioconda and Snakemake,
- specify your V-pipe configuration, and start using V-pipe.

Use `--use-conda` to automatically download and install any further pipeline dependencies. Please refer to the documentation for additional instructions.
In this tutorial you will learn how to set up a workflow for the example dataset.
To deploy V-pipe, you can use the installation script with the following parameters:
curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -p testing -w work
Note that:
- using `-p` specifies the subdirectory where to download and install Snakemake and V-pipe
- using `-w` will create a working directory and populate it. It will copy over the references and the default `config/config.yaml`, and create a handy `vpipe` short-cut script to invoke `snakemake`.
If you get `zsh: permission denied: ./quick_install.sh`, run `chmod +x quick_install.sh` to give the script the necessary permissions.
Tip: To create and populate other new working directories, you can call init_project.sh from within the new directory:
mkdir -p working_2
cd working_2
../V-pipe/init_project.sh
Copy the samples directory you created in the step "Preparing a small dataset" to this working directory. (You can display the directory structure with `tree testing/work/resources/samples` or `find testing/work/resources/samples`.)
mkdir -p testing/work/resources
mv testing/V-pipe/docs/example_HIV_data/samples testing/work/resources/samples
Note that:
- by default V-pipe expects its samples in a directory `samples` contained directly in the working directory, i.e. `testing/work/samples`
- in this tutorial we put them inside the `resources` subdirectory, and will set the config file accordingly.
If you have a reference sequence that you would like to use for read mapping and alignment, add it as `resources/reference/ref.fasta`. In our case, however, we will use the reference sequence HXB2 already provided by V-pipe at `V-pipe/resources/hiv/HXB2.fasta`.
In the `work` directory you can find the file `config.yaml`. This is where the V-pipe configuration should be specified. See here for the documentation of the configuration.
In this tutorial we are building our own configuration, therefore `virus_base_config` will remain empty. Since we are working with HIV-1, V-pipe provides meta-information that will be used for visualisation (`metainfo_file` and `gff_directory`).
cat <<EOT > ./testing/work/config.yaml
general:
virus_base_config: ""
aligner: bwa
snv_caller: shorah
haplotype_reconstruction: haploclique
input:
reference: "{VPIPE_BASEDIR}/../resources/hiv/HXB2.fasta"
metainfo_file: "{VPIPE_BASEDIR}/../resources/hiv/metainfo.yaml"
gff_directory: "{VPIPE_BASEDIR}/../resources/hiv/gffs/"
datadir: resources/samples/
read_length: 301
samples_file: samples.tsv
paired: true
snv:
consensus: false
output:
snv: true
local: true
global: true
visualization: true
QA: false
diversity: true
EOT
Note: YAML files use spaces for indentation; you can use 2 or 4 spaces, but no tabs. There are also online YAML validators that you might want to use if your YAML file is wrongly formatted.
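A quick way to catch the most common mistake, tab characters in the indentation, is a plain `grep`. This is just a local sanity check, not part of V-pipe; the snippet demonstrates it on a deliberately broken throwaway file:

```shell
# Write a deliberately broken YAML fragment (tab-indented) to a temp file.
printf 'general:\n\tvirus_base_config: ""\n' > /tmp/bad_config.yaml

# grep for a literal tab character; a match means the file needs fixing.
if grep -q "$(printf '\t')" /tmp/bad_config.yaml; then
    echo "tab indentation found -- replace tabs with spaces"
fi
```

Run the same `grep` against your own `config.yaml` before starting the workflow.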
Before running, check what will be executed:
cd testing/work/
./vpipe --dryrun
cd ../..
As this is your first run of V-pipe, it will also generate the sample collection table. Check `samples.tsv` in your editor.
Note that the samples you have downloaded have reads of length 301 only. V-pipe’s default parameters are optimized for reads of length 250. To adapt to the read length, add a third column in the tab-separated file as follows:
cat <<EOT > testing/work/samples.tsv
CAP217 4390 301
CAP188 4 301
CAP188 30 301
EOT
Always check the content of the `samples.tsv` file.
If you did not use the correct directory structure, this file might end up empty or some entries might be missing.
You can safely delete it and re-run with option `--dry-run` to regenerate it.
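Since the file is tab-separated, a malformed row is easy to spot mechanically. Below is a sketch of such a check, run here on a stand-in copy of the tutorial's table (`/tmp/samples.tsv` is just a demo path, not one V-pipe uses):

```shell
# Stand-in copy of the tutorial's samples.tsv (tab-separated).
printf 'CAP217\t4390\t301\nCAP188\t4\t301\nCAP188\t30\t301\n' > /tmp/samples.tsv

# Every row should have exactly three tab-separated fields:
# sample name, date, and read length.
awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad }' /tmp/samples.tsv \
    && echo "samples.tsv looks well-formed"
```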
Finally, we can run the V-pipe analysis (the necessary dependencies will be downloaded and installed in conda environments managed by snakemake):
cd testing/work/
./vpipe -p --cores 2
The Wiki contains an overview of the output files. The output of the SNV calling step is aggregated in a standard VCF file, located in `results/{hierarchy}/variants/SNVs/snvs.vcf`. You can open it with your favorite VCF tools for visualisation or downstream processing. It is also available in a tabular format in `results/{hierarchy}/variants/SNVs/snvs.csv`.
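To get a quick look at the calls without the long `##` metadata header, a plain `grep` works on any VCF. This is a generic VCF trick, not V-pipe-specific; the snippet demonstrates it on a tiny hand-made stand-in file, while after a real run you would point it at `results/{hierarchy}/variants/SNVs/snvs.vcf`:

```shell
# Tiny stand-in VCF, only for demonstrating the command below.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nHXB2\t1234\t.\tA\tG\t.\tPASS\tAF=0.12\n' > /tmp/snvs_demo.vcf

# Drop the "##" metadata lines, keeping the column header and the records.
grep -v '^##' /tmp/snvs_demo.vcf
```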
The default configuration uses ShoRAH to call the SNVs and to reconstruct the local (windowed) haplotypes.
Components of the pipeline can be swapped simply by changing the `config.yaml` file. For example, to call SNVs using LoFreq instead of ShoRAH, use:
general:
snv_caller: lofreq