The CRISPR-DAV pipeline can be run via a docker container or a physical installation.
The docker repository for CRISPR-DAV is called pinetree1/crispr-dav. It's based on the official Fedora image at Docker Hub, and has included the pipeline and prerequisite tools. No physical installation of them is required but you need to be able to run docker on your system.
The pipeline includes two example projects. Here are steps to test run example1. Running example2 is quite similar. You may replace /Users/xyz/temp with your own absolute path in the following commands.
(1) Start the container interactively and mount a path of host to the container:
docker run -it -v /Users/xyz/temp:/Users/xyz/temp pinetree1/crispr-dav
The docker image is about 1GB, and takes a few minutes to start up for the first time. This command mounts /Users/xyz/temp in the host to /Users/xyz/temp in the container. Inside the container, the pipeline's path is /crispr-dav.
(2) After starting up, at the container prompt, go to example1 directory:
cd /crispr-dav/Examples/example1
(3) Start the pipeline:
sh run.sh
(4) When the pipeline is finished, move the results to the shared directory in container:
mv deliverables /Users/xyz/temp
(5) Exit from the container:
exit
(6) On the host, open a browser to view the report, index.html, in /Users/xyz/temp/deliverables/GENEX_CR1.
The general steps for analyzing your own project via the docker are similar. You'll need to prepare a set of input files: conf.txt, amplicon.bed, site.bed, sample.site, fastq.list, and run.sh, similar to those in the examples; and prepare reference genome or amplicon sequence. The important thing is to share your data directories with the container. For example, assuming that there are 3 directories on the host related to your project:
/Users/xyz/temp/project: contains the input files.
/Users/xyz/temp/rawfastq: contains the fastq files.
/Users/xyz/temp/genome: contains the genome files.
You'll mount these directories to the container (using the same paths for convenience):
docker run -it -v /Users/xyz/temp/project:/Users/xyz/temp/project \
-v /Users/xyz/temp/rawfastq:/Users/xyz/temp/rawfastq \
-v /Users/xyz/temp/genome:/Users/xyz/temp/genome \
pinetree1/crispr-dav
cd /Users/xyz/temp/project
Then edit conf.txt, fastq.list, and run.sh to reflect the paths in the container.
Start the pipeline by: sh run.sh. The results will be present in the project directory of the container and the host. Due to the nature of BWA, your results could be slightly different from what's shown in Git repository README file.
The pipeline runs on Linux and MacOS. The installation on Linux is a bit simpler than on MacOS.
git clone https://github.com/pinetree1/crispr-dav.git
In the resulting crispr-dav directory, all the Perl programs (*.pl) use this line to invoke the perl in your environment: #!/usr/bin/env perl. The path of env on your system may differ. If so, the path should be changed accordingly in all *.pl files in crispr-dav directory.
The pipeline utilizes a set of tools, most of which are common in bioinformatics field. These include Perl and Python modules, R, and NGS tools.
A. Perl modules
The following modules are required but may not be present in default perl install.
Config::Tiny
Excel::Writer::XLSX
JSON
Run this command to check whether they are already installed:
perl -e "use <module>", e.g. perl -e "use Config::Tiny"
If there is no output, the module is already installed. Error message will show up if it's not installed.
If you have root privilege, installing a perl module could be simple:
sudo cpanm <module>, e.g. cpanm Config::Tiny
If you prefer to install modules as a non-root user, these steps show how to install Config::Tiny into local directory $HOME/perlmod:
wget http://search.cpan.org/CPAN/authors/id/R/RS/RSAVAGE/Config-Tiny-2.23.tgz
tar xvfz Config-Tiny-2.23.tgz
cd Config-Tiny-2.23
perl Makefile.PL INSTALL_BASE=$HOME/perlmod
make
make install
The modules can be found in CPAN. Install the other modules similarly. Keep in mind that if they have dependencies which are not already installed on your system, you will need to install them as well.
If a module is installed globally by root, it is already in @INC which has paths that perl searches for a module.
But if the module is installed in a local path, you'll need to add the path to @INC by setting PERL5LIB:
export PERL5LIB=$HOME/perlmod/lib/perl5:$PERL5LIB
B. NGS tools
-
ABRA: Assembly Based ReAligner. Recommended version: 0.97. Java 1.7 or later is needed to run the realigner.
Example installation by non-root user on Linux:
mkdir -p $HOME/app/ABRA cd $HOME/app/ABRA wget https://github.com/mozack/abra/releases/download/v0.97/abra-0.97-SNAPSHOT-jar-with-dependencies.jar (Pre-built jar for 64-bit Linux)
On MacOS, the jar file has to be re-built. The steps are a bit complex:
First, clone and make Google sparsehash temporarily: mkdir ~/temp cd ~/temp git clone https://github.com/sparsehash/sparsehash.git cd sparsehash ./configure make Second, download ABRA source file: cd ~/temp wget https://github.com/mozack/abra/archive/v0.97.tar.gz tar xvfz v0.97.tar.gz cd abra-0.97/src/main/c Now replace the abra's sparsehash with the new one: mv sparsehash sparsehash.old ln -s ~/temp/sparsehash/src/sparsehash Still in abra-0.97/src/main/c, create links to java library files(jni.h and jni_md.h): which java: this shows /usr/bin/java, for example. ls -l /usr/bin/java: shows it links to /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands/Java. Then the two header files can be found in "Current" directory. ln -s /System/Library/Frameworks/JavaVM.framework/Versions/Current/Headers/jni.h ln -s /System/Library/Frameworks/JavaVM.framework/Versions/Current/Headers/jni_md.h Third, build the jar file. You'll need Maven and g++. which mvn: shows the mvn path. Otherwise install it from Apache. cd ~/temp/abra-0.97 make mv target/abra-0.97-SNAPSHOT-jar-with-dependencies.jar $HOME/app/ABRA
-
BWA: Burrows-Wheeler Aligner. Make sure your version supports "bwa mem -M" command, and bwa must be put in PATH for use by ABRA. Recommended version: 0.7.15.
Example install by non-root user:
cd $HOME/app wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download --no-check-certificate -O bwa-0.7.15.tar.bz2 tar xvfj bwa-0.7.15.tar.bz2 cd bwa-0.7.15 make
Be sure to put executable 'bwa' in your PATH, for example, by adding this line to $HOME/.bashrc assuming you are using bash:
export PATH=$HOME/app/bwa-0.7.15:$PATH
-
Samtools: Recommended version: 1.3.1. Older version of samtools is OK.
Example install by non-root user:
cd $HOME/app wget https://sourceforge.net/projects/samtools/files/samtools/1.3.1/samtools-1.3.1.tar.bz2/download --no-check-certificate -O samtools-1.3.1.tar.bz2 tar xvfj samtools-1.3.1.tar.bz2 ./configure make
-
Bedtools2: Make sure your version supports -F option in 'bedtools intersect' command. Recommended version: 2.25.0
Example install by non-root user:
cd $HOME/app wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz tar xvfz bedtools-2.25.0.tar.gz cd bedtools2 make
-
PRINSEQ: Recommended version: 0.20.4. Be sure to make the program prinseq-lite.pl executable:
Example install by non-root user:
cd $HOME/app wget https://sourceforge.net/projects/prinseq/files/standalone/prinseq-lite-0.20.4.tar.gz/download --no-check-certificate -O prinseq-lite-0.20.4.tar.gz tar xvfz prinseq-lite-0.20.4.tar.gz cd prinseq-lite-0.20.4 chmod +x prinseq-lite.pl
-
FLASH: Recommended version: 2, for merging paired-end reads.
Example install by non-root user:
cd $HOME/app git clone https://github.com/dstreett/FLASH2.git cd FLASH2 make
C. R packages
-
R packages: ggplot2, reshape2, naturalsort
To check whether a package is already installed, at R prompt, type for example:
>libarary(ggplot2). Absence of output means it is already installed.
To install the packages, after starting R, type:
>install.packages("ggplot2") >install.packages("reshape2") >install.packages("naturalsort")
If you get permission errors, check with your admin.
If you have to install R in a local directory, here are example steps:
cd $HOME/app wget https://cran.r-project.org/src/base/R-3/R-3.2.1.tar.gz tar xvfz R-3.2.1.tar.gz cd R-3.2.1 ./configure make Then install the packages as stated above.
D. Python program
Required: Pysamstats https://github.com/alimanfoo/pysamstats
To install it as root, the simple steps are:
sudo pip install pysam==0.8.4
sudo pip install pysamstats==0.24.3
These modules will be installed in system-wide location. No export of PYTHONPATH is needed.
To install it in home directory, you may try these steps:
-
Install prerequisite pysam module:
pip install --install-option="--prefix=$HOME" pysam==0.8.4
This would install pysam in $HOME/lib/python2.7/site-packages, assuming your Python version is 2.7 (The 'lib' could be lib64, depending on system).
Then make pysam module searchable:
export PYTHONPATH=$PYTHONPATH:$HOME/lib/python2.7/site-packages
-
Install pysamstats:
pip install --install-option="--prefix=$HOME" pysamstats==0.24.3
This would install pysamstats module in the same place as pysam module, and install an executable script $HOME/bin/pysamstats.
Check whether the modules can be loaded:
$ python
>>>import pysam
>>>import pysamstats
>>>exit()
If there is no output, the installation is successful.
You should add the export command to the pipeline's run.sh script, if the modules are installed by non-root user.
On Linux system, you may drop the version numbers (e.g, ==0.8.4) to install the most recent versions. However, on MacOS (at least X El Capitan), the recent verions (0.11.x) of pysam seem problematic, but the pair of pysam 0.8.4 and pysamstats 0.24.3 works alright.
CRISPR-DAV includes two examples in Examples directory. The example1 uses a genome as reference, whereas example2 uses an amplicon sequence as reference. The procedure to run the pipelines is similar in the examples.
cd crispr-dav/Examples/example1
Edit the conf.txt and run.sh accordingly. Remember to add commands of setting PERL5LIB and PYTHONPATH in run.sh if the Perl and Python modules were installed locally.
Start the pipeline: sh run.sh. This shell script invokes the main program crispr.pl which starts the pipeline.
The pipeline would create these directories:
-
align: contains the intermediate files. They can be removed once the HTML report is produced. For description of the file types, please check the README file in the directory.
Make sure not to put your source fastq files in this directory. They could be overwritten there.
-
deliverables: contains the results. The HTML report file index.html is in a subdirectory. Due to the nature of BWA, your results could be slightly different from what's shown in Git repository README file.
- Fastq files:
These are the raw fastq files. They must be gzipped with file extension .gz. Put the fastq files in a directory outside the pipeline's output directory. Don't put them inside the pipeline's "align" directory, as they could get overwritten.
- Reference files:
An amplicon sequence or a genome can be used as a reference. If an amplicon sequence is used for reference, all you need is a fasta file containing sequence ID and the sequence.
If a genome is used as reference, you'll prepare a fasta file, BWA index, and refGene coordinate files
A. Prepare fasta file:
For example, to parepare human genome hg19, download the chromosome sequence files from UCSC browser, uncompress and combine them into one file, e.g. hg19.fa.
B. Create Fasta index:
samtools faidx hg19.fa
C. Create bwa index:
bwa index hg19.fa
D. Download refGene table:
Go to UCSC Genome Broser (http://genome.ucsc.edu/cgi-bin/hgBlat), click Tools and select TableBrowser. Then make these selections:
Assembly: hg19
Group: Genes and Gene Predictions
Track: RefSeq Genes
Table: refGene
Region: Genome
Output format: all fields from selected table.
The downloaded tab-delimited file should have these columns: bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds,...
- amplicon.bed:
A tab-delimited text file with 6 columns for: chr, start, end, genesymbol, refseq_accession, strand. Only one amplicon is allowed. The start and end are 0-based, conforming to BED format. Genesymbol should have no space. Refseq_accession must match the value in the "name" field (2nd column) in genome's refGene table for the gene.
- site.bed:
A tab-delimited text file with 6 or 7 columns for: chr, start, end, crispr_name, sgRNA_sequence, strand, HDR_new_bases_and_positions. This file can contain multiple rows, but crispr_name and sgRNA_sequence must be unique. All the CRISPR sites must belong to the same amplicon. Start and end are 0-based.
The 7th field is optional. If HDR is performed, enter expected base changes in the field.
HDR format: ,,... The bases are desired new bases on positive strand, e.g.101900208C,101900229G,101900232C,101900235A. No space. The positions are 1-based.
- sample.site:
This file controls what samples to be analyzed. It's a tab-delimited text file with 2 or more columns for: sample name, sgRNA_sequence1, sgRNA_sequence2, ...
- fastq.list:
A tab-delimited text file with 2 or 3 columns for: sample name, read1 file, optional read2 file. Fastq files must be gzipped with .gz extension. The sample name must match what's in sample.site.
- conf.txt:
Use the conf.txt.template in crispr.pl script directory as template, modify the paths and settings accordingly.
Please note that none of the tab-delimited files should have a column header row. Each field should not contain space. The names of the input files can be changed.
- run.sh:
For convenience, you may use run.sh.template to create a wrapper script run.sh to start the pipeline. Edit the file accordingly. You may set module paths there.
- Errors before pipeline starts:
These are errors related to prerequisite tools and data inputs. For example, if a required tool or module is not found, there will be error message indicating the issue. You may try setting PERL5LIB and PYTHONPATH if the module is installed. If an input file used space instead of the required tab as separator, the pipeline would report error of missing columns.
-
Errors during pipeline:
There are several log files per sample. Check the README in 'align' directory for descriptions. For example, if the installed Bedtools version does not support "bedtools insert -F", there will be error messages in the .log file.