- Choose GENOME from hg19, hg38, mm9 and mm10 and specify a destination directory.

      $ bash scripts/download_genome_data.sh [GENOME] [DESTINATION_DIR]
- Find a TSV file in the destination directory and use it for "chip.genome_tsv" in your input JSON.
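The two steps above can be sketched as a pair of small checks. This is only an illustration: the helper names `is_supported_genome` and `find_genome_tsv` are hypothetical and not part of the pipeline, and the sketch assumes the TSV file sits at the top level of the destination directory.

```shell
#!/usr/bin/env bash
# Sketch only: both helper names are hypothetical, not part of the pipeline.

# Succeed if the genome name is one of the four listed above.
is_supported_genome() {
  case "$1" in
    hg19|hg38|mm9|mm10) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the TSV files found directly inside the destination directory.
# Assumption: the TSV is at the top level of DESTINATION_DIR.
find_genome_tsv() {
  find "$1" -maxdepth 1 -type f -name '*.tsv' | sort
}
```

Calling `is_supported_genome "$GENOME"` before the download script catches a mistyped genome name early, and `find_genome_tsv "$DESTINATION_DIR"` prints the path to copy into the input JSON.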
- Install Conda. Skip this step if you already have an equivalent Conda distribution (Anaconda Python). Download and run the installer. Agree to the license terms by typing yes. It will ask you for the installation location. On Stanford clusters (Sherlock and SCG4), we recommend installing it outside of your $HOME directory, since that filesystem is slow and has very limited space. At the end of the installation, choose yes to add Miniconda's binary to $PATH in your BASH startup script.

      $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      $ bash Miniconda3-latest-Linux-x86_64.sh
- Install the pipeline's Conda environment.

      $ bash scripts/uninstall_conda_env.sh  # to remove any existing pipeline env
      $ bash scripts/install_conda_env.sh
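To confirm that the environment was created, one option is to look for its name in the output of `conda env list`. The sketch below assumes the environment is called `encode-chip-seq-pipeline`, which is the name activated in the genome build step of this guide. The helper `conda_env_listed` is a hypothetical name and is not shipped with the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if NAME appears as an environment name
# in `conda env list`-style output read from standard input.
conda_env_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (requires conda on PATH):
#   conda env list | conda_env_listed encode-chip-seq-pipeline && echo "env present"
```

The helper reads from standard input rather than calling `conda` itself, so it can be tried on saved output from another machine.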
- Choose GENOME from hg19, hg38, mm9 and mm10 and specify a destination directory. This will take several hours. We recommend not running this installer on a login node of your cluster, since it needs more than 8 GB of memory and more than 2 hours of run time.

      $ conda activate encode-chip-seq-pipeline
      $ bash scripts/build_genome_data.sh [GENOME] [DESTINATION_DIR]
- Find a TSV file in the destination directory and use it for "chip.genome_tsv" in your input JSON.
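As an illustration of where the TSV path goes, the sketch below prints a minimal JSON fragment containing only the "chip.genome_tsv" key. The helper name `print_genome_json` and the example path are hypothetical, a real input JSON needs further keys that are not shown here, and the sketch assumes the path contains no characters that require JSON escaping.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a minimal input JSON fragment for a TSV path.
# Assumption: the path needs no JSON escaping (no quotes or backslashes).
print_genome_json() {
  printf '{\n  "chip.genome_tsv": "%s"\n}\n' "$1"
}

# Placeholder path; substitute the TSV found in your destination directory.
print_genome_json "/path/to/destination/hg38.tsv"
```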
- You can build your own genome database if your reference genome has one of the following file types.
  - .fasta.gz
  - .fa.gz
  - .fasta.bz2
  - .fa.bz2
  - .2bit
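A quick way to check a file name or URL against the list above is a `case` pattern match on its suffix. This is a sketch with a hypothetical helper name, `is_supported_reference`. It treats the short bzip2 form as `.fa.bz2`, which is an assumption, and it checks the name only, not the file contents.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the name ends in a listed extension.
# Assumption: the short bzip2 form is .fa.bz2.
is_supported_reference() {
  case "$1" in
    *.fasta.gz|*.fa.gz|*.fasta.bz2|*.fa.bz2|*.2bit) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   is_supported_reference "https://some.where.com/your.genome.fa.gz" && echo "ok"
```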
- Get a URL for your reference genome. You may need to upload it somewhere on the internet.
- Get a URL for a gzipped blacklist BED file for your genome. If you don't have one, skip this step. An example blacklist for hg38 is here.
- Find the following lines in scripts/build_genome_data.sh and modify them as follows. Give a good name [YOUR_OWN_GENOME] for your genome. For MITO_CHR_NAME, use the correct mitochondrial chromosome name of your genome (e.g. chrM or MT). For REGEX_BFILT_PEAK_CHR_NAME, a Perl-style regular expression must be used to keep only regular chromosome names in the blacklist-filtered (.bfilt.) peak files. These .bfilt. peak files are considered the final peak output of the pipeline, and the peak BED files for genome browser tracks (.bigBed and .hammock.gz) are converted from them. Chromosome name filtering with REGEX_BFILT_PEAK_CHR_NAME is done even without a blacklist.

      ...
      elif [[ $GENOME == "YOUR_OWN_GENOME" ]]; then
        # Perl style regular expression to keep regular chromosomes only.
        # this reg-ex will be applied to peaks after blacklist filtering (b-filt) with "grep -P".
        # so that b-filt peak file (.bfilt.*Peak.gz) will only have chromosomes matching with this pattern
        # this reg-ex will work even without a blacklist.
        # you will still be able to find a .bfilt. peak file
        REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"
        # mitochondrial chromosome name (e.g. chrM, MT)
        MITO_CHR_NAME="chrM"
        # URL for your reference FASTA (fasta, fasta.gz, fa, fa.gz, 2bit)
        REF_FA="https://some.where.com/your.genome.fa.gz"
        # 3-col blacklist BED file to filter out overlapping peaks from b-filt peak file (.bfilt.*Peak.gz file).
        # leave it empty if you don't have one
        BLACKLIST=
      ...
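Before committing to a value for REGEX_BFILT_PEAK_CHR_NAME, it can help to try the pattern on a few chromosome names with `grep -P`, the tool named in the script comments. The sketch below is only a test harness: `filter_chr_names` is a hypothetical helper, the sample names are illustrative, and it needs a grep built with `-P` support (for example GNU grep). How the pipeline applies or anchors the pattern internally is not shown in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only input lines matching a Perl-style pattern.
# Requires a grep with -P (PCRE) support, such as GNU grep.
filter_chr_names() {
  grep -P "$1"
}

# Try the example pattern on a handful of illustrative names.
# chr1, chr22, chrX and chrY pass; chrM, chrUn_KI270742v1 and chrEBV do not.
printf '%s\n' chr1 chr22 chrX chrY chrM chrUn_KI270742v1 chrEBV |
  filter_chr_names 'chr[\dXY]+'
```

Used this way, `grep -P` matches substrings, so a name such as chr1_KI270706v1_random also passes this particular test because it contains chr1. If your genome has such names, check the resulting .bfilt. peak files to confirm which chromosomes were kept.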
- Specify a destination directory for your genome database and run the installer. This will take several hours.

      $ bash scripts/build_genome_data.sh [YOUR_OWN_GENOME] [DESTINATION_DIR]
- Find a TSV file in the destination directory and use it for "chip.genome_tsv" in your input JSON.