The pipeline provides an easy and reproducible way to detect circRNAs from paired-end FASTQ files using four methods: CIRIquant, Circexplorer2, find_circ, and circRNA_finder.
The scripts and logs for handling TCCIA cohorts are available under the run_batch_from_qc path.
- (Optional) Create a dedicated Linux account, circrna, for deploying and running the circRNA identification pipeline.
- Install miniconda3 to the default path, i.e., ~/miniconda3. With the recommended account above, conda should be available at /home/circrna/miniconda3.
- Install mamba into the base env with:
    conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
- Install just with:
    curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to ~/bin
  Add ~/bin to your $PATH. You can replace ~/bin with any directory, as long as just is available when you open a terminal.
- Install rush and add its path to $PATH, as with just.
- (Optional) Set mirrors for conda and PyPI (pip) if necessary. For example, if you are in China, I recommend https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/ and https://mirrors.tuna.tsinghua.edu.cn/help/pypi/.
- Now, clone this repo:
    git clone [email protected]:coco/circrna-pipeline.git
- Install the conda environments one by one:
    cd circrna-pipeline
    cd CIRIquant
    just install
    cd ../FindCirc
    just install
    cd ../Circexplorer2
    just install
    cd ../circRNA_finder
    just install
Please make sure all the conda environments have been created with the required software.
$ conda env list
# conda environments:
#
base * /home/circrna/miniconda3
CIRIquant /home/circrna/miniconda3/envs/CIRIquant
Circexplorer2 /home/circrna/miniconda3/envs/Circexplorer2
FindCirc /home/circrna/miniconda3/envs/FindCirc
circRNA_finder /home/circrna/miniconda3/envs/circRNA_finder
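The setup can be sanity-checked with a small script. The following is a minimal sketch and not part of the repository: the function names missing_envs and missing_tools are hypothetical, and the only assumption is that `conda env list` prints environment names in the first column, as in the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the pipeline).

# missing_envs: read `conda env list` output on stdin and print every
# expected environment name that does not appear in the first column.
missing_envs() {
    listing=$(awk '!/^#/ && NF { print $1 }')
    for env in CIRIquant Circexplorer2 FindCirc circRNA_finder; do
        if ! printf '%s\n' "$listing" | grep -qxF -- "$env"; then
            echo "$env"
        fi
    done
}

# missing_tools: print every required command that is not on $PATH.
missing_tools() {
    for tool in conda just rush; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Usage (requires conda to be installed):
#   conda env list | missing_envs
#   missing_tools
```

Both functions print nothing when everything is in place, so empty output means the setup looks complete.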
Running the pipeline requires several reference files.
- Prepare the genome FASTA file and GTF file. We use GRCh38.primary_assembly.genome.fa and gencode.v34.annotation.gtf.
- For Circexplorer2, download the reference file hg38_ref_all.txt (it should correspond to your reference genome) with the fetch_ucsc.py script from the Circexplorer2 environment.
- Prepare the alignment indexes; config_zhou.sh records the commands. Note that you need to activate the corresponding environment before running the index commands.
  For example, to prepare the indexes for CIRIquant:
    source activate CIRIquant
    bwa index -a bwtsw -p /path/to/GRCh38.primary_assembly.genome.fa /path/to/GRCh38.primary_assembly.genome.fa
    hisat2-build -p 40 /path/to/GRCh38.primary_assembly.genome.fa /path/to/GRCh38.primary_assembly.genome.fa
- For CIRIquant, a yml file is required to set the paths of software and files, e.g., hg38.yml. Modify its contents to fit your setup (you can also create another yml file).
- Create a config.sh file that defines all required settings as shell variables; config_zhou.sh is a good reference (of course, you can modify its contents to fit your needs).
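Index building takes a long time, so it can help to confirm that the index files exist before launching a run. The sketch below is a hypothetical helper, not part of the repository. It assumes the index prefix equals the FASTA path, as in the CIRIquant example above, and relies on the standard file extensions written by `bwa index` (.amb, .ann, .bwt, .pac, .sa) and `hisat2-build` (.N.ht2, or .N.ht2l for large indexes).

```shell
#!/usr/bin/env bash
# check_indexes PREFIX: print one line per expected index file that is missing.
# Hypothetical helper; PREFIX is the value passed to `bwa index -p` and as the
# second argument of `hisat2-build`.
check_indexes() {
    prefix=$1
    for ext in amb ann bwt pac sa; do
        [ -e "$prefix.$ext" ] || echo "missing BWA index file: $prefix.$ext"
    done
    # hisat2-build writes .1.ht2 ... .8.ht2 (or .ht2l for large indexes);
    # checking the first file is enough for a quick sanity test.
    if [ ! -e "$prefix.1.ht2" ] && [ ! -e "$prefix.1.ht2l" ]; then
        echo "missing HISAT2 index file: $prefix.1.ht2 (or .1.ht2l)"
    fi
}

# Usage:
#   check_indexes /path/to/GRCh38.primary_assembly.genome.fa
```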
You need to preprocess your paired-end FASTQ files (QC, adapter trimming, etc.). fastp is a one-stop solution for this.
Currently, we only support file names with the suffixes _1.fastq.gz and _2.fastq.gz.
Please make sure your processed FASTQ files are written with these suffixes.
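Because the pipeline depends on the _1.fastq.gz / _2.fastq.gz naming, a quick check for unpaired files can catch problems early. This is a minimal sketch with a hypothetical function name (list_unpaired); it is not part of the repository and only inspects file names, not file contents.

```shell
#!/usr/bin/env bash
# list_unpaired DIR: report every FASTQ file in DIR whose mate is missing.
list_unpaired() {
    dir=$1
    for f in "$dir"/*_1.fastq.gz "$dir"/*_2.fastq.gz; do
        [ -e "$f" ] || continue   # skip glob patterns that matched nothing
        case $f in
            *_1.fastq.gz) mate="${f%_1.fastq.gz}_2.fastq.gz" ;;
            *)            mate="${f%_2.fastq.gz}_1.fastq.gz" ;;
        esac
        [ -e "$mate" ] || echo "missing mate for $(basename "$f"): expected $(basename "$mate")"
    done
}

# Usage:
#   list_unpaired /path/include/paired/fastq/files
```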
Create a shell script with the following settings and commands.
fqfile=./sample_list.txt
indir=/path/include/paired/fastq/files
oudir=/path/to/output
nthreads=20
config=/path/to/your/config.sh
common/ll_fq.py ${indir} --output ${fqfile}
nohup bash caller.sh ${fqfile} ${indir} ${oudir} ${nthreads} ${config} &> run.log &
The script has to be executed in the conda base environment (or anywhere python3 is installed). If you have prepared the sample_list.txt file on your own, you can comment out the common/ll_fq.py line and run the script in bash without any other requirement (e.g., python3 from the base environment is not needed).
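If you build sample_list.txt yourself, something like the sketch below may be a starting point. It rests on an unverified ASSUMPTION: that the list holds one sample ID per line, where the ID is the file name without the _1.fastq.gz suffix. Compare against a file produced by common/ll_fq.py before relying on it. The function name make_sample_list is hypothetical.

```shell
#!/usr/bin/env bash
# make_sample_list INDIR OUTFILE
# ASSUMED format (verify against common/ll_fq.py output): one sample ID per
# line, derived from the *_1.fastq.gz file names in INDIR.
make_sample_list() {
    indir=$1
    outfile=$2
    for f in "$indir"/*_1.fastq.gz; do
        [ -e "$f" ] || continue            # no matching files
        basename "$f" _1.fastq.gz          # strip directory and suffix
    done | sort > "$outfile"
}

# Usage:
#   make_sample_list /path/include/paired/fastq/files ./sample_list.txt
```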
The directory run_batch has examples for running our TCCIA cohorts.
I recommend testing the pipeline with 4 samples. If it goes well, run all the data files you have. The pipeline will skip samples with result files already generated.
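One way to set up such a trial run is to pass the pipeline a shortened sample list. The sketch below assumes that sample_list.txt lists one sample per line (an assumption, not confirmed by the text above); the function name and the test file name are hypothetical.

```shell
#!/usr/bin/env bash
# take_first_samples LIST N OUT: copy the first N lines of LIST to OUT.
# ASSUMPTION: LIST contains one sample per line.
take_first_samples() {
    head -n "$2" "$1" > "$3"
}

# Usage, reusing the variables from the run script:
#   take_first_samples ./sample_list.txt 4 ./sample_list.test.txt
#   nohup bash caller.sh ./sample_list.test.txt ${indir} ${oudir} ${nthreads} ${config} &> run_test.log &
```

Since samples with existing result files are skipped, rerunning afterwards with the full list should only process the remaining samples.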
After getting the detection results from the four methods, you can use the ensemble approach (code under aggr) to get the final results.
An example is given as:
workdir=/home/zhou/raid/IO_RNA/circRNA/PHS003316
bash -c "../../aggr/aggr_beds.R ${workdir} ${workdir}/aggr && ../../aggr/aggr_dataset.R ${workdir}/aggr ${workdir}/aggr ./PHS003316.txt" &> PHS003316_aggr.log
The output directory contains result files with names combined from sample names and methods.
$ ls *.bed
GO28753_ngs_rna_targrna_rnaaccess_EA_0f0fda909f_20150820.circexplorer2.bed
GO28753_ngs_rna_targrna_rnaaccess_EA_0f0fda909f_20150820.circRNA_finder.bed
GO28753_ngs_rna_targrna_rnaaccess_EA_0f0fda909f_20150820.CIRI.bed
GO28753_ngs_rna_targrna_rnaaccess_EA_0f0fda909f_20150820.find_circ.bed
Each result file usually contains the position, strand, and count value of each circRNA.
$ head GO28753_ngs_rna_targrna_rnaaccess_EA_0f0fda909f_20150820.circexplorer2.bed
chr3 9750944 9751949 + 31
chr3 11331339 11348035 + 16
chr3 12489783 12489989 + 19
chr3 12496517 12503784 + 1
chr3 15016151 15034809 + 11
chr3 15210801 15212212 + 12
chr3 15232518 15233973 + 1
chr3 15563357 15573242 - 3
chr3 17372074 17384015 - 2
chr3 18415110 18420991 - 9
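For a quick look at a single result file, the rows can be filtered on the count column with awk. This is only an inspection sketch, not the pipeline's ensemble step (that is what the aggr scripts are for). It assumes the five whitespace-separated columns shown above (chromosome, start, end, strand, count); the function name and the threshold are illustrative.

```shell
#!/usr/bin/env bash
# filter_by_count MIN [FILE...]: print rows whose 5th column (count) is >= MIN.
# Reads standard input when no FILE is given.
filter_by_count() {
    min=$1
    shift
    awk -v min="$min" '$5 + 0 >= min + 0' "$@"
}

# Usage:
#   filter_by_count 2 sample.circexplorer2.bed
```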
We would like to thank all contributors of the following two projects. Our pipeline was inspired by these two works and could not have been built without them.
© 2023 - OncoHarmony Network by Shixiang Wang, Yi Xiong, and Jian-Guo Zhou.