PathogenTrack is an unsupervised computational tool that uses unmapped single-cell RNA-seq (scRNA-seq) reads to characterize intracellular pathogens at the single-cell level. It is a Python-based script that identifies and quantifies reads from intracellular pathogenic viruses and bacteria in individual cells.
PathogenTrack has been tested on various simulated and real scRNA-seq datasets and performed robustly. The details are described in our paper: PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19.
PathogenTrack has been tested on a Linux platform running the CentOS 7 operating system, with 120 GB of RAM and 40 computational threads.
1. Install Miniconda on the Linux platform. For details, please refer to Miniconda Installation.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
2. Install PathogenTrack.
conda env create -f environment.yml
Alternatively, users can install the dependencies manually. The dependencies and their tested versions are listed below.
Package | Version |
---|---|
python | 3.6.10 |
biopython | 1.78 |
fastp | 0.12.4 |
star | 2.7.5a |
umi_tools | 1.1.1 |
kraken2 | 2.1.1 |
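After a manual installation, it can help to confirm that the command-line tools from the table are reachable before starting a long run. The following is a minimal sketch, not part of PathogenTrack: the helper name `find_missing_tools` is hypothetical, and the executable names (`fastp`, `STAR`, `umi_tools`, `kraken2`) are assumptions based on how these packages are commonly installed. It only checks the PATH and does not verify versions or the Python packages.

```python
import shutil

# Executable names are assumptions based on common installations
# (for example from bioconda); adjust them if your setup differs.
REQUIRED_TOOLS = ["fastp", "STAR", "umi_tools", "kraken2"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

Run the script inside the activated conda environment so that the PATH matches the one PathogenTrack will see.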
Download the Human GRCh38 genome and genome annotation file, and then decompress them:
wget ftp://ftp.ensembl.org/pub/release-101/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz
gzip -d Homo_sapiens.GRCh38.101.gtf.gz
Build the STAR index with the following command:
STAR --runThreadN 16 --runMode genomeGenerate --genomeDir ./ \
--genomeFastaFiles ./Homo_sapiens.GRCh38.dna.toplevel.fa \
--sjdbGTFfile ./Homo_sapiens.GRCh38.101.gtf \
--sjdbOverhang 100
Download the pre-built MiniKraken2 database and decompress it:
wget ftp://ftp.ccb.jhu.edu/pub/data/kraken2_dbs/minikraken_8GB_202003.tgz
tar zxf minikraken_8GB_202003.tgz
Before running this tutorial, you should run cellranger or alevin to obtain the single-cell gene expression matrix. Here, we take simulated 10X sequencing data as an example:
First, we use cellranger to get the scRNA-seq expression matrix and the valid barcodes:
cellranger count --id cellranger_out --transcriptome /path/to/cellranger_database/ --fastqs /path/to/fastq_directory/
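The valid barcodes then need to be available as a plain-text `barcodes.tsv` file. Recent Cell Ranger versions write them gzipped, with a GEM-well suffix such as `-1` appended to each barcode. The sketch below shows one way to produce a plain file; the helper names are hypothetical, the input path is an assumption based on the `--id cellranger_out` run, and whether PathogenTrack requires the suffix to be removed is also an assumption (the suffix is not part of the Read 1 sequence), so please check QUICK_START.md for the exact expected format.

```python
import gzip


def strip_gem_suffix(barcode):
    """Drop a trailing '-<digits>' GEM-well suffix such as '-1', if present."""
    head, sep, tail = barcode.rpartition("-")
    return head if sep and tail.isdigit() else barcode


def write_plain_barcodes(src_gz, dest_tsv):
    """Copy barcodes from a gzipped TSV to a plain-text file, one per line.

    Returns the number of barcodes written.
    """
    count = 0
    with gzip.open(src_gz, "rt") as src, open(dest_tsv, "w") as dest:
        for line in src:
            barcode = line.strip()
            if barcode:  # skip blank lines
                dest.write(strip_gem_suffix(barcode) + "\n")
                count += 1
    return count


# Example call (paths are assumptions, adjust to your own output directory):
# write_plain_barcodes(
#     "cellranger_out/outs/filtered_feature_bc_matrix/barcodes.tsv.gz",
#     "barcodes.tsv",
# )
```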
Then we run PathogenTrack to identify and quantify pathogen expression at the single-cell level:
conda activate PathogenTrack
python PathogenTrack.py count --project_id PathogenTrack_out --pattern CCCCCCCCCCCCCCCCNNNNNNNNNN \
--min_reads 10 --confidence 0.11 --star_index ~/database/STAR_index/ \
--kraken_db ~/database/minikraken_8GB_20200312/ --barcode barcodes.tsv \
--read1 simulation_S1_L001_R1_001.fastq.gz \
--read2 simulation_S1_L001_R2_001.fastq.gz
IMPORTANT: Read 1 in the example consists of a 16 bp cell barcode (CB) followed by a 10 bp UMI, so the --pattern is CCCCCCCCCCCCCCCCNNNNNNNNNN (16 C and 10 N). Users must adjust the pattern to match the structure of their own Read 1.
Note: Processing one sample may take 4-6 hours, depending on the available computational resources and the size of the raw single-cell data.
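To avoid miscounting characters when adapting the pattern, it can be generated from the two lengths. This is a minimal sketch: `build_pattern` is a hypothetical helper, not part of PathogenTrack, and it assumes the cell barcode comes first in Read 1 and the UMI follows immediately, as in the example.

```python
def build_pattern(cb_length, umi_length):
    """Return a pattern with one 'C' per cell-barcode base, then one 'N' per UMI base."""
    if cb_length <= 0 or umi_length <= 0:
        raise ValueError("cb_length and umi_length must be positive integers")
    return "C" * cb_length + "N" * umi_length


# Layout from the example: 16 bp cell barcode followed by a 10 bp UMI.
print(build_pattern(16, 10))  # CCCCCCCCCCCCCCCCNNNNNNNNNN
```

For a library with a 16 bp barcode and a 12 bp UMI, `build_pattern(16, 12)` would give the corresponding 28-character pattern to pass to `--pattern`.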
Please see QUICK_START.md for a complete tutorial.
If you have any questions/problems with PathogenTrack, feel free to leave an issue! We will try our best to provide support, address new issues, and keep improving this software.
PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19