Pangenome generation (old)
Pangenome databases are available for more than 400 species; see the download page.
- -c: clade or species database name
- --i_fna: input folder for genome sequences
- --i_gff: input folder for gene annotation files (gene locations)
- --tmp: folder for temporary result files
- -o: output folder for the pangenome database
- --uc: provide additional files of the usearch7 clustering
- --verbose: display progress information
./panphlan/panphlan_pangenome_generation.py -h
--i_ffn INPUT_FFN_FOLDER
Folder containing the .ffn gene sequence files
--i_fna INPUT_FNA_FOLDER
Folder containing the .fna genome sequence files
--i_gff INPUT_GFF_FOLDER
Folder containing the .gff gene annotation files
-c CLADE_NAME, --clade CLADE_NAME
Name of the species pangenome database, for example:
-c ecoli17
-o OUTPUT_FOLDER, --output OUTPUT_FOLDER
Result folder for all database files
--th IDENTITY_PERCENATGE
Threshold of gene sequence similarity (in percentage),
default: 95.0 %.
--tmp TEMP_FOLDER Folder for temporary files, default: TMP_panphlan_db
--uc Keep all usearch7 output files
--verbose Show progress information
-v, --version Prints the current PanPhlAn version and exits
panphlan_pangenome_generation.py requires Usearch 7 or Roary.
The following example generates a PanPhlAn pangenome database of Eubacterium rectale from five reference genomes available at NCBI.
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.fna.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.gff.gz
Genomes are located in folder fna/; gene annotation files are located in gff/.
Rename the files to short filenames; each filename is used as the genome ID.
ls fna
GCF_000020605.fna.gz GCF_001404855.fna.gz GCF_001405295.fna.gz GCF_001406375.fna.gz GCF_001406835.fna.gz
ls gff
GCF_000020605.gff.gz GCF_001404855.gff.gz GCF_001405295.gff.gz GCF_001406375.gff.gz GCF_001406835.gff.gz
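The renaming step can be scripted. The following is a minimal sketch, assuming the files were downloaded with their original NCBI names (ending in _genomic.fna.gz or _genomic.gff.gz) and that the part before the first dot is unique per genome. The helper name shorten_names is hypothetical and not part of PanPhlAn.

```shell
# Hypothetical helper: shorten NCBI file names to the bare accession, e.g.
# GCF_000020605.1_ASM2060v1_genomic.fna.gz -> GCF_000020605.fna.gz
shorten_names() {
    dir=$1
    for f in "$dir"/*_genomic.*; do
        [ -e "$f" ] || continue      # skip when the glob matched nothing
        base=$(basename "$f")
        id=${base%%.*}               # text before the first dot
        ext=${base#*_genomic.}       # text after "_genomic."
        mv -- "$f" "$dir/$id.$ext"
    done
}

shorten_names fna
shorten_names gff
```

Two assembly versions of the same accession would map to the same short name, so check for such duplicates before running this.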
Roary is a recommended alternative to Usearch7 clustering.
./panphlan/panphlan_pangenome_generation.py -c erectale18 --i_fna fna/ --i_gff gff/ -o database/ --verbose
Option -c specifies the species database name to use in PanPhlAn: erectale18 (Eubacterium rectale, version 2018).
The nine generated database files are located in the -o output folder database/ and can be moved to the BOWTIE2_INDEXES directory, if it exists.
ls database/
panphlan_erectale18.1.bt2
panphlan_erectale18.2.bt2
panphlan_erectale18.3.bt2
panphlan_erectale18.4.bt2
panphlan_erectale18_annotations.csv
panphlan_erectale18_centroids.ffn
panphlan_erectale18_pangenome.csv
panphlan_erectale18.rev.1.bt2
panphlan_erectale18.rev.2.bt2
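Before moving the database, it may help to confirm that all nine files in the listing are present. This is a sketch only: check_db_files is a hypothetical helper, and the suffix list is copied from the listing rather than from the PanPhlAn source.

```shell
# Hypothetical helper: succeed only if all nine database files exist.
# $1 = database folder, $2 = clade name (for example erectale18)
check_db_files() {
    dir=$1
    clade=$2
    missing=0
    for suffix in .1.bt2 .2.bt2 .3.bt2 .4.bt2 .rev.1.bt2 .rev.2.bt2 \
                  _annotations.csv _centroids.ffn _pangenome.csv; do
        if [ ! -e "$dir/panphlan_${clade}${suffix}" ]; then
            echo "missing: $dir/panphlan_${clade}${suffix}" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]             # exit status reports the result
}

# Usage: check_db_files database erectale18 && echo "database complete"
```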
mv database/panphlan_erectale18* $BOWTIE2_INDEXES
cd database/
../panphlan/panphlan_profile.py -c erectale18 --add_strains --o_dna genefamily_presence_absence.tsv
genefamily_presence_absence.tsv contains the gene-family profiles of the reference genomes. It can be used to detect outlier reference genomes that are not related to the species.
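One rough way to spot outliers is to count the gene families present in each reference genome and look for genomes with unusually low or high totals. The sketch below assumes the file is a tab-separated table with genome IDs in the header row, gene families in the first column, and 1 marking presence. That layout is an assumption here, and count_present_families is a hypothetical helper.

```shell
# Hypothetical helper: print "<genome ID> TAB <number of families present>"
# for every genome column of a presence/absence table (assumed layout above).
count_present_families() {
    awk -F '\t' '
        NR == 1 { n = NF; for (i = 2; i <= NF; i++) name[i] = $i; next }
        { for (i = 2; i <= NF; i++) if ($i == 1) count[i]++ }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}

# Usage: count_present_families genefamily_presence_absence.tsv
```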
What about plasmids and contigs? Each strain is represented by a single genome .fna FASTA file and one additional file with gene information: either an .ffn file (gene sequences) or a .gff file (gene annotations). All contigs and plasmids of a strain have to be in the same .fna multi-FASTA file. In the same way, all gene information of a strain has to be in a single .ffn or .gff file.
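If a strain's chromosome, plasmids, or contigs come as separate FASTA files, they can be concatenated into one multi-FASTA file. A minimal sketch follows; merge_strain_fasta and all file names are hypothetical, and each input file is assumed to be uncompressed and to end with a newline.

```shell
# Hypothetical helper: concatenate several FASTA files of one strain.
# $1 = output file, remaining arguments = input FASTA files
merge_strain_fasta() {
    out=$1
    shift
    cat -- "$@" > "$out"
}

# Usage (hypothetical file names):
# merge_strain_fasta fna/strainA.fna strainA_chromosome.fna strainA_plasmid1.fna
```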
See also:
How to find and download reference genomes from NCBI?
How to import Roary pangenome into PanPhlAn?
Screen your metagenomic samples for species-related genes by mapping against the species database: PanPhlAn mapping
PanPhlAn is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.