
Pangenome generation (old)

Leonard Dubois edited this page May 18, 2020 · 1 revision

Pangenome databases are available for more than 400 species, see download page.

Options

  • -c to specify the clade or species database name;
  • --i_fna input folder for genome sequences;
  • --i_gff input folder for gene annotation files (gene locations);
  • --tmp folder for saving temporary result files;
  • -o output folder for the pangenome database;
  • --uc to keep the additional output files of the usearch7 clustering;
  • --verbose to display progress information.

Help -h

./panphlan/panphlan_pangenome_generation.py -h
  --i_ffn INPUT_FFN_FOLDER
                        Folder containing the .ffn gene sequence files
  --i_fna INPUT_FNA_FOLDER
                        Folder containing the .fna genome sequence files
  --i_gff INPUT_GFF_FOLDER
                        Folder containing the .gff gene annotation files
  -c CLADE_NAME, --clade CLADE_NAME
                        Name of the species pangenome database, for example:
                        -c ecoli17
  -o OUTPUT_FOLDER, --output OUTPUT_FOLDER
                        Result folder for all database files
  --th IDENTITY_PERCENATGE
                        Threshold of gene sequence similarity (in percentage),
                        default: 95.0 %.
  --tmp TEMP_FOLDER     Folder for temporary files, default: TMP_panphlan_db
  --uc                  Keep all usearch7 output files
  --verbose             Show progress information
  -v, --version         Prints the current PanPhlAn version and exits

panphlan_pangenome_generation.py requires Usearch 7 or Roary


Generating a user-specific pangenome database

This example generates a PanPhlAn pangenome database of Eubacterium rectale based on five reference genomes available at NCBI.

1) Download all five genome (.fna) files and the corresponding gene annotation (.gff) files from NCBI

wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.fna.gz
wget -P fna/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.fna.gz

wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.gff.gz
wget -P gff/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.gff.gz
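
The ten download commands above follow a regular pattern: the nine digits of the assembly accession are split into three folder levels, followed by a folder named after the full assembly. The sketch below rebuilds the same URLs from the assembly names. The helper names ncbi_url and download_genomes are hypothetical and not part of PanPhlAn, and the sketch only covers GCF_ (RefSeq) assemblies like the ones used here.

```shell
# Build the download URL for an assembly such as GCF_000020605.1_ASM2060v1.
# $1 = full assembly name, $2 = file type (fna or gff)
ncbi_url() {
    asm=$1
    ext=$2
    digits=${asm#GCF_}        # drop the prefix: 000020605.1_ASM2060v1
    digits=${digits%%.*}      # keep the nine digits: 000020605
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/%s/%s/%s/%s/%s_genomic.%s.gz\n' \
        "$p1" "$p2" "$p3" "$asm" "$asm" "$ext"
}

# Download genome and annotation files for every assembly given as argument.
download_genomes() {
    for asm in "$@"; do
        wget -P fna/ "$(ncbi_url "$asm" fna)"
        wget -P gff/ "$(ncbi_url "$asm" gff)"
    done
}

# Uncomment to fetch the five genomes of this example:
# download_genomes GCF_000020605.1_ASM2060v1 GCF_001404855.1_13414_6_44 \
#     GCF_001405295.1_14207_7_7 GCF_001406375.1_14207_7_91 GCF_001406835.1_T1815
```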

Genomes are located in the folder fna/; gene annotation files are located in gff/.
Rename the files so that each has a short filename, which will be used as the genome ID.
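
One possible way to do the renaming is sketched below. It keeps only the accession before the first dot, so that GCF_000020605.1_ASM2060v1_genomic.fna.gz becomes GCF_000020605.fna.gz. The helper name short_name is hypothetical, and the loop assumes the files were downloaded with the NCBI naming shown above.

```shell
# Print the short filename for an NCBI download, keeping the accession
# (text before the first dot) and the .fna.gz or .gff.gz extension.
short_name() {
    base=$(basename "$1")
    accession=${base%%.*}            # e.g. GCF_000020605
    case "$base" in
        *.fna.gz) ext=fna.gz ;;
        *.gff.gz) ext=gff.gz ;;
        *)        ext=${base##*.} ;; # fallback: last extension only
    esac
    printf '%s.%s\n' "$accession" "$ext"
}

# Rename every downloaded file in place.
for f in fna/*_genomic.fna.gz gff/*_genomic.gff.gz; do
    [ -e "$f" ] || continue          # skip patterns that matched nothing
    mv -- "$f" "$(dirname "$f")/$(short_name "$f")"
done
```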

ls fna
GCF_000020605.fna.gz  GCF_001404855.fna.gz  GCF_001405295.fna.gz  GCF_001406375.fna.gz  GCF_001406835.fna.gz

ls gff
GCF_000020605.gff.gz  GCF_001404855.gff.gz  GCF_001405295.gff.gz  GCF_001406375.gff.gz  GCF_001406835.gff.gz
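
Since the filename serves as the genome ID, it seems reasonable to check that every genome in fna/ has an annotation file with the same ID in gff/ before running PanPhlAn. The following is a minimal sketch of such a check; missing_gff is a hypothetical helper, and the .fna.gz / .gff.gz extensions are taken from the listing above.

```shell
# Print the genome IDs that have a .fna.gz file in folder $1
# but no matching .gff.gz file in folder $2.
missing_gff() {
    for fna in "$1"/*.fna.gz; do
        [ -e "$fna" ] || continue            # no genome files found
        id=$(basename "$fna" .fna.gz)
        [ -e "$2/$id.gff.gz" ] || printf '%s\n' "$id"
    done
}

# Prints nothing when all genomes are paired.
missing_gff fna gff
```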

2) Run PanPhlAn to generate the pangenome database (Usearch7 clustering)

Consider also Roary as a recommended alternative to Usearch7 clustering.

./panphlan/panphlan_pangenome_generation.py -c erectale18 --i_fna fna/ --i_gff gff/ -o database/ --verbose

Option -c specifies the species database name to use in PanPhlAn: erectale18 (Eubacterium rectale, version 2018).

The nine generated database files are located in the -o output folder database/ and can be moved to the BOWTIE2_INDEXES directory, if it exists.

ls database/
 panphlan_erectale18.1.bt2
 panphlan_erectale18.2.bt2
 panphlan_erectale18.3.bt2
 panphlan_erectale18.4.bt2
 panphlan_erectale18_annotations.csv
 panphlan_erectale18_centroids.ffn
 panphlan_erectale18_pangenome.csv
 panphlan_erectale18.rev.1.bt2
 panphlan_erectale18.rev.2.bt2

mv database/panphlan_erectale18* "$BOWTIE2_INDEXES"

3) Check profiles of reference genomes

cd database/
../panphlan/panphlan_profile.py -c erectale18 --add_strains --o_dna genefamily_presence_absence.tsv

genefamily_presence_absence.tsv contains the gene-family profiles of the reference genomes. It is useful for detecting outlier reference genomes that do not belong to the species.
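
A simple first look at the profiles is to count the gene families present in each reference genome; a genome with a count very different from the others may deserve a closer look. The sketch below assumes that the file is tab-separated, that the first row holds the genome names, and that each following row holds a gene-family ID followed by 1 (present) or 0 (absent) per genome. Please verify this layout against your own file. genes_per_genome is a hypothetical helper, not a PanPhlAn command.

```shell
# Print "genome<TAB>number of gene families present" for each column
# of a presence/absence matrix (assumed layout: see text above).
genes_per_genome() {
    awk -F '\t' '
        NR == 1 { n = NF; for (i = 2; i <= n; i++) name[i] = $i; next }
        { for (i = 2; i <= n; i++) count[i] += ($i == 1) }
        END { for (i = 2; i <= n; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}

if [ -f genefamily_presence_absence.tsv ]; then
    genes_per_genome genefamily_presence_absence.tsv
fi
```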


FAQ

What about plasmids and contigs?

Each strain is represented by a single genome .fna FASTA file plus either an .ffn file (gene sequences) or a .gff file (gene annotation). All contigs and plasmids of a strain have to be in the same .fna multi-FASTA file. Likewise, all gene information of a strain has to be in a single .ffn or .gff file.
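
If the sequences of a strain are spread over several FASTA files, they can be concatenated into one multi-FASTA file. The sketch below shows this for the .fna genome file only; merging .gff files may need more care and is not covered. merge_fasta and the file names in the usage comment are hypothetical.

```shell
# Concatenate several FASTA files into one multi-FASTA file.
# Usage: merge_fasta OUTPUT INPUT...
merge_fasta() {
    out=$1
    shift
    cat -- "$@" > "$out"
}

# Example with hypothetical file names (one chromosome, one plasmid):
# merge_fasta fna/strainA.fna strainA_chromosome.fna strainA_plasmid.fna
```

Counting the header lines afterwards, for example with grep -c '^>' on the merged file, gives the number of contigs and plasmids it contains.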

See also:
How to find and download reference genomes from NCBI?
How to import Roary pangenome into PanPhlAn?

Next step

Screen your metagenomic samples for species-related genes by mapping against the species database: PanPhlAn mapping