Skip to content

PanPhlAn profiling

Leonard Dubois edited this page May 18, 2020 · 2 revisions

panphlan_profile.py is used to merge and process the panphlan_map.py results for getting the final gene presence/absence profiles of detected strains in samples, or for extracting the transcriptional activity of individual strains based on DNA & RNAseq pairs of the same sample.

Example:
./panphlan_profile.py -c erectale -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains


Input

  • -c CLADE_NAME to specify the clade or species database-name; PanPhlAn will search for a file named panphlan_CLADE_NAME_pangenome.csv
  • -i input directory countaining the panphlan_map.py results
  • --add_strains for adding also gene-family presence/absence profiles of reference genomes
  • --o_covplot FILENAME in order to generate the coverage curves figure
  • -func_annot FILENAME if annotation must be added on the gene. The provided file must map gen-families name to annotation data (UniRef50, KEGG ID, GO...). If several are present in the file --field must also be detailed.

Output

  • --o_dna final result of gene-family presence/absence profiles of all detected strains. It consist of a .csv file binary profile matrix. Gene-families are marked 1 when present and 0 when absent.

Example of a gene-familiy profile table in result file result_gene_presence_absence.csv (output option --o_dna)

        sample01 sample04 sample05 sample08
g00001      1       0        1        0
g00002      0       1        1        1
g00003      0       0        0        1
g00003      1       1        1        1

The presence/absence matrix can be used in mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains by heatmaps or PCoA plots; for investigating which gene-families are present in same strains, but not in others; and for finding potential relations of diseases associated to the presence of specific genes.

Help

./panphlan_profile.py -h
    -h, --help            show this help message and exit
  -i INPUT_DNA_FOLDER, --i_dna INPUT_DNA_FOLDER
                        Input directory of panphlan_map.py results, containing
                        SAMPLE.csv.bz2 files
  --i_bowtie2_indexes INPUT_BOWTIE2_INDEXES
                        Input directory of bowtie2 indexes
  -c CLADE_NAME, --clade CLADE_NAME
                        Panphlan species/clade database (e.g.: ecoli16)
  -o OUTPUT_FILE, --o_dna OUTPUT_FILE
                        Write gene family presence/absence matrix:
                        gene_presence_absence.csv
  --i_rna INPUT_RNA_FOLDER
                        RNA-seq: input directory of RNA mapping results
                        SAMPLE_RNA.csv.bz2
  --sample_pairs DNA_RNA_MAPPING
                        RNA-seq: list of DNA-RNA sequencing pairs from the
                        same biological sample.
  --th_zero MINIMUM_THRESHOLD
                        Gene family presence/absence threshold: lower are non-
                        present gene families.
  --th_present MEDIUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        present gene families.
  --th_multicopy MAXIMUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        multicopy gene families.
  --min_coverage MIN_COVERAGE_MEDIAN
                        Minimum coverage threshold, default: 2X
  --left_max LEFT_MAX   Strain presence/absence plateau curve threshold: left
                        max [1.25]
  --right_min RIGHT_MIN
                        Strain presence/absence plateau curve threshold: right
                        min [0.75]
  --rna_max_zeros RNA_MAX_ZEROES
                        Max accepted percent of zero coveraged gene-families
                        (default: <10 %).
  --rna_norm_percentile RNA_NORM_PERCENTILE
                        Percentile for normalizing RNA/DNA ratios
  --strain_similarity_perc SIMILARITY_PERCENTAGE
                        Minimum threshold (percentage) for genome length to
                        add a reference genome to presence/absence matrix
                        (default: 50).
  --np NON_PRESENCE_TOKEN
                        User-defined string to mark non-present genes. [NP]
  --nan NOT_A_NUMBER_TOKEN
                        User-defined string to mark multicopy and undefined
                        genes. [NaN]
  --o_covplot COV_PLOT_NAME
                        Filename for gene-family coverage plot.
  --o_covplot_normed NOR_PLOT_NAME
                        Filename for normalized gene-family coverage plot.
  --o_cov PANCOVERAGE_FILE
                        Write raw gene-family coverage matrix.
  --o_idx DNA_INDEX_FILE
                        Write gene-family plateau definitions (1, -1, -2, -3)
  --o_rna RNA_EXPRS_FILE
                        Write normalized gene-family transcription values
                        (RNA-seq).
  --strain_hit_genes_perc GENEHIT_PERC_PER_STRAIN
                        Write overlap of gene-families between samples-strains
                        and reference genomes.
  --i_cov INPUT_COV_MATRIX
                        Read coverage matrix (option --o_cov) for re-analysis
                        using other thresholds
  --num_genomes INPUT_COV_GENOMES
                        In addition to option --i_cov: number of reference
                        genomes
  --genome_avg_length INPUT_COV_LENGTH
                        In addition to option --i_cov: average number of gene-
                        families
  --func_annot FUNC_ANNOT_FILE
                        File mapping UniRef IDs to GO/KEGG/... annotation for
                        functional characterization
  -f ANNOT_FIELD, --field ANNOT_FIELD
                        Field in the annotation file that must be added to the
                        presence/absence matrix
  --add_strains         Add reference genomes to gene-family presence/absence
                        matrix.
  --interactive         Plot coverage curves to screen, and not to a file.
  --verbose             Display progress information.
  -v, --version         Prints the current PanPhlAn version and exits.


PanPhlAn profile options

--strain_hit_genes_perc Overlap of gene-families between samples-strains and reference genomes

By adding the option --strain_hit_genes_perc result_genes_per_strain.csv, we get the percentage of how many gene-families are common between detected strains and reference genomes. In our tutorial example, all detected strains share less than 70% genes, which shows that they are distinct form known reference genomes.

head result_genes_per_strain.csv

strainID  number_of_genes SRS013951 SRS014459 SRS015065 SRS019161
CP001107    3420           58.4      60.8      61.2      56.3
FP929042    2879           68.8      67.4      67.6      67.2
FP929043    3182           66.7      61.1      61.5      64.7

PanPhlAn profile thresholds

Strain detection thresholds

Depending on the sample coverage depth and expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. This includes a strain minimum coverage and settings for accepted shapes of strain abundance curves.
--min_coverage 5 --left_max 1.18 --right_min 0.82 (very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75 (default)
--min_coverage 1 --left_max 1.70 --right_min 0.30 (very sensitive)

Gene-family detection thresholds

For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
All gene-families larger than 0.5 are considered as present (1), and lower as absent (0).


PanPhlAn RNA-seq

In-vivo transcriptional activity of individual strains of a species

When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In a first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA samples and converted into log and median normalized RNA/DNA ratios.

a) Define DNA RNA sample pairs

PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample-ID's from the same biological sample. The pairs are defined by a tab-separated text file in which the first column specifies the DNA sample-ID and the second column the corresponding RNA sample-ID, both without file extension.

cat sample_pairs_DNA_RNA.txt

 # DNA   RNA
 sampleA   sampleA_RNA
 sampleB   sampleB_RNA
 sampleC   sampleC_RNA
 sampleD   sampleD_RNA

b) Download the species pangenome database

For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
ecoli16 Escherichia coli (version 2016)

c) Mapping

Both DNA and RNA samples are processed in the same way by panphlan_map.py but saved in different folders.

./panphlan_map.py -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA
./panphlan_map.py -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA

Now, we have mapping results of both DNA and RNA which are required in the next step for extracting gene-family transcript profiles.

ls DNA/
 sampleA_ecoli16.csv.bz2
 sampleB_ecoli16.csv.bz2
 sampleC_ecoli16.csv.bz2
 sampleD_ecoli16.csv.bz2

ls RNA/ 
 sampleA_RNA_ecoli16.csv.bz2
 sampleB_RNA_ecoli16.csv.bz2
 sampleC_RNA_ecoli16.csv.bz2
 sampleD_RNA_ecoli16.csv.bz2

d) Get transcription rates

Now, panphlan_profile.py can be used to extract the transcription rates of strain-specific genes.

./panphlan_profile.py -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv

As input we provide the mapping results of DNA and RNA samples and the sample-pair text-file that defines which DNA and RNA sample belong together.
As result we get a table of transcription rates for each pangenome gene-family for each sample. Gene-families not present in the sample specific strain are marked as NP (not present). Gene-families that could not clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN includes multi-copy genes as well as genes at the borderline between presence and absence. Normalized transcription rates are positive values centered at 1. A low values smaller than 1 refers to a low transcriptional activity, a large value bigger than 1 refers to a high transcriptional activity relative to all other gene-family activities (median normalized).

Example result of transcription rates:

cat results_transcription_rates.csv

         sample_A  sample_B  sample_C  sample_D
g000002	  0.814     0.779     0.000     NaN
g000003   0.982     0.770     1.183     0.000
g000004	  0.863     0.917     1.219     NaN
g000005	  NP        0.000     0.000     NaN
g000006	  0.773     0.000     NP        0.000

See also: Error: RuntimeWarning: invalid value encountered in double_scalars