Skip to content

PanPhlAn profiling

Leonard Dubois edited this page May 18, 2020 · 2 revisions

panphlan_profile.py is used to merge and process the panphlan_map.py results for getting the final gene presence/absence profiles of detected strains in samples, or for extracting the transcriptional activity of individual strains based on DNA & RNAseq pairs of the same sample.

Example:
./panphlan_profile.py -c erectale -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains


Input

  • -c CLADE_NAME to specify the clade or species database-name; PanPhlAn will search for a file named panphlan_CLADE_NAME_pangenome.csv
  • -i input directory countaining the panphlan_map.py results
  • --add_strains for adding also gene-family presence/absence profiles of reference genomes
  • --o_covplot FILENAME in order to generate the coverage curves figure
  • -func_annot FILENAME if annotation must be added on the gene. The provided file must map gen-families name to annotation data (UniRef50, KEGG ID, GO...). If several are present in the file --field must also be detailed.

Output

  • --o_dna final result of gene-family presence/absence profiles of all detected strains. It consist of a .csv file binary profile matrix. Gene-families are marked 1 when present and 0 when absent.

Example of a gene-familiy profile table in result file result_gene_presence_absence.csv (output option --o_dna)

        sample01 sample04 sample05 sample08
g00001      1       0        1        0
g00002      0       1        1        1
g00003      0       0        0        1
g00003      1       1        1        1

The presence/absence matrix can be used in mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains by heatmaps or PCoA plots; for investigating which gene-families are present in same strains, but not in others; and for finding potential relations of diseases associated to the presence of specific genes.

Help

./panphlan_profile.py -h
    -h, --help            show this help message and exit
  -i INPUT_DNA_FOLDER, --i_dna INPUT_DNA_FOLDER
                        Input directory of panphlan_map.py results, containing
                        SAMPLE.csv.bz2 files
  --i_bowtie2_indexes INPUT_BOWTIE2_INDEXES
                        Input directory of bowtie2 indexes
  -c CLADE_NAME, --clade CLADE_NAME
                        Panphlan species/clade database (e.g.: ecoli16)
  -o OUTPUT_FILE, --o_dna OUTPUT_FILE
                        Write gene family presence/absence matrix:
                        gene_presence_absence.csv
  --i_rna INPUT_RNA_FOLDER
                        RNA-seq: input directory of RNA mapping results
                        SAMPLE_RNA.csv.bz2
  --sample_pairs DNA_RNA_MAPPING
                        RNA-seq: list of DNA-RNA sequencing pairs from the
                        same biological sample.
  --th_zero MINIMUM_THRESHOLD
                        Gene family presence/absence threshold: lower are non-
                        present gene families.
  --th_present MEDIUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        present gene families.
  --th_multicopy MAXIMUM_THRESHOLD
                        Gene family presence/absence threshold: higher are
                        multicopy gene families.
  --min_coverage MIN_COVERAGE_MEDIAN
                        Minimum coverage threshold, default: 2X
  --left_max LEFT_MAX   Strain presence/absence plateau curve threshold: left
                        max [1.25]
  --right_min RIGHT_MIN
                        Strain presence/absence plateau curve threshold: right
                        min [0.75]
  --rna_max_zeros RNA_MAX_ZEROES
                        Max accepted percent of zero coveraged gene-families
                        (default: <10 %).
  --rna_norm_percentile RNA_NORM_PERCENTILE
                        Percentile for normalizing RNA/DNA ratios
  --strain_similarity_perc SIMILARITY_PERCENTAGE
                        Minimum threshold (percentage) for genome length to
                        add a reference genome to presence/absence matrix
                        (default: 50).
  --np NON_PRESENCE_TOKEN
                        User-defined string to mark non-present genes. [NP]
  --nan NOT_A_NUMBER_TOKEN
                        User-defined string to mark multicopy and undefined
                        genes. [NaN]
  --o_covplot COV_PLOT_NAME
                        Filename for gene-family coverage plot.
  --o_covplot_normed NOR_PLOT_NAME
                        Filename for normalized gene-family coverage plot.
  --o_cov PANCOVERAGE_FILE
                        Write raw gene-family coverage matrix.
  --o_idx DNA_INDEX_FILE
                        Write gene-family plateau definitions (1, -1, -2, -3)
  --o_rna RNA_EXPRS_FILE
                        Write normalized gene-family transcription values
                        (RNA-seq).
  --strain_hit_genes_perc GENEHIT_PERC_PER_STRAIN
                        Write overlap of gene-families between samples-strains
                        and reference genomes.
  --i_cov INPUT_COV_MATRIX
                        Read coverage matrix (option --o_cov) for re-analysis
                        using other thresholds
  --num_genomes INPUT_COV_GENOMES
                        In addition to option --i_cov: number of reference
                        genomes
  --genome_avg_length INPUT_COV_LENGTH
                        In addition to option --i_cov: average number of gene-
                        families
  --func_annot FUNC_ANNOT_FILE
                        File mapping UniRef IDs to GO/KEGG/... annotation for
                        functional characterization
  -f ANNOT_FIELD, --field ANNOT_FIELD
                        Field in the annotation file that must be added to the
                        presence/absence matrix
  --add_strains         Add reference genomes to gene-family presence/absence
                        matrix.
  --interactive         Plot coverage curves to screen, and not to a file.
  --verbose             Display progress information.
  -v, --version         Prints the current PanPhlAn version and exits.

PanPhlAn profile thresholds

Strain detection thresholds

Depending on the sample coverage depth and expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. This includes a strain minimum coverage and settings for accepted shapes of strain abundance curves.
--min_coverage 5 --left_max 1.18 --right_min 0.82 (very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75 (default)
--min_coverage 1 --left_max 1.70 --right_min 0.30 (very sensitive)

Gene-family detection thresholds

For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
All gene-families larger than 0.5 are considered as present (1), and lower as absent (0).


PanPhlAn RNA-seq

In-vivo transcriptional activity of individual strains of a species

When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In a first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA samples and converted into log and median normalized RNA/DNA ratios.

a) Define DNA RNA sample pairs

PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample-ID's from the same biological sample. The pairs are defined by a tab-separated text file in which the first column specifies the DNA sample-ID and the second column the corresponding RNA sample-ID, both without file extension.

cat sample_pairs_DNA_RNA.txt

 # DNA   RNA
 sampleA   sampleA_RNA
 sampleB   sampleB_RNA
 sampleC   sampleC_RNA
 sampleD   sampleD_RNA

b) Download the species pangenome database

For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
ecoli16 Escherichia coli (version 2016)

c) Mapping

Both DNA and RNA samples are processed in the same way by panphlan_map.py but saved in different folders.

./panphlan_map.py -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA
./panphlan_map.py -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA

Now, we have mapping results of both DNA and RNA which are required in the next step for extracting gene-family transcript profiles.

ls DNA/
 sampleA_ecoli16.csv.bz2
 sampleB_ecoli16.csv.bz2
 sampleC_ecoli16.csv.bz2
 sampleD_ecoli16.csv.bz2

ls RNA/ 
 sampleA_RNA_ecoli16.csv.bz2
 sampleB_RNA_ecoli16.csv.bz2
 sampleC_RNA_ecoli16.csv.bz2
 sampleD_RNA_ecoli16.csv.bz2

d) Get transcription rates

Now, panphlan_profile.py can be used to extract the transcription rates of strain-specific genes.

./panphlan_profile.py -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv

As input we provide the mapping results of DNA and RNA samples and the sample-pair text-file that defines which DNA and RNA sample belong together.
As result we get a table of transcription rates for each pangenome gene-family for each sample. Gene-families not present in the sample specific strain are marked as NP (not present). Gene-families that could not clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN includes multi-copy genes as well as genes at the borderline between presence and absence. Normalized transcription rates are positive values centered at 1. A low values smaller than 1 refers to a low transcriptional activity, a large value bigger than 1 refers to a high transcriptional activity relative to all other gene-family activities (median normalized).

Example result of transcription rates:

cat results_transcription_rates.csv

         sample_A  sample_B  sample_C  sample_D
g000002	  0.814     0.779     0.000     NaN
g000003   0.982     0.770     1.183     0.000
g000004	  0.863     0.917     1.219     NaN
g000005	  NP        0.000     0.000     NaN
g000006	  0.773     0.000     NP        0.000

See also: Error: RuntimeWarning: invalid value encountered in double_scalars