-
Notifications
You must be signed in to change notification settings - Fork 6
PanPhlAn profiling
panphlan_profile.py
is used to merge and process the panphlan_map.py
results for getting the final gene presence/absence profiles of detected strains in samples, or for extracting the transcriptional activity of individual strains based on DNA & RNAseq pairs of the same sample.
Example:
./panphlan_profile.py -c erectale -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains
-
-c CLADE_NAME
to specify the clade or species database-name; PanPhlAn will search for a file namedpanphlan_CLADE_NAME_pangenome.csv
-
-i
input directory countaining thepanphlan_map.py
results -
--add_strains
for adding also gene-family presence/absence profiles of reference genomes -
--o_covplot FILENAME
in order to generate the coverage curves figure -
-func_annot FILENAME
if annotation must be added on the gene. The provided file must map gen-families name to annotation data (UniRef50, KEGG ID, GO...). If several are present in the file--field
must also be detailed.
-
--o_dna
final result of gene-family presence/absence profiles of all detected strains. It consist of a.csv
file binary profile matrix. Gene-families are marked 1 when present and 0 when absent.
Example of a gene-familiy profile table in result file result_gene_presence_absence.csv
(output option --o_dna)
sample01 sample04 sample05 sample08
g00001 1 0 1 0
g00002 0 1 1 1
g00003 0 0 0 1
g00003 1 1 1 1
The presence/absence matrix can be used in mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains by heatmaps or PCoA plots; for investigating which gene-families are present in same strains, but not in others; and for finding potential relations of diseases associated to the presence of specific genes.
./panphlan_profile.py -h
-h, --help show this help message and exit
-i INPUT_DNA_FOLDER, --i_dna INPUT_DNA_FOLDER
Input directory of panphlan_map.py results, containing
SAMPLE.csv.bz2 files
--i_bowtie2_indexes INPUT_BOWTIE2_INDEXES
Input directory of bowtie2 indexes
-c CLADE_NAME, --clade CLADE_NAME
Panphlan species/clade database (e.g.: ecoli16)
-o OUTPUT_FILE, --o_dna OUTPUT_FILE
Write gene family presence/absence matrix:
gene_presence_absence.csv
--i_rna INPUT_RNA_FOLDER
RNA-seq: input directory of RNA mapping results
SAMPLE_RNA.csv.bz2
--sample_pairs DNA_RNA_MAPPING
RNA-seq: list of DNA-RNA sequencing pairs from the
same biological sample.
--th_zero MINIMUM_THRESHOLD
Gene family presence/absence threshold: lower are non-
present gene families.
--th_present MEDIUM_THRESHOLD
Gene family presence/absence threshold: higher are
present gene families.
--th_multicopy MAXIMUM_THRESHOLD
Gene family presence/absence threshold: higher are
multicopy gene families.
--min_coverage MIN_COVERAGE_MEDIAN
Minimum coverage threshold, default: 2X
--left_max LEFT_MAX Strain presence/absence plateau curve threshold: left
max [1.25]
--right_min RIGHT_MIN
Strain presence/absence plateau curve threshold: right
min [0.75]
--rna_max_zeros RNA_MAX_ZEROES
Max accepted percent of zero coveraged gene-families
(default: <10 %).
--rna_norm_percentile RNA_NORM_PERCENTILE
Percentile for normalizing RNA/DNA ratios
--strain_similarity_perc SIMILARITY_PERCENTAGE
Minimum threshold (percentage) for genome length to
add a reference genome to presence/absence matrix
(default: 50).
--np NON_PRESENCE_TOKEN
User-defined string to mark non-present genes. [NP]
--nan NOT_A_NUMBER_TOKEN
User-defined string to mark multicopy and undefined
genes. [NaN]
--o_covplot COV_PLOT_NAME
Filename for gene-family coverage plot.
--o_covplot_normed NOR_PLOT_NAME
Filename for normalized gene-family coverage plot.
--o_cov PANCOVERAGE_FILE
Write raw gene-family coverage matrix.
--o_idx DNA_INDEX_FILE
Write gene-family plateau definitions (1, -1, -2, -3)
--o_rna RNA_EXPRS_FILE
Write normalized gene-family transcription values
(RNA-seq).
--strain_hit_genes_perc GENEHIT_PERC_PER_STRAIN
Write overlap of gene-families between samples-strains
and reference genomes.
--i_cov INPUT_COV_MATRIX
Read coverage matrix (option --o_cov) for re-analysis
using other thresholds
--num_genomes INPUT_COV_GENOMES
In addition to option --i_cov: number of reference
genomes
--genome_avg_length INPUT_COV_LENGTH
In addition to option --i_cov: average number of gene-
families
--func_annot FUNC_ANNOT_FILE
File mapping UniRef IDs to GO/KEGG/... annotation for
functional characterization
-f ANNOT_FIELD, --field ANNOT_FIELD
Field in the annotation file that must be added to the
presence/absence matrix
--add_strains Add reference genomes to gene-family presence/absence
matrix.
--interactive Plot coverage curves to screen, and not to a file.
--verbose Display progress information.
-v, --version Prints the current PanPhlAn version and exits.
Depending on the sample coverage depth and expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. This includes a strain minimum coverage and settings for accepted shapes of strain abundance curves.
--min_coverage 5 --left_max 1.18 --right_min 0.82
(very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75
(default)
--min_coverage 1 --left_max 1.70 --right_min 0.30
(very sensitive)
For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
All gene-families larger than 0.5 are considered as present (1), and lower as absent (0).
In-vivo transcriptional activity of individual strains of a species
When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In a first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA samples and converted into log and median normalized RNA/DNA ratios.
PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample-ID's from the same biological sample. The pairs are defined by a tab-separated text file in which the first column specifies the DNA sample-ID and the second column the corresponding RNA sample-ID, both without file extension.
cat sample_pairs_DNA_RNA.txt
# DNA RNA
sampleA sampleA_RNA
sampleB sampleB_RNA
sampleC sampleC_RNA
sampleD sampleD_RNA
For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
→ ecoli16
Escherichia coli (version 2016)
Both DNA and RNA samples are processed in the same way by panphlan_map.py but saved in different folders.
./panphlan_map.py -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA
./panphlan_map.py -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA
Now, we have mapping results of both DNA and RNA which are required in the next step for extracting gene-family transcript profiles.
ls DNA/
sampleA_ecoli16.csv.bz2
sampleB_ecoli16.csv.bz2
sampleC_ecoli16.csv.bz2
sampleD_ecoli16.csv.bz2
ls RNA/
sampleA_RNA_ecoli16.csv.bz2
sampleB_RNA_ecoli16.csv.bz2
sampleC_RNA_ecoli16.csv.bz2
sampleD_RNA_ecoli16.csv.bz2
Now, panphlan_profile.py
can be used to extract the transcription rates of strain-specific genes.
./panphlan_profile.py -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv
As input we provide the mapping results of DNA and RNA samples and the sample-pair text-file that defines which DNA and RNA sample belong together.
As result we get a table of transcription rates for each pangenome gene-family for each sample. Gene-families not present in the sample specific strain are marked as NP (not present). Gene-families that could not clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN includes multi-copy genes as well as genes at the borderline between presence and absence.
Normalized transcription rates are positive values centered at 1. A low values smaller than 1 refers to a low transcriptional activity, a large value bigger than 1 refers to a high transcriptional activity relative to all other gene-family activities (median normalized).
Example result of transcription rates:
cat results_transcription_rates.csv
sample_A sample_B sample_C sample_D
g000002 0.814 0.779 0.000 NaN
g000003 0.982 0.770 1.183 0.000
g000004 0.863 0.917 1.219 NaN
g000005 NP 0.000 0.000 NaN
g000006 0.773 0.000 NP 0.000
See also: Error: RuntimeWarning: invalid value encountered in double_scalars
PanPhlAn is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.
- PanPhlAn 3.0
- PanPhlAn 1.3