-
Notifications
You must be signed in to change notification settings - Fork 6
PanPhlAn profiling
panphlan_profile.py
is used to merge and process the panphlan_map.py
results for getting the final gene presence/absence profiles of detected strains in samples, or for extracting the transcriptional activity of individual strains based on DNA & RNAseq pairs of the same sample.
Example:
./panphlan_profile.py -c erectale -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains
-
-c CLADE_NAME
to specify the clade or species database-name; PanPhlAn will search for a file namedpanphlan_CLADE_NAME_pangenome.csv
-
-i
input directory countaining thepanphlan_map.py
results -
--add_strains
for adding also gene-family presence/absence profiles of reference genomes -
--o_covplot FILENAME
in order to generate the coverage curves figure -
-func_annot FILENAME
if annotation must be added on the gene. The provided file must map gen-families name to annotation data (UniRef50, KEGG ID, GO...). If several are present in the file--field
must also be detailed.
-
--o_dna
final result of gene-family presence/absence profiles of all detected strains. It consist of a.csv
file binary profile matrix. Gene-families are marked 1 when present and 0 when absent.
Example of a gene-familiy profile table in result file result_gene_presence_absence.csv
(output option --o_dna)
sample01 sample04 sample05 sample08
g00001 1 0 1 0
g00002 0 1 1 1
g00003 0 0 0 1
g00003 1 1 1 1
The presence/absence matrix can be used in mathematical/statistical software (R, Python, Matlab) to visualize similarities between strains by heatmaps or PCoA plots; for investigating which gene-families are present in same strains, but not in others; and for finding potential relations of diseases associated to the presence of specific genes.
./panphlan_profile.py -h
-h, --help show this help message and exit
-i INPUT_DNA_FOLDER, --i_dna INPUT_DNA_FOLDER
Input directory of panphlan_map.py results, containing
SAMPLE.csv.bz2 files
--i_bowtie2_indexes INPUT_BOWTIE2_INDEXES
Input directory of bowtie2 indexes
-c CLADE_NAME, --clade CLADE_NAME
Panphlan species/clade database (e.g.: ecoli16)
-o OUTPUT_FILE, --o_dna OUTPUT_FILE
Write gene family presence/absence matrix:
gene_presence_absence.csv
--i_rna INPUT_RNA_FOLDER
RNA-seq: input directory of RNA mapping results
SAMPLE_RNA.csv.bz2
--sample_pairs DNA_RNA_MAPPING
RNA-seq: list of DNA-RNA sequencing pairs from the
same biological sample.
--th_zero MINIMUM_THRESHOLD
Gene family presence/absence threshold: lower are non-
present gene families.
--th_present MEDIUM_THRESHOLD
Gene family presence/absence threshold: higher are
present gene families.
--th_multicopy MAXIMUM_THRESHOLD
Gene family presence/absence threshold: higher are
multicopy gene families.
--min_coverage MIN_COVERAGE_MEDIAN
Minimum coverage threshold, default: 2X
--left_max LEFT_MAX Strain presence/absence plateau curve threshold: left
max [1.25]
--right_min RIGHT_MIN
Strain presence/absence plateau curve threshold: right
min [0.75]
--rna_max_zeros RNA_MAX_ZEROES
Max accepted percent of zero coveraged gene-families
(default: <10 %).
--rna_norm_percentile RNA_NORM_PERCENTILE
Percentile for normalizing RNA/DNA ratios
--strain_similarity_perc SIMILARITY_PERCENTAGE
Minimum threshold (percentage) for genome length to
add a reference genome to presence/absence matrix
(default: 50).
--np NON_PRESENCE_TOKEN
User-defined string to mark non-present genes. [NP]
--nan NOT_A_NUMBER_TOKEN
User-defined string to mark multicopy and undefined
genes. [NaN]
--o_covplot COV_PLOT_NAME
Filename for gene-family coverage plot.
--o_covplot_normed NOR_PLOT_NAME
Filename for normalized gene-family coverage plot.
--o_cov PANCOVERAGE_FILE
Write raw gene-family coverage matrix.
--o_idx DNA_INDEX_FILE
Write gene-family plateau definitions (1, -1, -2, -3)
--o_rna RNA_EXPRS_FILE
Write normalized gene-family transcription values
(RNA-seq).
--strain_hit_genes_perc GENEHIT_PERC_PER_STRAIN
Write overlap of gene-families between samples-strains
and reference genomes.
--i_cov INPUT_COV_MATRIX
Read coverage matrix (option --o_cov) for re-analysis
using other thresholds
--num_genomes INPUT_COV_GENOMES
In addition to option --i_cov: number of reference
genomes
--genome_avg_length INPUT_COV_LENGTH
In addition to option --i_cov: average number of gene-
families
--func_annot FUNC_ANNOT_FILE
File mapping UniRef IDs to GO/KEGG/... annotation for
functional characterization
-f ANNOT_FIELD, --field ANNOT_FIELD
Field in the annotation file that must be added to the
presence/absence matrix
--add_strains Add reference genomes to gene-family presence/absence
matrix.
--interactive Plot coverage curves to screen, and not to a file.
--verbose Display progress information.
-v, --version Prints the current PanPhlAn version and exits.
--strain_hit_genes_perc Overlap of gene-families between samples-strains and reference genomes
By adding the option --strain_hit_genes_perc result_genes_per_strain.csv
, we get the percentage of how many gene-families are common between detected strains and reference genomes. In our tutorial example, all detected strains share less than 70% genes, which shows that they are distinct form known reference genomes.
head result_genes_per_strain.csv
strainID number_of_genes SRS013951 SRS014459 SRS015065 SRS019161
CP001107 3420 58.4 60.8 61.2 56.3
FP929042 2879 68.8 67.4 67.6 67.2
FP929043 3182 66.7 61.1 61.5 64.7
Depending on the sample coverage depth and expected species abundance, different sensitivity thresholds for strain presence/absence detection can be chosen. This includes a strain minimum coverage and settings for accepted shapes of strain abundance curves.
--min_coverage 5 --left_max 1.18 --right_min 0.82
(very stringent)
--min_coverage 2 --left_max 1.25 --right_min 0.75
(default)
--min_coverage 1 --left_max 1.70 --right_min 0.30
(very sensitive)
For all samples in which a strain could be detected, we define presence and absence of gene-families based on a threshold on normalized coverage curves:
--th_present 0.5
All gene-families larger than 0.5 are considered as present (1), and lower as absent (0).
In-vivo transcriptional activity of individual strains of a species
When metagenomic and metatranscriptomic samples from the same specimen are available, PanPhlAn provides gene-specific transcription rates of individual strains in a sample. In a first step, the DNA sample is used to detect the strain-specific gene-family set. In the second step, transcription levels are extracted from the corresponding RNA samples and converted into log and median normalized RNA/DNA ratios.
PanPhlAn requires a list of paired metagenomic (DNA) and metatranscriptomic (RNA) sequencing sample-ID's from the same biological sample. The pairs are defined by a tab-separated text file in which the first column specifies the DNA sample-ID and the second column the corresponding RNA sample-ID, both without file extension.
cat sample_pairs_DNA_RNA.txt
# DNA RNA
sampleA sampleA_RNA
sampleB sampleB_RNA
sampleC sampleC_RNA
sampleD sampleD_RNA
For example, to get transcriptional activities for Escherichia coli strains, we need to download the related pangenome database:
→ ecoli16
Escherichia coli (version 2016)
Both DNA and RNA samples are processed in the same way by panphlan_map.py but saved in different folders.
./panphlan_map.py -c ecoli16 -i Samples/sampleA.tar.gz -o DNA/sampleA
./panphlan_map.py -c ecoli16 -i Samples/sampleA_RNA.tar.gz -o RNA/sampleA_RNA
Now, we have mapping results of both DNA and RNA which are required in the next step for extracting gene-family transcript profiles.
ls DNA/
sampleA_ecoli16.csv.bz2
sampleB_ecoli16.csv.bz2
sampleC_ecoli16.csv.bz2
sampleD_ecoli16.csv.bz2
ls RNA/
sampleA_RNA_ecoli16.csv.bz2
sampleB_RNA_ecoli16.csv.bz2
sampleC_RNA_ecoli16.csv.bz2
sampleD_RNA_ecoli16.csv.bz2
Now, panphlan_profile.py
can be used to extract the transcription rates of strain-specific genes.
./panphlan_profile.py -c ecoli16 --i_dna DNA/ --i_rna RNA/ --sample_pairs sample_pairs_DNA_RNA.txt --o_rna results_transcription_rates.csv
As input we provide the mapping results of DNA and RNA samples and the sample-pair text-file that defines which DNA and RNA sample belong together.
As result we get a table of transcription rates for each pangenome gene-family for each sample. Gene-families not present in the sample specific strain are marked as NP (not present). Gene-families that could not clearly defined as present only in the specific strain of a sample are marked as NaN (missing value). NaN includes multi-copy genes as well as genes at the borderline between presence and absence.
Normalized transcription rates are positive values centered at 1. A low values smaller than 1 refers to a low transcriptional activity, a large value bigger than 1 refers to a high transcriptional activity relative to all other gene-family activities (median normalized).
Example result of transcription rates:
cat results_transcription_rates.csv
sample_A sample_B sample_C sample_D
g000002 0.814 0.779 0.000 NaN
g000003 0.982 0.770 1.183 0.000
g000004 0.863 0.917 1.219 NaN
g000005 NP 0.000 0.000 NaN
g000006 0.773 0.000 NP 0.000
See also: Error: RuntimeWarning: invalid value encountered in double_scalars
PanPhlAn is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.
- PanPhlAn 3.0
- PanPhlAn 1.3