-
Notifications
You must be signed in to change notification settings - Fork 0
Automated reconstruction
We will use Methanobrevibacter oralis as an example species to demonstrate how to build a whole-genome alignment comprising strains from ancient and contemporary metagenomics datasets.
Step 1. Data preparation
In this tutorial, metagenomics samples below have been detected M. oralis with a good abundance and sub-sampled for the rapid analysis. Please download them:
Metagenomic reads | Genome sequences |
---|---|
ERR3003615 | GCF_902384065 |
ERR3003647 | GCF_900289035 |
ERR3003621 | GCA_001639275 |
ERR3003644 | GCF_000529525 |
ERR3003651 | |
ERR3003619 | |
ERR3003623 | |
ERR3003622 | |
ERR3003637 | |
ERR3003646 | |
ERR3003614 | |
SRR6877288 | |
ERR3003640 | |
ERR3003632 | |
SRR6877399 | |
SRR6877339 |
Create a working directory called MetaClock
:
$ mkdir MetaClock
Change the working directory into MetaClock
:
$ cd MetaClock
Under working directory MetaClock
, create a directory called ancient_metagenomes
and a directory called modern_genome_assemblies
:
$ mkdir ancient_metagenomes; mkdir modern_genome_assemblies
Store modern reference genomes GCF_000529525.fna
,GCF_900289035.fna
,GCF_902384065.fna
in the directory modern_genome_assemblies
; for preparing ancient samples, create a folder for each ancient metagenomics sample under directory ancient_metagenomes
; modern reference GCA_001639275.fna
will be used as reference to guide alignment.
Once data preparation is all completed:
Firstly, assuming MetaClock
is the current working directory, check if MetaClock
has GCA_001639275.fna
as alignment reference, ancient_metagenomes
as ancient sample input, modern_genome_assemblies
as modern sample input:
command line:
$ ls .
result:
GCA_001639275.fna ancient_metagenomes modern_genome_assemblies
Secondly, check modern sample input: command line:
$ ls modern_genome_assemblies
result:
GCF_000529525.fna GCF_900289035.fna GCF_902384065.fna
Thirdly, check ancient sample input: command line:
$ ls ancient_metagenomes
result:
ERR3003614 ERR3003619 ERR3003622 ERR3003632 ERR3003640 ERR3003646 ERR3003651 SRR6877339
ERR3003615 ERR3003621 ERR3003623 ERR3003637 ERR3003644 ERR3003647 SRR6877288 SRR6877399
*Note:
Each metagenomics sample folder can contain multiple FASTQ files from different libraries and possibly in three different file types (fastq.bz2
, fastq.gz
, fastq
). For example, here sample ERR3003632
contains one fastq file compressed in bz2
form:
command line:
$ ls ancient_metagenomes/ERR3003632
result:
ERR3003632.fastq.bz2
Step 2. Configuration file preparation
Parameter-rich configuration is required for reconstructing whole genome alignment from metagenomics data sets using metaclock_mac
.
Template configuration file can be automatically generated using metaclock_mac_template_configs
:
metaclock_mac_template_configs -h
usage: metaclock_mac_template_configs [-h] [-d DIRECTORY]
This script will generate a template configuration file for metaclock_mac. And
parameters can be tuned in users' need.
optional arguments:
-h, --help show this help message and exit
-d DIRECTORY, --directory DIRECTORY
Specify the path of directory for storing
configuration file. default: current working
directory.
To display the content of configuration file:
cat config.json
[
{
"input_type":"reads",
"reference_genome":"",
"age_type":1,
"intermediate":"",
"parameter_set":{
"search_report_mode":"-k,1",
"bowtie2_threads":15,
"minimum_mapping_quality":30,
"minimum_mapping_length":30,
"maximum_snp_edit_distance":0.04,
"nproc":3,
"minimum_coverage":5,
"trim_distance":"5:5",
"dominant_allele_frequency":0.8,
"output_trimmed_reads":1,
"samples":""
}
},
{
"input_type":"reads",
"reference_genome":"",
"age_type":2,
"intermediate":"",
"parameter_set":{
"search_report_mode":"-k,1",
"bowtie2_threads":15,
"minimum_mapping_quality":30,
"minimum_mapping_length":30,
"maximum_snp_editrim_distanceance":0.04,
"nproc":3,
"minimum_coverage":5,
"dominant_allele_frequency":0.8,
"samples":""
}
},
{
"input":"contigs",
"reference_genome":"",
"intermediate":"",
"parameter_set":{
"homolog_length":500,
"homolog_length":95,
"homolog_identity":8,
"samples":""
}
}
]
The detailed description for each parameter is available here
Here, we will use default configuration file , configs.json
, already provided in the framework. But please be aware that users are free to customize parameters based on specific cases.
usage: metaclock_mac [-h] [-r REFERENCE] [-a_ipt ANCIENT_METAGENOMES]
[-m_ipt MODERN_METAGENOMES] [-g_ipt GENOME_ASSEMLIES]
[-o OUTPUT_DIR] [-int INTERMEDIATE_DIR] [-c] [-a]
[-snv_rate]
config_file
Reconstruct whole-genome-level MSA from large-scale datasets.
positional arguments:
config_file Input the configuration file.
optional arguments:
-h, --help show this help message and exit
-r REFERENCE, --reference REFERENCE
Input reference genome sequence in the fasta format.
-a_ipt ANCIENT_METAGENOMES, --ancient_metagenomes ANCIENT_METAGENOMES
Input an ancient-metagenome folder containing sub-
folders, each sub-folder contains sequencing reads
from one ancient sample. [Support suffix of
.fastq/.fastq.gz/.fastq.bz2]
-m_ipt MODERN_METAGENOMES, --modern_metagenomes MODERN_METAGENOMES
Input a modern-metagenome folder containing sub-
folders, each sub-folder contains sequencing reads
from one modern sample. [Support suffix of
.fastq/.fastq.gz/.fastq.bz2]
-g_ipt GENOME_ASSEMLIES, --genome_assemlies GENOME_ASSEMLIES
Input a genome-assemblies folder containing assembled
genomes, each assembled genome is stored in a fasta
file. [Support FASTA format]
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Specify an output folder for storing results. Default:
[Mac_output] in the working directory.
-int INTERMEDIATE_DIR, --intermediate_dir INTERMEDIATE_DIR
Specify an intermediate folder for storing
intermediate files. Default: [intermediates] in the
working directory.
-c, --clean Clean intermediate files and rerun from the beginning,
otherwise rerun from the intermediate files
-a, --authentication Autheticate the anicent origin of genomic information
used in genome reconstruction.
-snv_rate, --SNV_rate
Estimate pairwise SNV rates between samples.
Step 3. Automated reconstruction of whole genome alignment from alternative metagenomics data sets
metaclock_mac
is devised to reconstruct genomic information from either contemporary or ancient metagenomic datasets in the form of either sequencing reads or assembled genome contigs and to integrate all in one whole genome alignment for molecular clocking analysis. Here, we will just describe at length three common scenarios:
1.When only sequencing reads of ancient metagenome samples and assembled genomes from contemporary samples are provided:
metaclock_mac configs.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir mac_output --authentication --SNV_rate
2.When sequencing reads of both ancient and contemporary metagenome samples are provided (assembled genomes are not available):
metaclock_mac configs.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --modern_metagenomes <metagenomic reads from modern samples> --output_dir mac_output --authentication --SNV_rate
3.When sequencing reads of both ancient and contemporary metagenome samples and genome assemblies are all provided):
metaclock_mac configs.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --modern_metagenomes <metagenomic reads from modern samples> --genome_assemlies modern_genome_assemblies --output_dir mac_output --authentication --SNV_rate
metaclock_mac
can address other less common situations, for example, 4) only genome assemblies are provided for either contemporary or ancient samples, 5)* only metagenomics reads are provided for either ancient or contemporary samples.
Step 4. Check results
1) Output Mac_opt.fna
contains the genome alignment reconstructed and output Mac_stats.tsv
contains the corresponding statistics.
To display first 10 lines of each reconstructed genome from input samples:
cat Mac_genome_MSA.fna | grep -A 10 '>'
>GCA_001639275.fna
TAAAAACATTAAATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTCGCAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACAGAAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGCTTAACTTTAA
TAATAAACACTAATTCCGTTTTCTAAATAGATTTTACGCTTATTATAAGATTTTTTAATA
ACATCATGAGAAAAACAATGATTTACAATGAGGATTCATGGATTACTCAATAACCCCATC
TTTGTTAATAAACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
TTTTTTTGTTTCATGTATAAATATTTGTCTTCCAACCTATATAAAACTTATTATTGCTCA
AATTAAAAAAATAATTACAAAAACAAGCTAAAAAAATAATTTAAGAAAAAAATCAACAAG
--
>ERR3003614.fna
---------TAAATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTC-CAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACA-AAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGC----------
---------CTAATTCCGTTTTCTAAATAGATTTTACGCTTATTATAAGATTT-TTAATA
A-ATC-------------------------------------------------------
----------AACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
-----T-GTTTCATGTATAAATATTTGTCTTCCAACCTATAT------------------
-----AA--AATAAT---------AAGCTAAAAAAATAATTTAAGAA-------------
--
>ERR3003615.fna
-----------AATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTCGCAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACAGAAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGCTTAACTTTAA
TAATA--------------TTTCTAAATAGATTTTACGCTTATTATAAGATTTTTTAATA
ACATCATGAGAAAAACAATGA-TTACAATGAGGATTCATGGATTACTCAATAACCCCATC
TTTGTTAATAAACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
TTTTTTTGTTTCATGTATAAATATTTGTCTTCCAACCTATATAAAACTTATTATTGCTCA
AATTAAAAAAATAATTACAAAAACAAGCTAAAAAAATAATTTAAGAAAAAAATCAACAAG
.
.
.
.
.
To display the basic statistics of reconstructed genomes:
cat Mac_stats.tsv
SeqHeader GapRatio AlignLength SeqLength GCcontent
GCA_001639275.fna 0.0 2140433 2140433 0.28
ERR3003614.fna 0.09 2140433 1952882 0.28
ERR3003615.fna 0.06 2140433 2004796 0.28
ERR3003619.fna 0.08 2140433 1968027 0.28
ERR3003621.fna 0.07 2140433 1988501 0.28
ERR3003622.fna 0.1 2140433 1923942 0.28
ERR3003623.fna 0.47 2140433 1134052 0.27
ERR3003632.fna 0.08 2140433 1963576 0.28
ERR3003637.fna 0.1 2140433 1915882 0.28
ERR3003640.fna 0.07 2140433 1988140 0.28
ERR3003644.fna 0.08 2140433 1979625 0.28
ERR3003646.fna 0.1 2140433 1931364 0.28
ERR3003647.fna 0.4 2140433 1286735 0.29
ERR3003651.fna 0.03 2140433 2082972 0.28
SRR6877288.fna 0.1 2140433 1933158 0.28
SRR6877339.fna 0.42 2140433 1248461 0.3
SRR6877399.fna 0.31 2140433 1469709 0.3
GCF_900289035.fna 0.01 2140433 2127511 0.28
GCF_000529525.fna 0.02 2140433 2091372 0.28
GCF_902384065.fna 0.02 2140433 2091372 0.28
2) Anicent DNA damage patterns for authenticating all genomic information from ancient samples damage_pattern.png
:
3) Distribution of pairwise SNV rates among all samples snv_rates_distribution.png
:
4) Heatmap of pairwise SNV rates among all samples snv_rates_heatmap.png
:
Step 5. Tuning parameters by re-running metaclock_mac
with updated configs.json
from intermediate files
Parameters set are too strict or too relaxed? Don't worry. Just update parameters in configs.json
and save it as configs_updated.json
(or other new file names as you like) and rerun the command line without setting flag --clean
:
metaclock_mac configs_updated.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir mac_output_updated --authentication --SNV_rate
metaclock_mac
will re-run the reconstruction based on intermediates files (bowtie2 and blastn outputs) and save you a good deal of time.
If you want to re-run the whole analysis from the scratch:
metaclock_mac configs_updated.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir Mac_output_updated --clean --authentication --SNV_rate