Skip to content

Automated reconstruction

Kun Huang edited this page Jan 25, 2021 · 4 revisions

Reconstructing ancient and modern microbial genomes based on a single reference genome

We will use Methanobrevibacter oralis as an example species to demonstrate how to build a whole-genome alignment comprising strains from ancient and contemporary metagenomics datasets.

Step 1. Data preparation

In this tutorial, metagenomics samples below have been detected M. oralis with a good abundance and sub-sampled for the rapid analysis. Please download them:

Metagenomic reads Genome sequences
ERR3003615 GCF_902384065
ERR3003647 GCF_900289035
ERR3003621 GCA_001639275
ERR3003644 GCF_000529525
ERR3003651
ERR3003619
ERR3003623
ERR3003622
ERR3003637
ERR3003646
ERR3003614
SRR6877288
ERR3003640
ERR3003632
SRR6877399
SRR6877339

Create a working directory called MetaClock :

$ mkdir MetaClock

Change the working directory into MetaClock:

$ cd MetaClock

Under working directory MetaClock, create a directory called ancient_metagenomes and a directory called modern_genome_assemblies :

$ mkdir ancient_metagenomes; mkdir modern_genome_assemblies

Store modern reference genomes GCF_000529525.fna,GCF_900289035.fna,GCF_902384065.fna in the directory modern_genome_assemblies; for preparing ancient samples, create a folder for each ancient metagenomics sample under directory ancient_metagenomes; modern reference GCA_001639275.fna will be used as reference to guide alignment.

Once data preparation is all completed: Firstly, assuming MetaClock is the current working directory, check if MetaClock has GCA_001639275.fna as alignment reference, ancient_metagenomes as ancient sample input, modern_genome_assemblies as modern sample input:

command line:

$ ls .

result:

GCA_001639275.fna ancient_metagenomes modern_genome_assemblies

Secondly, check modern sample input: command line:

$ ls modern_genome_assemblies

result:

GCF_000529525.fna  GCF_900289035.fna  GCF_902384065.fna

Thirdly, check ancient sample input: command line:

$ ls ancient_metagenomes

result:

ERR3003614  ERR3003619  ERR3003622  ERR3003632  ERR3003640  ERR3003646  ERR3003651  SRR6877339
ERR3003615  ERR3003621  ERR3003623  ERR3003637  ERR3003644  ERR3003647  SRR6877288  SRR6877399

*Note: Each metagenomics sample folder can contain multiple FASTQ files from different libraries and possibly in three different file types (fastq.bz2, fastq.gz, fastq). For example, here sample ERR3003632 contains one fastq file compressed in bz2 form: command line:

$ ls ancient_metagenomes/ERR3003632

result:

ERR3003632.fastq.bz2

Step 2. Configuration file preparation Parameter-rich configuration is required for reconstructing whole genome alignment from metagenomics data sets using metaclock_mac. Template configuration file can be automatically generated using metaclock_mac_template_configs :

metaclock_mac_template_configs -h
usage: metaclock_mac_template_configs [-h] [-d DIRECTORY]

This script will generate a template configuration file for metaclock_mac. And
parameters can be tuned in users' need.

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --directory DIRECTORY
                        Specify the path of directory for storing
                        configuration file. default: current working
                        directory.

To display the content of configuration file:

cat config.json
[
  {
    "input_type":"reads",
    "reference_genome":"",
    "age_type":1,
    "intermediate":"",
    "parameter_set":{
    "search_report_mode":"-k,1",
    "bowtie2_threads":15,
    "minimum_mapping_quality":30,
    "minimum_mapping_length":30,
    "maximum_snp_edit_distance":0.04,
    "nproc":3,
    "minimum_coverage":5,
    "trim_distance":"5:5",
    "dominant_allele_frequency":0.8,
    "output_trimmed_reads":1,
    "samples":""
  }
 },
 {
  "input_type":"reads",
  "reference_genome":"",
  "age_type":2,
  "intermediate":"",
  "parameter_set":{
  "search_report_mode":"-k,1",
  "bowtie2_threads":15,
  "minimum_mapping_quality":30,
  "minimum_mapping_length":30,
  "maximum_snp_editrim_distanceance":0.04,
  "nproc":3,
  "minimum_coverage":5,
  "dominant_allele_frequency":0.8,
  "samples":""
   }
 },
 {
  "input":"contigs",
  "reference_genome":"",
  "intermediate":"",
  "parameter_set":{
  "homolog_length":500,
  "homolog_length":95,
  "homolog_identity":8,
  "samples":""
   }
 }
]

The detailed description for each parameter is available here

Here, we will use default configuration file , configs.json, already provided in the framework. But please be aware that users are free to customize parameters based on specific cases.

usage: metaclock_mac [-h] [-r REFERENCE] [-a_ipt ANCIENT_METAGENOMES]
                     [-m_ipt MODERN_METAGENOMES] [-g_ipt GENOME_ASSEMLIES]
                     [-o OUTPUT_DIR] [-int INTERMEDIATE_DIR] [-c] [-a]
                     [-snv_rate]
                     config_file

Reconstruct whole-genome-level MSA from large-scale datasets.

positional arguments:
  config_file           Input the configuration file.

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Input reference genome sequence in the fasta format.
  -a_ipt ANCIENT_METAGENOMES, --ancient_metagenomes ANCIENT_METAGENOMES
                        Input an ancient-metagenome folder containing sub-
                        folders, each sub-folder contains sequencing reads
                        from one ancient sample. [Support suffix of
                        .fastq/.fastq.gz/.fastq.bz2]
  -m_ipt MODERN_METAGENOMES, --modern_metagenomes MODERN_METAGENOMES
                        Input a modern-metagenome folder containing sub-
                        folders, each sub-folder contains sequencing reads
                        from one modern sample. [Support suffix of
                        .fastq/.fastq.gz/.fastq.bz2]
  -g_ipt GENOME_ASSEMLIES, --genome_assemlies GENOME_ASSEMLIES
                        Input a genome-assemblies folder containing assembled
                        genomes, each assembled genome is stored in a fasta
                        file. [Support FASTA format]
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Specify an output folder for storing results. Default:
                        [Mac_output] in the working directory.
  -int INTERMEDIATE_DIR, --intermediate_dir INTERMEDIATE_DIR
                        Specify an intermediate folder for storing
                        intermediate files. Default: [intermediates] in the
                        working directory.
  -c, --clean           Clean intermediate files and rerun from the beginning,
                        otherwise rerun from the intermediate files
  -a, --authentication  Autheticate the anicent origin of genomic information
                        used in genome reconstruction.
  -snv_rate, --SNV_rate
                        Estimate pairwise SNV rates between samples.

Step 3. Automated reconstruction of whole genome alignment from alternative metagenomics data sets

metaclock_mac is devised to reconstruct genomic information from either contemporary or ancient metagenomic datasets in the form of either sequencing reads or assembled genome contigs and to integrate all in one whole genome alignment for molecular clocking analysis. Here, we will just describe at length three common scenarios:

1.When only sequencing reads of ancient metagenome samples and assembled genomes from contemporary samples are provided:

metaclock_mac configs.json --reference GCA_001639275.fna  --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir mac_output --authentication --SNV_rate

2.When sequencing reads of both ancient and contemporary metagenome samples are provided (assembled genomes are not available):

metaclock_mac configs.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --modern_metagenomes <metagenomic reads from modern samples> --output_dir mac_output --authentication --SNV_rate

3.When sequencing reads of both ancient and contemporary metagenome samples and genome assemblies are all provided):

metaclock_mac configs.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --modern_metagenomes <metagenomic reads from modern samples> --genome_assemlies modern_genome_assemblies --output_dir mac_output --authentication --SNV_rate

metaclock_mac can address other less common situations, for example, 4) only genome assemblies are provided for either contemporary or ancient samples, 5)* only metagenomics reads are provided for either ancient or contemporary samples.

Step 4. Check results

1) Output Mac_opt.fna contains the genome alignment reconstructed and output Mac_stats.tsv contains the corresponding statistics.

To display first 10 lines of each reconstructed genome from input samples:

cat Mac_genome_MSA.fna | grep -A 10 '>'
>GCA_001639275.fna
TAAAAACATTAAATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTCGCAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACAGAAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGCTTAACTTTAA
TAATAAACACTAATTCCGTTTTCTAAATAGATTTTACGCTTATTATAAGATTTTTTAATA
ACATCATGAGAAAAACAATGATTTACAATGAGGATTCATGGATTACTCAATAACCCCATC
TTTGTTAATAAACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
TTTTTTTGTTTCATGTATAAATATTTGTCTTCCAACCTATATAAAACTTATTATTGCTCA
AATTAAAAAAATAATTACAAAAACAAGCTAAAAAAATAATTTAAGAAAAAAATCAACAAG
--
>ERR3003614.fna
---------TAAATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTC-CAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACA-AAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGC----------
---------CTAATTCCGTTTTCTAAATAGATTTTACGCTTATTATAAGATTT-TTAATA
A-ATC-------------------------------------------------------
----------AACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
-----T-GTTTCATGTATAAATATTTGTCTTCCAACCTATAT------------------
-----AA--AATAAT---------AAGCTAAAAAAATAATTTAAGAA-------------
--
>ERR3003615.fna
-----------AATGTCCAAGAAAGATTTCTTAATGAATCCCAACCATCTTTTTCGCAAC
TCAATTGCCTTGTTTTTCATTTTAGTTGAAAAATTACAGAAATCTTCATAAAATTCCTAT
TAAGCCTTACTTGACTATAATATTTCTACATTTCCTACAAAATATATCGCTTAACTTTAA
TAATA--------------TTTCTAAATAGATTTTACGCTTATTATAAGATTTTTTAATA
ACATCATGAGAAAAACAATGA-TTACAATGAGGATTCATGGATTACTCAATAACCCCATC
TTTGTTAATAAACGCAATCTTCATCATAGAATACCATCGATATACAAAAAAAATTGAACA
GTATTCATCACCAAAACTACATTGCACCCTTTCATCATCATTTTTTCTAACAAGATCAAC
ATTCGAAATAGTTAAATATATATTGGTTGCCATCCAAAAATAAGACATGGAAGACATTTG
TTTTTTTGTTTCATGTATAAATATTTGTCTTCCAACCTATATAAAACTTATTATTGCTCA
AATTAAAAAAATAATTACAAAAACAAGCTAAAAAAATAATTTAAGAAAAAAATCAACAAG
.
.
.
.
.

To display the basic statistics of reconstructed genomes:

cat Mac_stats.tsv
SeqHeader GapRatio  AlignLength SeqLength GCcontent
GCA_001639275.fna 0.0 2140433 2140433 0.28
ERR3003614.fna  0.09  2140433 1952882 0.28
ERR3003615.fna  0.06  2140433 2004796 0.28
ERR3003619.fna  0.08  2140433 1968027 0.28
ERR3003621.fna  0.07  2140433 1988501 0.28
ERR3003622.fna  0.1 2140433 1923942 0.28
ERR3003623.fna  0.47  2140433 1134052 0.27
ERR3003632.fna  0.08  2140433 1963576 0.28
ERR3003637.fna  0.1 2140433 1915882 0.28
ERR3003640.fna  0.07  2140433 1988140 0.28
ERR3003644.fna  0.08  2140433 1979625 0.28
ERR3003646.fna  0.1 2140433 1931364 0.28
ERR3003647.fna  0.4 2140433 1286735 0.29
ERR3003651.fna  0.03  2140433 2082972 0.28
SRR6877288.fna  0.1 2140433 1933158 0.28
SRR6877339.fna  0.42  2140433 1248461 0.3
SRR6877399.fna  0.31  2140433 1469709 0.3
GCF_900289035.fna 0.01  2140433 2127511 0.28
GCF_000529525.fna 0.02  2140433 2091372 0.28
GCF_902384065.fna 0.02  2140433 2091372 0.28

2) Anicent DNA damage patterns for authenticating all genomic information from ancient samples damage_pattern.png:

Ancient DNA damage pattern

3) Distribution of pairwise SNV rates among all samples snv_rates_distribution.png: Pairwise SNV rates distribution

4) Heatmap of pairwise SNV rates among all samples snv_rates_heatmap.png: Heatmap of pairwise SNV rates

Step 5. Tuning parameters by re-running metaclock_mac with updated configs.jsonfrom intermediate files

Parameters set are too strict or too relaxed? Don't worry. Just update parameters in configs.json and save it as configs_updated.json(or other new file names as you like) and rerun the command line without setting flag --clean:

metaclock_mac configs_updated.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir mac_output_updated --authentication --SNV_rate

metaclock_macwill re-run the reconstruction based on intermediates files (bowtie2 and blastn outputs) and save you a good deal of time.

If you want to re-run the whole analysis from the scratch:

metaclock_mac configs_updated.json --reference GCA_001639275.fna --ancient_metagenomes ancient_metagenomes --genome_assemlies modern_genome_assemblies --output_dir Mac_output_updated --clean --authentication --SNV_rate