Assemblies, raw data and scripts from Peris et al 2023.
Authors: David Peris, Emily J. Ubbelohde, Meihua Christina Kuang, Jacek Kominek, Quinn K. Langdon, Marie Adams, Justin A. Koshalek, Amanda Beth Hulfachor, Dana A. Opulente, David J. Hall, Katie Hyma, Justin C. Fay, Jean-Baptiste Leducq, Guillaume Charron, Christian R. Landry, Diego Libkind, Carla Gonçalves, Paula Gonçalves, José Paulo Sampaio, Qi-Ming Wang, Feng-Yan Bai, Russel L. Wrobel, Chris Todd Hittinger
Journal: Nature Communications 16(1): 690.
Year: 2023
Abstract: Species is the fundamental unit to quantify biodiversity. In recent years, the model yeast Saccharomyces cerevisiae has seen an increased number of studies related to its geographical distribution, population structure, and phenotypic diversity. However, seven additional species from the same genus have been less thoroughly studied, which has limited our understanding of the macroevolutionary leading to the diversification of this genus over the last 20 million years. Here, we report the geographies, hosts, substrates, and phylogenetic relationships for approximately 1,800 Saccharomyces strains, covering the complete genus with unprecedented breadth and depth. We generated and analyzed complete genome sequences of 163 strains and phenotyped 128 phylogenetically diverse strains. This dataset provides insights about genetic and phenotypic diversity within and between species and populations, quantifies reticulation and incomplete lineage sorting, and demonstrates how gene flow and selection have affected traits, such as galactose metabolism. These findings elevate the genus Saccharomyces as a model to understand biodiversity and evolution in microbial eukaryotes.
Request information for Saccharomyces strains used in this study can be found in Supplementary Data 1.
The Illumina sequencing reads for this project has been deposed in the Short Reads Archive (SRA) under bioproject PRJNA475869
New nuclear, mitochondrial and genome assemblies for different Saccharomyces lineages assembled by using Paired-End and Mate Pair Illumina libraries. FIles were deposited to the European Nucleotide Archive under bioproject PRJEB48264.
BLAST server: Sac2.0 BLAST server
Strain | Species | Pop | ENA Accession no |
---|---|---|---|
yHAB335 | S.cerevisiae | CHN IV | GCA_918239655 |
yHDPN23 | S.paradoxus | Am-B | GCA_918221835 |
yHDPN24 | S.paradoxus | Am-C | GCA_918281175 |
yHDPN32 | S.paradoxus | EU/Am-A | GCA_918265645 |
IFO1815 | S.mikatae | Asia-A | GCA_918250775 |
yHAB336 | S.mikatae | Asia-B | GCA_918272125 |
IFO1802 | S.kudriavzevii | Asia-A | GCA_918252685 |
IFO1803 | S.kudriavzevii | Asia-B | GCA_918257225 |
yHKS300 | S.kudriavzevii | EU | GCA_918282095 |
ZP591 | S.kudriavzevii | EU | GCA_918268175 |
ZP960 | S.arboricola | Oceania | GCA_918268255 |
CDFM21L.1 | S.eubayanus | HOL | GCA_918271895 |
yHAB94 | S.eubayanus | PA-A | GCA_918265725 |
yHCT69 | S.eubayanus | PA-B | GCA_918298305 |
yHCT99 | S.eubayanus | PA-A | GCA_918246195 |
yHRVM108 | S.eubayanus | HOL | GCA_918265445 |
CBS7001 | S.uvarum | HOL-EU | GCA_918261815 |
yHAB60 | S.uvarum | HOL/SA-A | GCA_918251765 |
yHAB447 | S.uvarum | HOL/SA-A | GCA_918298155 |
yHAB482 | S.uvarum | HOL/SA-A | GCA_918264245 |
yHAB521 | S.uvarum | SA-B | GCA_918280685 |
yHCT77 | S.uvarum | HOL-NA | GCA_918221285 |
ZP964 | S.uvarum | Australasia | GCA_918269355 |
Corrections or genome assemblies used in the recent work. Files were compressed using tar, to uncompress them use tar -xvzf FILENAME
Strain | Species | Pop | Nuclear | mtDNA | 2 micron plasmid |
---|---|---|---|---|---|
NCYC3947 | S.jurei | EU | Naseebetal2018 | Naseebetal2018(Corrected) | Incomplete |
CBS10644 | S.arboricola | Asia-A | Liti et al 2013 | Suloetal2017 | In the nuclear genome |
Other genome assemblies from recent papers stored in Github: 2. Yue et al 2017: CBS432, DBVPG6044, DBVPG6765, N44, SK1, UFRJ50816, UWOPS03-461.4, UWOPS91-917.1, Y12, YPS128, YPS138
The rest of genome assemblies for Saccharomyces strains, using just Paired-End Illumina reads, are in the FigShare repository.
COX2 and COX3 sequences were deposited in GenBank under accession nos. MH813536-MH813939.
GAL genes that were Sanger sequenced were deposited in GenBank under accession nos. OL660614-OL660618.
Raw data, alignments and results were deposited in Figshare. The repository contains:
-
Alignments of 2um plasmid genes REP1 and REP2 used in Supplementary Figure 17.
-
ASTRAL run and outputs to generate Figure 4a.
-
MrBayes and BUCKy raw data and results to generate Figure 4a, Supplementary Figure 13.
-
Single copy ortholog gene and protein alignments annotated using BUSCO for animals and Saccharomyces and used to generate the Figure 3c.
-
GAL/MEL pathway genes and their encoded protein alignments, together with the IQTree log and Maximum Likelihood tree files. This dataset was used to generate Figure 6a, Supplementary Figure 27,28.
-
Alignemtns of annotated genes and their proteins with Yeast Genome Annotation Pipeline (YGAP) and IQtree log and Maximum Likelihood tree files. This dataset was used to generate Figure 4, Supplementary Figure 11, 12, 14.
-
Individual mitochondrial gene alignments used to generate Figure 2, Supplementary Figure 4,. COX2 and COX3 Sanger sequenced alignments to generate Figure 2, Supplementary Figure 3.
-
iWGS genome assemblies for those strains with just Paired-end Illumina reads.
-
Variant Call Format files necessary for population genomic analyses resulting in Figure 3a,b, Supplementary Figure 9, 10, 15.
-
Optican density 600nm values, 96-well plate pictures and additional information for each media condition run necessary to generate Figure 5, 6, Supplementary Figure 18-27, 29.
-
The rest of genome assemblies for Saccharomyces strains.
For in-house scripts, please check instructions by calling reading the script with a text editor (my first scripts) or by running python script.py --help
for the new scripts.
If it is not specified, python scripts were run with python2 version 6+
[]: replace with the corresponding option or indication
[OPTION]: when help is checked you can decide what to use here.
[INPUTFile]: input file
[INPUTF]: path to the input folder containing illumina reads
[OUTPUTF]: output folder to store results
[READ]: read name
[FQEXTENSION]: extension name of the illumina read
python download_SRA_serially.py [INPUTFile] [OUTPUTF] YES
python SRA_rename_v1.4.py [INPUTF] [FQEXTENSION] [INPUTF] [OUTPUTF]
Paired-ended illumina reads
fastqc -o [OUTPUTF] -f fastq [READ]_1.fq
java -jar trimmomatic-0.33.jar PE -threads 4 -phred33 -trimlog analytics.txt [READ]_1.fq [READ]_2.fq [READ]_1.tm.fq [READ]_1.tm-se.fq [READ]_2.tm.fq [READ]_2.tm-se.fq ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 TRAILING:3 MINLEN:25
fastqc -o [OUTPUTF] -f fastq [READ]_1.tm.fq
Mate pair illumina reads
nextclip --input_one [READ]_1.fq --input_two [READ]_2.fq --output_prefix trimmed --log log.txt --trim_ends 5
[INPUTF]: path to the input folder containing a strain folder with the PE illumina reads for that strain
[KMER]: Number of kmers to be used
[RUNNAME]: output file name
python2.7 AAF_pipeline-kmer.py -i [INPUTF] -o [RUNNAME] -k [KMER]
Convert the generated infile file to MEGA format and open it in MEGA to generate a Neighbor-Joining phylogenetic tree.
[NAME].ctl: describe the information about PE and MP illumina reads, assemblers to use, checkpoints to perform and quality assessment of the genome assemblies (example included in the scripts folder STRAINNAME.ctl)
XXXXreads2iWGS2Assembly.py
./iWGS -s [NAME].ctl --Real
XXXXqualimap
[GENOMEANAME]: genome assembly name
[INPUTF]: input folder
[INPUTFile]: input file
python runMummer_serially.py [INPUTFile]
Generated coords file is useful in the next step
Rscript --vanilla reorderContigsUsingNucmer.R [GENOMEANAME].coords [GENOMEANAME].scaffolds.fa
A file called remap.coords.result.csv is generated with the information about scaffolds mapping to chromosomes and if they were reverse complemented. This is useful information for manual curation in Geneious.
python combining_orderedScaffolds.py [INPUTF]
python generating_FinalAssembly.py -l 10000 -i [INPUTF]
Some strains will need manual curation in Geneious. After Geneious curation, run mummer for confirmation.
python runMummer_serially.py [INPUTFile]
#Correction step using pilon
python mapping-and-pilon_serially.py -i [INPUTFile] -t [OPTION] -q [OPTION] -e [OPTION]
[INPUTName]: name of the input
[INPUTF]: path to the input folder containing genome assemblies
[INPUTFile]: input file
[OUTPUTF]: output folder to store results
[RUNNAME]: output file name
[REFGENOME]: the reference genome where reads will be mapped
python seq_gc_count.py [INPUTF] [OUTPUTF] --window 25000 --window_print --min_contig 20000
python seq_gc_plot.py [OUTPUTF] --window 25000 --out test.pdf --out_stats [RUNNAME].txt --min_contig 20000 --plot_len
python infoseqout_v1.1.py -o [OUTPUTF] -f [INPUTFile]
python quast.py --fast -t 8 -o [RUNNAME] --eukaryote -R [REFGENOME] --gage -l [INPUTName] [INPUTFile]
[INPUTF]: path to the input folder containing SPAdes genome assemblies
[INPUTFile]: input file
[OUTPUTF]: output folder to store results
[INPUTf]: list with the sequence IDs (i.e. strain names) to be retained
python infoseqout_v1.1.py -o [OUTPUTF] -f [INPUTFile]
python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF]
#In the table look for a scaffold with a size between 40-90 Kbp
#Annotate with [MFannot](https://megasun.bch.umontreal.ca/cgi-bin/mfannot/mfannotInterface.pl "MFannot")
#Convert sequin output to genbank with Sequin v16.0. Open genbank file in Geneious
#Sort the fasta sequence to start with trnS(tga), export fasta and reannotate in MFannot. Repeat previous step.
#Extract gene sequences and make multisequence alignments
[INPUTFile]: input file
python runMummer_serially.py [INPUTFile]
[FASTA]: input fasta file with target gene sequences, follow HybPiper target fasta format
[INPUTFile]: strains and read paths to explore
[PATH2HybPiper]: path to HybPiper folder with HybPiper scripts
[OUTPUTF]: output folder to store results
[iFASTAln] :input fasta alignment
[OUTPUTFile]: output file
[BLASTF]: folder with stored blast databases
[BLASTl]: list with name databases to use
[ASSEMBLY]: assembly or text file with a list of assemblies
#Blast
mkblastdb_v5.1.py -a [ASSEMBLY] -o [BLASTF]
run_blast_db_v3.0.py -i [FASTA] -o [OUTPUTF] -b [BLASTl] -d [BLASTF]
#or HybPiper
python HybPiper_serially_v2.py -t [FASTA] -i [INPUTFile] -y [PATH2HybPiper] -o [OUTPUTF] -s [OPTION] -p [OPTION] -d [OPTION] --threads [OPTION]
trimal -in [iFASTAln] -out [OUTPUTFile] -htmlout summary_stats -mega -gt 0.7
trimal -in [iFASTAln] -out [OUTPUTFile] -htmlout summary_stats -fasta -gt 0.7
#or blastp in Geneious R6
[INPUTF]: folder with assemblies
[OUTPUTF]: output folder to store results
[iFASTAln] :input fasta alignment
[OUTPUTFile]: output file
python runBUSCO_serially.py -i [INPUTF] -t [OPTION] -o [OUTPUTF]
python busco_get_single_genes_nt.py [OUTPUTF] --out single_genes_alignment_all
python rename_names_fasta-busco.py [OUTPUTF] -o [OUTPUTF2]
MAFFT_serially.py -i [OUTPUTF2] -o [OUTPUTF3] -t [OPTION]
trimal -in [iFASTAln] -out [OUTPUTFile] -fasta -gt 0.9
#in folder where mafft alignments are stored, run FASconCAT.pl to concatenate
perl FASconCAT.pl -s
[INPUTFile]: list of assemblies to use
[INPUTf]: input file
[INPUTF]: folder with assemblies and gff
[INPUTFfasta]: folder with the rename files or gene/protein sequences
[OUTPUTF]: output folder to store results
[FASTAf]: list of fasta to parse
python prepareBUSCO_v2.py -i [INPUTFile] -t [OPTION] -l BUSCO_Tirant-layout.sh -s eukaryota_odb10 -d [OPTION]]
genome2genes_combineGenes.py -i [INPUTf] -f [INPUTFfasta] -o [OUTPUTF]
MAFFT_pal2nal_serially_v1ScarcityCONDOR.py -s [OUTPUTF] -g [GENENAME] #It was run in a CONDOR script to submit multiple alignments at the same time
#Get some statistics
python getBUSCO_statsTirant.py -f [OUTPUTF] -i [INPUTf] -s eukaryota_odb10
python checkLength_OriginalvsTrimmed.py -i genes_trimmed/ -o genes_aligned/ -a [OPTION] -t [OPTION]
#Continue BUSCO pipeline
python pull_buscoTirant.py -d eukaryota_odb10 -l [FASTAf] -o [OUTPUTF] -s [OPTION]
Genome annotation with YGAP
[INPUTf]: input file
[INPUTF]: folder with assemblies and gff
[OUTPUTF]: output folder to store results
[INPUTFfasta]: folder with the rename files for the strains
YGAP_down_serially.py -i [INPUTf] -o [OUTPUTF]
#Convert Genbank file to gff in Geneious
#Paralogue gene annotation > final gff file
#Extract individual genes
python genome2genes_v1.2.py -i [INPUTF] -a [OPTION] -c [OPTION] #includes the step to use process_gff_cds_proteins.pl
genome2genes_combineGenes.py -i [INPUTf] -f [INPUTFfasta] -o [OUTPUTF]
MAFFT_pal2nal_serially_v1ScarcityCONDOR.py -s [OUTPUTF] -g [GENENAME] #It was run in a CONDOR script to submit multiple alignments at the same time
#Percentage amino acid identity in R as indicated in the manuscript.
[INPUTf]: list with the sequence IDs (i.e. strain names) to be retained
[INPUTF]: folder with gene/protein trimmed sequence alignments
[OUTPUTF]: output folder to store results
[INPUTN]: input name
[OUTPUTN]: output name
python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF]
#CONDOR script to parallelize multiple IQTree runs that contains the next code line
iqtree -s $(iGENE)_nt_tr_fil.fas -bb 1000 -wbt -nt AUTO -seed 225494 -st DNA -m TEST
#ASTRAL pipeline
cat results/*.treefile > YGAP_ML_best.trees
java -jar astral.5.7.7.jar -i YGAP_ML_best.trees -o Saccharomyces_species_pp.tre 2> [RUNNAME].log
#Add the CFs
iqtree -t Saccharomyces_species_pp.tre --gcf YGAP_ML_best.trees --prefix gene_concordance
#RAxML with the SNP dataset generated by mapping reads to reference (see below)
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -f d -m ASC_GTRGAMMA --asc-corr=stamatakis -p 54877 -s [INPUTN]_AllGenomes_SNPs.fasta -#100 -n [OUTPUTN] -w [OUTPUTF]/best_tree/ -q part
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -m ASC_GTRGAMMA --asc-corr=stamatakis -#1000 -b 12345 -s [INPUTN]_AllGenomes_SNPs.fasta -n [OUTPUTN] -t [OUTPUTF]/best_tree/RAxML_bestTree.SNP -w [OUTPUTF]/boots/ -q part
raxmlHPC-PTHREADS-SSE3 -T 4 --no-bfgs -p 12345 -m ASC_GTRCAT --asc-corr=stamatakis -f b -t [OUTPUTF]/best_tree/RAxML_bestTree.SNP -n [OUTPUTN] -z [OUTPUTF]/boots/RAxML_bootstrap.SNP -w [OUTPUTF]/bipartitions/ -q part
#Phylonetwork, just open your fasta file in SplitsTree
#Supernetworks/Galled Trees/ SuperTrees, open your collection of phylogenetic trees in SplitsTree or Dendoscrope
[iFASTAln] :input fasta alignment
[oFASTAln] :output fasta alignment
[Hapf]: haplotype file generated by DnaSP
[OUTPUTf]: output file
[INPUTf]: input file
[InfoStrain]: table with information about strains and traits for PopART
perl fastaSortByName.pl [iFASTAln] > [oFASTAln]
#Assign haplotype numbers in DnaSPv5
python NexusHap2Table.py -i [Hapf] -o [OUTPUTf]
Rscript Hap2Freq2Nex.R -i [INPUTf] -s [InfoStrain] -c [OPTION] -o [OUTPUTf]
#Add the new generate information to a nexus file for preparing the file for PopART
#Run PopART TCS method
[INPUTF]: folder with sequence alignments
[INPUTb]: folder with *.in files
[OUTPUTF]: output folder to store results
[OUTGROUP]: strain name of the outgroup to root the tree
[#GENES2PICK]: number of genes to pick
python genes2bucky_v1.0.py -i [INPUTF] -o [OUTPUTF] -l [OPTION] -m MrBayes_param.txt -g [OUTGROUP] -r [OPTION] -t [OPTION] -n [OPTION] > [LOGFILE].txt
bucky -a 1 -k 3 -n 100000 -c 2 -o [OUTPUTF] --calculate-pairs --create-joint-file --create-single-file [INPUTb]/*.in
#You want to randomize which gene you pick for bucky analyses or computational expensive programs
python ~/software/scripts/bucky_RandomPick_v1.0.py -s [#GENES2PICK] -a 1 -i [INPUTF] -o [OUTPUTF]
[INPUTf]: input file
[INPUTF]: input folder
[REFGENOME]: the reference genome where reads will be mapped
[OUTPUTCov]: mapping4SNP_v2.0_multipleMaps output file with coverage
chromosome.txt: a list of the chromosome names (example included in scripts folder)
[OUTPUTf]: output file
[OUTPUTF]: output folder
pairwiseDivFile_chromosome.txt: tab tabulated file with information of Chromosome name, size and position of start in a concatenated format (example included in scripts folder)
python mapping4SNP_v2.0_multipleMaps.py -i [INPUTf] -t [OPTION] -q [OPTION] -r [REFGENOME] -m [OPTION] -W [OPTION] -p [OPTION] #It runs multiple scripts and programs (bwa, samtools, picard, GTAK, genomeCoverageBed)
python filter_fasta_v3.py -i [INPUTf] -o [OUTPUTF] -p [OPTION] -f [INPUTF] #In case you want to remove strains run in mapping4SNP_v2.0_multipleMaps.py
#FASTA alignments split by chromosomes were processed as follows:
python splitFASTAintoWindows.py -m chromosome.txt -W [OPTION] -i [INPUTF] -o [OUTPUTF] -t [OPTION] -T [OPTION]
[INPUTf]: input file
[OUTPUTf]: output file
[INPUTF]: input folder
[OUTGROUP]: strain name of the outgroup to root the tree
[OUTPUTN]: output name
[OUTPUTF]: output folder
[REFGENOME]: the reference genome where reads will be mapped
[InfoStrain]: table with information about strains and population designation
[QAF]: folder where qaf.tab is stored
pairwiseDivFile_chromosome.txt: tab tabulated file with information of Chromosome name, size and position of start in a concatenated format (example included in scripts folder)
[VCF]: merged vcf generated by ConvertOutputs_Mapping2GenomPopAnalysisTools.py
[INPUTN]: input name
#Combined fasta (from mapping4SNP_v2.0_multipleMaps.py) was trimmed with trimal -gt 0.9 and opened in MEGA v7 to get diversity stats and generate a fasta just for variant sites.
#Variant sites were opened in DnaSP v5 to get polymorphism statistics
#Diversity stats using PopGenome
python check_alignments4PopGenome.py -i [INPUTF] #Input folder contains the fasta splited by windows
trimal -in [INPUTf] -out [OUTPUTf] -gt 0.3
Rscript pairwiseDiversity_validSitesComp_v2.1.R [INPUTF] pairwiseDivFile_chromosome.txt [InfoStrain] [OUTPUTN] #Input folder contains the fasta splited by windows
#Convert results from mapping4SNP_v2.0_multipleMaps.py to other formats
#STRUCTURE
python ConvertOutputs_Mapping2GenomPopAnalysisTools.py -i [INPUTf] -t [OPTION] -r [REFGENOME] -o [OUTPUTN] -s [OUTGROUP] -f [INPUTF] -W [OPTION] -R [OPTION]
python structure_serially.py -k [OPTION] -r [OPTION] -o [OUTPUTF] -m [OPTION] -e [OPTION]
#FineSTRUCTURE
python FASTAtoFineStr_v2.0.py -i [INPUTf] -o [OUTPUTN] -s [InfoStrain] -c pairwiseDivFile_chromosome.txt -r Scer_recombinationRate.csv -e [OPTION]
fs [INPUTN]_AllGenomes_fs.cp -idfile [INPUTN]_AllGenomes_fineStr.idfile -phasefiles [INPUTN]_AllGenomes_fineStr.phase -recombfiles [INPUTN]_AllGenomes_fineStr.recomb -ploidy 1 -go
fs -X -Y -e X2 [INPUTN]_AllGenomes_fs_linked_hap.chunkcounts.out [INPUTN]_AllGenomes_fs_linked_hap_tree_run0.xml [INPUTN]_AllGenomes_fs_linked_hap_mapstate.csv
fs -X -Y -e meancoincidence [INPUTN]_AllGenomes_fs_linked_hap.chunkcounts.out [INPUTN]_AllGenomes_fs_linked_hap_mcmc_run0.xml [INPUTN]_AllGenomes_fs_linked_hap_meancoincidence.csv
RScript fineStructure_plots_v2.0.R -p [OUTPUTN] -s [OPTION] -I [InfoStrain] -i [INPUTF] -N [OPTION]
#Convert to PCAdmix
python FASTAtoPCAdmix_v2.0.py -i [INPUTf] -p [OUTPUTN] -c pairwiseDivFile_chromosome.txt -s [InfoStrain] -r Scer_recombinationRate.csv
#Convert to PED format
python FASTAtoPED_v2.py -i [INPUTf] -o [OUTPUTN] -p [OPTION] -r Scer_recombinationRate.csv -s [InfoStrain] -c pairwiseDivFile_chromosome.txt
#Heterozygosity plots
Rscript snp_HTZplot_v1.r -i [QAF] -o [OUTPUTN]
python heterozigosity_distribution.py -i [INPUTF] -p pairwiseDivFile_chromosome.txt -f [OPTION]
Rscript snp_distrib_v2.r [OUTPUTN].qaf.tab pairwiseDivFile_chromosome.txt
#Convert to PLINK, PCAdmix, ADMIXTOOLS, Genesis in one script
python PopAnalysisFormatGenerator_v1.py -i [VCF] -o [OUTPUTN] -s [InfoStrain] -m [OPTION] -r [OPTION] -Q [OPTION] -c Scer_recombinationRate.csv