Skip to content

Latest commit

 

History

History
executable file
·
68 lines (47 loc) · 7.73 KB

README.md

File metadata and controls

executable file
·
68 lines (47 loc) · 7.73 KB

Overview

This repository contains the code required for the Simulation component of the IMLS_GCCO marker comparison project, which is quantifying how measures of ex situ conservation differ between RADseq and microsatellite markers. Simulations using different marker types (microsatellites, or "MSAT", and SNP markers, or "DNA") are performed using the fastSimcoal2 software and the strataG R package.

RScripts

This folder contains all the R scripts used for our analyses. Scripts are numbered (alphanumerically) according to the order they're run in the workflow. The purpose of these scripts is summarized below.

Functions

Functions used repeatedly throughout the scripts are declared in 0_functions.R. These functions are roughly grouped according to their purpose in the workflow.

Run simulations

There are 3 scripts named _simParams_. Each of these scripts runs through the following steps, using the fastSimcoal simulation software and the R package strataG:

  1. declare simulation parameters,
  2. run simulations according to parameters,
  3. convert the Arelquin objects generated by simulations into objects files (for resampling steps; see below),
  4. save the genind objects containing simulation data to the disk. The script will also "clean up" the simulation outputs, removing some files and moving others to certain directories.

These steps populate the files found in the SimulationOutputs directory (see below). The genind files generated in Step 4 are read in at the beginning of the 3_resampleAndBootstrap.R script.

The 3 scripts correspond to 3 different analyses, as summarized below: - N1200: These simulations use a total population size (nInd) of 1200 individuals. MSAT and DNA markers are simulated - N4800: These simulations use a total population size (nInd) of 4800 individuals. MSAT and DNA markers are simulated - lowMutDNA: These simulations use a lower mutation rate (1e-8) than what's used in the DNA simulations (both N1200 and N4800) described above (1e-7). Both N1200 and N4800 total population sizes are simulated in this script.

Check simulations

The script 2_senseCheck.R performs different checks on the genind files generated from Step #1 (running simulations using fastSimcoal and strataG). These checks test our expectations of the simulation scenarios based on biology--for instance, that average Fst values will be greater for scenarios with lower migration rates between populations.

Perform resampling and loci bootstrapping

The script 3_resampleAndBootstrap.R is used to summarize the allele frequency distributions of each simulation scenario (for each marker type), and then randomly sample each object to model ex situ conservation. The script reads in the genind files saved to disk, and then using these files, generates resampling arrays, with the following dimensions

  • rows: number of randomly selected samples, for which allelic representation is calculated
  • columns: allelelic representation categories - Column 1: the Total allelic representation (all categories of alleles) - Columns 2--5: the representation values for alleles of different frequency categories (Very common, Common, Low frequency, Rare)
  • slices: each array slice represents a different, independent resampling replicate. For these scenarios, the current number of resampling replicates is 5

The resampling arrays generated in this script are read in at the beginning of the 4_linearModeling.R script.

This resampling script is built to run in the background of a server, with multiple clusters processing in parallel. For this reason, it uses many parallelized apply-family functions. Variable flags are used to indicate which of the analysis scenarios (specified above; N1200, N4800, DNA low mutation) to run resampling analyses on. In addition, for all scenarios except the DNA low mutation scenario, we explore using different levels of filtration for minor allele counts. Functionally, this means removing MSAT and DNA alleles which only appear once (N1) or twice (N2) in the entire population. We compare these results to no filter for minor allele count (N0).

Test case: 14 IUCN Red List Threatened U.S. Oak Species

The script reads in the allelic resampling data arrays generated for simulations generated in Rosenberger et al. 2022, which created species-specific simulations of 14 oak species native to the U.S. It then generates MSSEs and their prediction intervals, for each species.

Loci bootstrapping

Loci boostrapping happens simultaneously to resampling. For each level of loci that's used for the loci bootstrapping, a resampling array (of 5 simulation slices) is generated. Therefore, for each dataset (e.g. MSAT N1200, DNA N4800, etc.), the final output is a list of resampling arrays.

Linear modeling

The script 4_linearModeling.R reads in resampling arrays and uses them to calculate the minimum samples size estimates (MSSE) required for representing 95% of the total allelic diversity found within a population using both MSAT and DNA markers. It also explores the simulation parameters that most greatly impact these MSSEs.

Miscellaneous scripts

The script archived_SimulationsScenarios.R contains simulation parameters for scenarios we explored, but ultimately ended up not using. The script is kept here for documentation purposes.

The script demo_dnaMarkerIssue.R demonstrates the issue discovered when trying to simulation SNP ("DNA") markers using just fastSimcoal. Markers will be concatenated, such that 3 blocks of chromosomes each 500 bp long will, in the resulting genind file, result in a single locus 1,500 bp long. This motivated our usage of strataG.

SimulationOutputs

This folder contains 7 subfolders: 6 containing outputs for simulations used in our analyses, as outlined below, and 1 containing the outputs for demonstrating a DNA marker concatenation issue we found using fastSimcoal (see above):

  • N1200
    • MSAT
    • DNA
  • N4800
    • MSAT
    • DNA
  • DNA low mutation
    • N1200
    • N4800

In each of these subfolders, there are directories for each of our 6 simulation scenarios (2 migration rates * 3 different population configurations). Within those directories are 2 files: a .log file, which describes the steps taken by strataG and fastSimcoal in generating simulations, and a .par file, which contains the simulation parameters.

We are unable to keep the actual files generated from simulations (Arlequin and genind files) on this repostory, due to file size restrictions (see our .gitignore file for more details).

Contact

For questions about this project, open an Issue or contact Austin Koontz.