A short analysis of population statistics given specific inputs
Note: this repo is only meant to be used for the authors' purposes and is not meant for broader use. The main purpose is to increase reproducibility for the authors' manuscripts that use this repo. There are no guarantees of usefulness beyond this use.
Also see Appendices for additional specific use-case instructions:
Appendix A - Baseline Summary Benchmark
Requirements:
Installation of related requires special instructions; for Windows or linux follow the tutorial, and for mac there are instructions in the readme for install from .tar.gz:
https://github.com/timothyfrasier/related
Required files are automatically identified on network, or if off network:
# Put essential files in 00_archive, including
# Extraction sheets: Standard abbreviations for species naming: SK, CN, CO, CM, PK, and STH
<species>_PBT_ES.txt or <species>_mix_PBT_ES.txt (base/mix ES)
# Hotspot file:
<species>_hotspot_detail.txt (hotspot file)
# Stock code file:
SNP only: <sp>StockCodesCU.txt or <sp>mixCodes.txt (base/mix stock codes)
microsat only: <sp>StockCodes_microsat.txt
# The input file can be anywhere on your computer, but you can also put it in 02_input_data
Note: new experimental (caution! will overwrite existing files!) function to update files from a source location to a target location (specifically your 00_archive):
update_essential_files()
...where your new essential files (those above) are in "C:/00_GitHub/essential_files/00_GSI_essentials/"
Source 01_scripts/simple_pop_stats_start.R
to activate all functions.
Select if you are on the DFO network or running local.
Select the species being analyzed from the available list.
NOTE: On first install, packages will need to be installed. They are currently commented out at the top of 01_scripts/simple_pop_stats_start.R
. Uncomment to install manually, or if using Rstudio use the yellow box at the top to install. Biocmanager and SNPrelate may require additional, manual effort eg.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SNPRelate")
Most scripts have been evaluated on R3.6.3 and RStudio 1.2.5019 on Windows 10. Additional testing has been performed on Linux, and occasionally R4.0.2, but would recommend continuing with R3.X for now if possible.
By default, the results will go into directories within this repo, unless a custom output director is chosen:
set_output_folder() # works interactively
set_output_folder(result_path= "<your/full/path>")
# note: the directory must already exist
NOTES: Specific observations in microsats, in regards to the input genepop file, please ensure that...
- there is a "POP" line is present between collections, and only contains the string "POP" without anything else, otherwise will not import
- collection names cannot have trailing underscores (e.g., not
Ash_
) as the script will remove it and no longer match stock codes - the second column contains a four- or three-digit numeric year separator (e.g., not
Nass_test 99 1
, but ratherNass_test 9999 1
) (note: three digits was enabled to allow for the cod 999, which is used often as a replacement for an unknown year) - the collection name cannot contain four digits surrounded by underscores or spaces (e.g., not
Nitinat_1997 1997 1
, ratherNitinat97 1997 1
) - the original names.dat file specifies a named region for ALL collections, not a blank (e.g., not
Bulkley NA
)
- "Separated" - the option of
separated = TRUE
andFALSE
is presented in downstreamcalculate_FST
andmake_tree
functions. Please note - this is solely about naming the output file, and has no additional effect on the function - ifseparated = TRUE
in themake_tree
orcalculate_FST
function, it will attempt to add the value ofsep_by
to the file-name. The valuesep_by
here is defined by the selection in theupdate_pop_names
script. As the microsatellite function does not currently output a value forsep_by
whenfix_pop_names
is run, separated == FALSE for all functions downstream of Section 2.
Load a genepop with the following, using 'SNP' or 'microsat':
load_genepop(datatype = "SNP")
Your data will be put into the 'obj', which is a genind object.
To see what populations you have:
unique(pop(obj))
If you are working with MGL_GSI_SNP data, where the sample IDs are all in the composition of:
stockcode_year_indivID_sex
, e.g. 2_2000_9_U
...you can use the following script to rename your populations:
update_pop_names(sep_by = "collection", name_by = "stockname")
Other options:
sep_by
- collection
- collection_and_year
- none (do nothing)
name_by
- stockname
- none (do nothing to keep as stock code)
New: the option add_CU
can be used to append the CU from the stock code file to the population if you use the following command (only compatible with the following):
update_pop_names(sep_by = "collection", name_by = "stockname", add_CU = TRUE)
Made a mistake? Don't worry, just recover your original genind:
obj <- obj.bck
If working with MGL_GSI microsat data, first make a stock code file similar to the SNP stock code file, by inputting a names file that has been formatted as tab-delimited:
prepare_stockcode_file(fix_names = TRUE, names_file.FN = "/path/to/your/tab/delim/names/file.txt")
Then clean up your population names using the following:
fix_pop_names(data = obj, append_region = TRUE, stockcode.FN = "/path/to/your/new/stockcode/file.txt")
note: caution, if everything before the space is not unique between populations could cause populations to be merged. There are some checks in place to prevent this, but good to be aware of it.
If you are re-defining stock codes for a specific analysis in SNPs, you can't use the default stock code file. If you have an alternate file, you need to redefine sc.base
. If you do, use:
sc.base <- "00_archive/custom_stockcodes.txt
As an example.
To find the number of samples, markers, alleles, and sample size per population, use the following:
characterize_genepop(df = obj, pdf_width = 30, pdf_height = 6, cex_names = 0.3, N=30)
Will produce a barchart of sample size per pop (sample_size*.pdf
).
Set the parameters based on your plotting requirements, the above is an example for a very large number of pops. Use N
to set where the line will be drawn across the plot.
To characterize your loci, run the following to get a summary per locus including Fst, Fit, Fis, Hobs, Hexp:
per_locus_stats(data = obj)
...this will produce in 03_results/per_locus_stats_<YYYY-MM-DD>.txt
To remove loci, use the following script that can allow you to remove monomorphic loci, or remove loci using a tab-delimited file with a single column with marker names that are to be removed from the object. This will end up as obj_filt
.
drop_loci(drop_monomorphic = TRUE, drop_file = <path/to/drop/file.txt>)
To remove populations based on a minimum sample size, or based on a tab-delimited text file, use the following:
drop_pops(df = obj_filt, drop_by_pop_size = TRUE, min_indiv = 35, drop_file = NULL, drop_unused = FALSE, names.dat.FN = NULL, drop_indivs = NULL)
Change drop_file
from NULL to a filename to drop named populations. The drop file should be tab-delimited, only one column, with no header.
note: if you have attached regions previously, this will not work as the pattern matching is to the name in the text file.
Change drop_indivs
from NULL to drop specific individuals.
note: the drop_indivs will occur prior to pop size filter; the pop size filter will operate on the remaining samples.
New: drop_unused = TRUE
will allow for the dropping of region codes 98 + 99 from microsatellites. If true, the names.dat.FN
should point to the tab delimited names.dat file created for prepare_stockcode_file
in step 2.
This step will generate obj_pop_filt
.
downsample_pops(data = obj_filt, subset_method = "chosen", set_sample_size = 40)
Other opts: "chosen", "average", "min"
In case you want to use downstream applications for microsat data, you need to replace microsat pop names with SNP pop names. So create a crosswalk file, with the following format:
(Note: first line is header names, keep as shown, following lines are custom):
datatype1, datatype2
popname_microsat, popname_SNP
...save it, and then load it using:
connect_two_datatypes(df = obj_pop_filt, crosswalk.FN = "path/to/crosswalk/file")
To keep certain populations based on a list. Works the exact opposite to the drop_file
from Drop pops.
keep_pops(df = obj_filt, keep_file = NULL)
Change keep_file
from NULL to a path to a file containing names of collections to keep. The keep file should be tab-delimited, with collections in the first column, with no header. Additional columns in file beyond the first are ignored.
note: if you have attached regions or CUs in Step 2 previously, this will not work as the pattern matching is to the name in the text file. If you have done so, please attach the region/CU to the collection name.
eg. Tatchun_R_MYR
instead of Tatchun_R
For this step, you will need your data prepared for analysis in hierfstat, so use the following:
If you only have a genind saved as obj_pop_filt in this example, and it has been separated by something above:
calculate_FST(format="genind", dat = obj_pop_filt, separated = TRUE)
...note: if it has not been separated, run with separated = FALSE.
If you already have a hierfstat object:
calculate_FST(format="hierfstat", dat = obj_pop_filt, separated = FALSE)
...note: same as above, you can use separated=TRUE.
...note: if you used the above to format from genind to hf your hf will be the obj_pop_filt.hf
This will assign your results as pairwise_wc_fst
object within the R environment, and save the results to the 03_results
folder as a file called gen_diff_wcfst_<*>.csv
by default.
If you want to have a custom filename for your FST csv file, use the argument cust_fn
for your basename, which will automatically save into 03_results
.
Note - separated = FALSE for microsatellites (see 00.E)
To build a bootstrapped tree using the aboot
function in poppr:
make_tree(bootstrap = TRUE, boot_obj = obj_pop_filt, nboots = 10000, dist_metric = "edwards.dist", separated = TRUE)
NOTE: Above bootstrapping did not work after upgrading Ape from 5.3 to 5.4. It should be fixed as of Ape 5.4-1 according to: http://ape-package.ird.fr/NEWS
You can also build a non-bootstrapped tree using the previous genetic differentiation object and the NJ
function in phangorn. (bootstrap
must equal FALSE
and tree_method = "NJ"
in this scenario):
make_tree(matrix = pairwise_wc_fst, tree_method = "NJ", separated = TRUE, bootstrap = FALSE)
NOTE: If you need to load your pairwise_wc_fst
from the output file produced by calculate_FST
, please use as an example:
pairwise_wc_fst <- as.matrix(read.table(file="03_results/gen_diff_wcfst_<*>.csv",row.names = 1, header=TRUE,sep=","))
Conduct PCA using:
pca_from_genind(data = obj_pop_filt
, PCs_ret = 3
, plot_eigen=TRUE
, plot_allele_loadings=TRUE
, colour_file = NULL
, retain_pca_obj = TRUE
, parallel = FALSE # set to TRUE to use parallel for pca
)
This will output results into 03_results
, including pca_scores_per_sample.txt
to know each sample's positions on the retained PCA axes (also see pca.obj in global environment).
Note that you can determine variance explained per axis by accessing pca.obj$eig, and can normalize to a percentage by simply taking the eigenvalue as a percentage of the sum of all eigenvalues. more info from adegenet
Conduct DAPC using:
dapc_from_genind(data = obj_pop_filt
, plot_allele_loadings = TRUE # plot and export the discrim. fn. locus variance contributions?
, colour_file = NULL # use custom colours
, n.pca = 10, n.da = 1 # number PC axes to use and discriminant functions to retain
, scree.da = TRUE # plot scree plot for DF
, scree.pca = TRUE, posi.pca = "topright" # plot PCA scree plot
, dapc.width = 7, dapc.height = 5 # PDF filesize for scatterplot
)
Set the number PCA and DA axes to consider as needed (see ?dapc() for more details).
To use custom colours, set the path to a csv file with header 'collection' and 'colour' containing the population name and colour name to set your custom colours.
Note: if you retain more than one discrimant function, the allele loading plot will plot variance contributions of the first two discriminant functions only.
First, convert data (SNP or microsat) from genind to relatedness format and calculate relatedness values:
relatedness_calc(data = obj_pop_filt, datatype = "SNP")
...this will output to 03_results/kinship_analysis_<date>.Rdata
Note: if you are using microsat data, this will depend that you have run 'Rename microsat pops to SNP pop names' above, so that the pop names can be converted to stock codes.
Second, plot your results:
relatedness_plot(file = "03_results/kinship_analysis_<date>.Rdata", same_pops = TRUE, plot_by = "names")
note: if you don't want to convert the current pop designations (e.g., if you don't have a stock code file), then set plot_by = "codes
to avoid any conversion.
...where you can use either "names" or "codes" if using only same-on-same.
...and if you set same_pops
to FALSE, you will get all pops pairwise comparisons. (but can't use names)
To run a test of whether markers are following Hardy-Weinberg proportions, use the following function to find the number and percentage of markers per population that deviate from H-W proportions:
hwe_eval(data = <obj>, alpha = 0.01)
...to produce 'HWE_result_alpha_0.01.txt'
If you have GPS coordinates in the stock code file, you can automatically calculate the distance between all pairs of locations using the following script:
calculate_physical_dist()
...which will output 03_results/physical_distance.txt
Using this file, along with an earlier calculated FST (output of calculate_FST()
above), use the following:
compare_phys_genet_dist(FST_file = "03_results/<your_FST_file>.csv")
...which will put your results into 03_results/pairwise_fst_v_physical_dist.pdf
AMOVA will use repunits and collections to see where the variance exists in your data.
To create a repunit file de novo, run:
calculate_AMOVA(data = obj_pop_filt, build_file = TRUE)
This will output 00_archive/unique_pops.csv
, and fill this out with a new column entitled repunit
to show the higher level groupings in your data.
Once you have a file describing the repunits, run the following:
calculate_AMOVA(data = obj_pop_filt, build_file = FALSE, missing_treat = "mean")
The results will be output into obj_amova
and obj_amova.pegas
.
Other options:
- mean = impute missing
- ignore = do nothing
- zero = convert NA to 0
- genotype = drop indiv w/ missing
Calculate allele frequencies per population using the following:
calculate_allele_freq()
This will output 'freq.df', but to add the actual genotype alleles to the file, use the following:
build_allele_freq_table(freq_file = freq.df)
Note: this assumes that the hotspot file that is currently active is the same as the one that has been used to score all of the samples in your dataset.
NOTE: Script only works if region was NOT appended in step 2 (fix_pop_names()
or update_pop_names()
depending on datatype).
Convert genepop to rubias (SNP):
genepop_to_rubias_SNP(data = obj, sample_type = "reference")
Convert genepop to rubias (microsatellite):
genepop_to_rubias_microsat(data = obj_pop_filt, sample_type = "reference", micro_stock_code.FN = "</path/to/stock/repunit/file>")
For sample_type - must be a choice of "reference" or "mixture" - in almost all cases the default "reference" will be preferred in this pipeline, for both SNPs and microsats.
For microsatellite data To convert the microsat data to rubias, the repunits per stock code must be specified. Use the same stock code file as was used to rename data in step 'Rename microsat data' above. It can be named anything, but make sure it has column names 'collection', and 'repunit', and the collection names should match those ones in the data.
This will output 'rubias_output_microsat.txt' in your results folder, which can be used for simulations (below).
Please note: this currently assumes you have stock name followed by a four digit year identifier in your individual name.
If using custom format (i.e., not in MGL_GSI_SNP format), use the following (SNP only):
genepop_to_rubias_SNP(data = obj, sample_type = "reference", custom_format = TRUE, micro_stock_code.FN = micro_stock_code.FN)
This will require that you have a stock code file (assign the variable micro_stock_code.FN
), as well as a tab-delim file, 02_input_data/my_data_ind-to-pop_annot.txt
in the format of:
indiv pop
101 JPN
1847 FRA
Allele files are created using XXX, but in any case you won't have the sample name anymore. To add this back onto the dataframe, use:
add_sample_name(input_type = "allele", base = TRUE)
...after which you will be prompted to select an allele file (i.e., output of rubias_to_alleles()
of MGL_GSI_Analysis.
Method to separate a rubias baseline file (SNP only) by year for downstream simulations
sep_rubias_by_yr(rubias_base.FN = <rubias_base.FN)>
Note: could use additional testing to validate method.
Using a rubias baseline, perform simulations to test baseline power using the rubias function assess_reference_loo() as follows:
For 100% simulations
full_sim(rubias_base.FN = <full path to rubias baseline>, num_sim_indiv = 200, sim_reps = 100)
...saves output into 03_results, including:
- collection_100_stats_YYYY-MM-DD.txt (main item: summary info of sum of all iteration assignments)
- collection_100_stats_all_pops_YYYY-MM-DD.txt (full info on collection)
- collection_100_stats_all_reps_YYYY-MM-DD.txt (full info on repunits)
- all_collection_results_YYYY-MM-DD.txt.gz (raw output of all sims)
- matrix-style table (used in specific cases, typically with reduced regional baseline analyses) Save to a separate folder to make sure they don't get written over.
Parallel option to speed up (for linux only), add flag: ncore_linux = <# cores to use>
.
To install follow instructions in full_sim.R because requires a 'remote' repo install.
Assume 20 Gb per thread. Note: if it produces warnings, it is likely there will be populations that did not complete due to lack of RAM (start over with fewer threads).
Additional rollups:
- If
region
is specified as a header in the repunits file, can add flag:region=TRUE
to output summary files rolled up to region in addition to repunit. - If you had other non-repunit groupings of collections to test, can add flag:
custom_rollup=TRUE
andcustom_rollup.FN =<path/to/custom/groupings/file>
. Custom groupings file is expected to have headersCode collection group Display_Order
with group defining custom reporting units for each collection.
Note: Additional rollups are considered one or other - cannot run both at the same time currently.
Note: use thse following function to create a 'Display Order' sorted result file for browsing using the repunit file (do not use for plotting):
format_sims_output()
# and select manually your input 100% sims stats text file. This will output to your results file folder.
For realistic fishery scenarios, first use the following function to make a proportions template, then manually edit the file to the custom proportions:
# Prepare proportion file
prep_ppn(rubias_base.FN = <rubias.FN>, ppn.FN = "ppn.txt")
# ...then edit to add proportions in <result.path>/<ppn.FN>
# Run simulation using the ppn file
full_sim(rubias_base.FN = <rubias.FN>
, num_sim_indiv = 200
, sim_reps = 100
, proportions.FN = paste0(current_path_to_set, "ppn_north_orgn.txt")
)
NOTE: not tested with region = TRUE,
Alternate option: can also run 100% simulations using Oncor for the microsatellite data.
Use MGL_GSI README to export an oncor baseline and grouping file. Groupings can be easily edited.
Format of the grouping file:
TITLE LINE
Basin_Cr_RT\t Alsek
Bear_Slough_RT\t Taku
Open oncor.
File>Open Baseline data (and load your baseline)
File>Open Reporting Groups (and load your reporting groups)
Mixture Analysis>100% simulations
Set parameters (e.g., as defaults), and run the analysis.
Note: if you get an error, make sure that there are no cases where a single pop has no data whatsoever for at least one marker (e.g., make sure to remove the SNPs for sockeye, if all pops do not have the SNP data).
Plot summaries of assignment per repunit based on results of the 100% simulations.
Requires 2 files produced in full_sim
, and these are selected interactively so they can be anywhere on your computer.
- collection_100_stats_all_reps_YYYY-MM-DD.txt (not just the top repunit assignment)
- collection_100_stats_all_pops_YYYY-MM-DD.txt (not just the top collection assignment)
plot_summarize_100_sim(axis_label="repunit",repunits_file = TRUE,
plot_prefix = "summarize_100_sim",plot_colls = FALSE,
width=8,height=11,pdf_png="pdf")
# plot_colls = TRUE # will plot by individual collections instead of summarized by repunit
# plot_prefix # allows to change the output filename
# repunits_file = TRUE # allows for selection (interactively) of a regional roll-up if a column is added to the repunits file 'region' (beta)
# also allows for choosing axis label (e.g., CU, CU_Name, repunit) (beta)
If microsats - repunits_file = FALSE
is a must, or you need to create a repunits file. Minimum necessary columns are:
"Display_Order", "CU", "CU_NAME", and "repunit"
With Display_Order column used to sort the figure from lowest (top) to highest (bottom). Without the repunits file it will still plot, but default to reverse alphebetical.
Note: could use the - collection_100_stats_YYYY-MM-DD.txt file probably, but would break the regional roll-up.
Summarize a filtered rubias base (SNP or microsat), and export a table reporting the repunit, CU number, collection, years per collection, and total N.
This is a commonly produced table for publication or baseline benchmarks.
Both the microsatellite and SNP versions require a repunit file. This file should be tab-delimited, with columns "Display_Order", "CU", "CU_NAME", and "repunit".
summarise_rubias_baseline(baseline = <rubias_base filename>
, out_prefix = "rubias_base_summary"
, repunit_desc = <repunits filename>
, by_year=<TRUE or FALSE>
, type = "<SNP or microsat>"
)
This function takes a filtered baseline, an output prefix and a repunit file. Will export to results path.
Do not provide a path in the output prefix, only a prefix for the output file.
Tacked on to the prefix will be .baseline_summary.txt
and output will be tab delimited.
Example:
With by_year=FALSE
, no tally by year will be conducted:
Region/Conservation Unit | CU Number | Population | Years | N |
Kalum_early timing | 49 | CEDAR_RIVER | 1996 | 20 |
Kalum_late timing | 50 | KITSUMKALUM_RIVER-LOWER | 2013, 2014, 2015, 2016 | 559 |
Zymoetz | 80 | THOMAS_CREEK | 2004, 2009, 2010 | 96 |
Sicintine | 81 | SICINTINE_RIVER | 2010 | 115 |
Middle Skeena-mainstem tributaries | 54 | BULKLEY_RIVER-LOWER | 1999 | 96 |
KISPIOX_RIVER | 2004, 2006, 2008, 2010 | 98 | ||
KITSEGUECLA_RIVER | 2009 | 95 | ||
KITWANGA_RIVER | 2003 | 93 | ||
KULDO_CREEK | 2008, 2009 | 95 |
With by_year=TRUE
, tally by year will be conducted:
Region/Conservation Unit | CU Number | Population | Years(N) | N |
Middle Skeena-mainstem tributaries | 54 | BULKLEY_RIVER-LOWER | 1999(96) | 96 |
KISPIOX_RIVER | 2004(59), 2006(28), 2008(3), 2010(8) | 98 | ||
KITSEGUECLA_RIVER | 2009(95) | 95 | ||
KITWANGA_RIVER | 2003(93) | 93 |
To create a summary of collections that are known to contain Juveniles, run:
juvenile_count(by_year=TRUE)
by_year = TRUE
will identify specific years in the baseline, while by_year = FALSE
will simply report the collection name.
ONLY TESTED IN SNPS - CHINOOK + COHO
To determine the highest tray number in a rubias baseline, run:
highest_tray()
It will prompt to select a rubias baseline, then match the individual IDs to the baseline extraction sheet. The highest tray number for an individual in the baseline will be exported to a file <two.letter.code>_highest_tray_number.txt
in the folder 03_results/
ONLY TESTED IN SNPS
To convert a rubias file to a more common VCF format, use:
rubias_to_vcf(output.fn = "03_results/convert_from_rubias.vcf")
It will request to interactively select a *_rubias.txt
file and a *.hs.bed
in the Proton hotspot format. It will also use the default <species>_hotspot_detail.txt
file to do some naming conversions.
This file should deal properly with insertions and deletions markers. It was tested in our coho panel, but not yet in others so beware. Will only work in SNPs - uSat would require significant work to make happen here.