---
title: "Code Repository Vignette"
author: "David Wilkinson"
date: "`r Sys.Date()`"
output:
  html_document:
    css: lodestar.css
    toc: yes
    toc_float:
      collapsed: no
    toc_depth: 4
---
This repository stores the code to replicate the analysis from Wilkinson et al (in review) using your own data.
## Formatting your data
As we do not own the data used in our analysis, we have not provided it online as part of the repository. Instead, we provide instructions for replicating our analysis methodology on your own data.
You will require presence/absence data for your species and measured environmental variables at all sites. Different models require data to be provided in different formats.
### MLR
Data must be supplied as follows:
+ Separate `.txt` files for species and environmental data where rows correspond to sites and columns to species and variables respectively. **Do not include column names**. Include an intercept column (all `1`s) at the front of the environmental data.
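
A minimal R sketch of preparing these two files, assuming the data are already in R as a site-by-species 0/1 matrix and a site-by-variable matrix. The object names, output file names, and the space delimiter are illustrative assumptions, not taken from the repository; check which separator the `Mathematica` notebook expects before using it.

```r
# Hypothetical example data: 5 sites, 3 species, 2 environmental variables
set.seed(1)
species <- matrix(rbinom(15, 1, 0.5), nrow = 5, ncol = 3)
env <- matrix(rnorm(10), nrow = 5, ncol = 2)

# Add the intercept column of 1s at the front of the environmental data
env_int <- cbind(1, env)

# Write both files without row or column names
out_dir <- tempdir()  # replace with your own data folder
write.table(species, file.path(out_dir, "species.txt"),
            row.names = FALSE, col.names = FALSE)
write.table(env_int, file.path(out_dir, "env.txt"),
            row.names = FALSE, col.names = FALSE)
```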
### MPR
Data can be supplied as either of the following with some minor code alterations:
+ A single `.csv` file with species observations and environmental variables as columns, and a row for each site.
+ Separate `.csv` files for species and environmental data.
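
As a sketch of the single-file layout (the same layout is described for HPR, LPR, and DPR), the following R code builds a small combined table, writes it to `.csv`, and splits it back into species and environmental parts. The column names and file name are hypothetical, and the analysis scripts may expect different object names.

```r
# Hypothetical combined data: one row per site
combined <- data.frame(
  sp1 = c(1, 0, 1, 1),
  sp2 = c(0, 0, 1, 0),
  temp = c(12.1, 9.8, 15.3, 11.0),
  rain = c(800, 1200, 650, 900)
)
csv_path <- file.path(tempdir(), "combined.csv")
write.csv(combined, csv_path, row.names = FALSE)

# Read the file back and split the columns by name
dat <- read.csv(csv_path)
species_cols <- c("sp1", "sp2")                # assumed species column names
env_cols <- setdiff(names(dat), species_cols)  # everything else is a covariate
Y <- as.matrix(dat[, species_cols])            # presence/absence, sites x species
X <- as.matrix(dat[, env_cols])                # covariates, sites x variables
```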
### HPR
Data can be supplied as either of the following with some minor code alterations:
+ A single `.csv` file with species observations and environmental variables as columns, and a row for each site.
+ Separate `.csv` files for species and environmental data.
### LPR
Data can be supplied as either of the following with some minor code alterations:
+ A single `.csv` file with species observations and environmental variables as columns, and a row for each site.
+ Separate `.csv` files for species and environmental data.
### DPR
Data can be supplied as either of the following with some minor code alterations:
+ A single `.csv` file with species observations and environmental variables as columns, and a row for each site.
+ Separate `.csv` files for species and environmental data.
### HLR-NS
Data must be supplied in a very specific format for this model. For more detailed information refer to the supplementary material of Ovaskainen *et al* (2016a). Replace all instances of `<dataset>` with the name of your dataset.
+ `y_<dataset>.csv`: Your presence-absence data **with no column names**. Sites as rows, species as columns.
+ `X_<dataset>.csv`: Your covariate data **with no column names**. Sites as rows, variables as columns. The first column must be all `1`s for the intercept, the remaining columns are your measured variables.
+ `LF_units_<dataset>.csv`: The model allows latent factors to operate at different scales (e.g. plot, county, state). To set all latent factors to work at the same level (as in this analysis), provide a single column counting from `1` to the number of sites in steps of `1` (e.g. `1`, `2`, ..., `100` for `100` sites).
+ `dist_<dataset>.csv`: A single column of length equal to the number of species. This represents the statistical distribution assumed for the data. Every entry should be `2` for the probit model.
+ `species_<dataset>.txt`: A text file of species names (same order as `y_<dataset>.csv`). Each species on its own line.
+ `covariates_<dataset>.txt`: A text file of variable names (same order as `X_<dataset>.csv`). Each covariate on its own line. Don't forget the intercept column!
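
The file list above can be sketched in R as follows. This is an illustration under stated assumptions rather than the repository's own preparation code: the dataset name `example`, the toy data, the helper `write_plain()`, and the label `intercept` used in the covariate names file are all hypothetical.

```r
dataset <- "example"   # hypothetical dataset name
out_dir <- tempdir()   # replace with your data folder

# Hypothetical data: 4 sites, 2 species, 2 measured variables
Y <- matrix(c(1, 0, 1, 1,
              0, 0, 1, 0), nrow = 4,
            dimnames = list(NULL, c("Species A", "Species B")))
X <- cbind(intercept = 1,
           temp = c(12.1, 9.8, 15.3, 11.0),
           rain = c(800, 1200, 650, 900))

# Helper: write a matrix or vector as a .csv with no header or row names
write_plain <- function(x, prefix) {
  write.table(x, file.path(out_dir, paste0(prefix, "_", dataset, ".csv")),
              sep = ",", row.names = FALSE, col.names = FALSE)
}

write_plain(Y, "y")
write_plain(X, "X")
write_plain(seq_len(nrow(Y)), "LF_units")  # 1, 2, ..., number of sites
write_plain(rep(2, ncol(Y)), "dist")       # 2 = probit, one entry per species

# Name files: one name per line, same order as the columns of Y and X
writeLines(colnames(Y), file.path(out_dir, paste0("species_", dataset, ".txt")))
writeLines(colnames(X), file.path(out_dir, paste0("covariates_", dataset, ".txt")))
```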
### HLR-S
Data must be supplied in a very specific format for this model. For more detailed information refer to the supplementary material of Ovaskainen *et al* (2016a). Replace all instances of `<dataset>` with the name of your dataset.
+ `y_<dataset>.csv`: Your presence-absence data **with no column names**. Sites as rows, species as columns.
+ `X_<dataset>.csv`: Your covariate data **with no column names**. Sites as rows, variables as columns. The first column must be all `1`s for the intercept, the remaining columns are your measured variables.
+ `LF_xy_<dataset>_1.csv`: The spatial (or temporal) coordinates of the units. The number of columns corresponds to the spatial dimension and can be arbitrary. This analysis was run with two columns: latitude and longitude (no column names!), but you are not restricted to this coordinate system.
+ `LF_units_<dataset>.csv`: The model allows latent factors to operate at different scales (e.g. plot, county, state). To set all latent factors to work at the same level (as in this analysis), provide a single column counting from `1` to the number of sites in steps of `1` (e.g. `1`, `2`, ..., `100` for `100` sites).
+ `LF_alpha_<dataset>_1.csv`: The discrete grid prior for alpha (the parameter controlling decay in correlation with spatial distance). The first column gives the values, the second their weights. By default, the first column runs from `0` to `1` in steps of `0.01`, and the second column is `0.005` in every row except the first, which is `0.5`.
+ `dist_<dataset>.csv`: A single column of length equal to the number of species. This represents the statistical distribution assumed for the data. Every entry should be `2` for the probit model.
+ `species_<dataset>.txt`: A text file of species names (same order as `y_<dataset>.csv`). Each species on its own line.
+ `covariates_<dataset>.txt`: A text file of variable names (same order as `X_<dataset>.csv`). Each covariate on its own line. Don't forget the intercept column!
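
The two spatial files that distinguish HLR-S from HLR-NS can be sketched in R as below. The coordinates and the dataset name `example` are hypothetical. The alpha grid follows the default described above, which gives 101 rows whose weights sum to 0.5 + 100 × 0.005 = 1.

```r
dataset <- "example"   # hypothetical dataset name
out_dir <- tempdir()   # replace with your data folder

# Hypothetical site coordinates (latitude, longitude), one row per site
xy <- cbind(c(-37.80, -37.90, -38.00, -38.10),
            c(144.90, 145.00, 145.10, 145.20))

# Default alpha grid: values 0, 0.01, ..., 1 with weight 0.5 on alpha = 0
alpha_values <- seq(0, 1, by = 0.01)
alpha_weights <- c(0.5, rep(0.005, length(alpha_values) - 1))
alpha_prior <- cbind(alpha_values, alpha_weights)

# Write both files with no column names
write.table(xy, file.path(out_dir, paste0("LF_xy_", dataset, "_1.csv")),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(alpha_prior,
            file.path(out_dir, paste0("LF_alpha_", dataset, "_1.csv")),
            sep = ",", row.names = FALSE, col.names = FALSE)
```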
## Fitting models to data
### MLR
Run the `MCMC-final-<dataset>.nb` file in `Mathematica` after modifying the file paths to point to your files. You also need to hard-code your species and variable names in the four lines that follow the `SparseArray` function that loads your presence/absence data:
+ A vector of your species names in this format: `{"Species 1", "Species 2", ..., "Species N"}`.
+ A vector of `1`s of length equal to the number of environmental variables (including the intercept!) in this format: `{1,1,1,1}`.
+ A vector of variable names (the intercept must be called `constant`) in this format: `{"constant", "Var1", ..., "VarK"}`.
+ The same variable names as a nested list, with each name in its own braces (again, the intercept must be called `constant`): `{{"constant"}, {"Var1"}, ..., {"VarK"}}`.
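
With many species, typing these four lists by hand is error-prone. One option is to build the text in R and paste it into the notebook. The helper `mma_list()` and the example names below are hypothetical and are not part of the repository.

```r
species_names <- c("Species 1", "Species 2", "Species 3")  # hypothetical names
var_names <- c("constant", "Var1", "Var2")                 # intercept = "constant"

# Format a vector as a brace-delimited list, optionally quoting each element
mma_list <- function(x, quote = TRUE) {
  if (quote) x <- paste0('"', x, '"')
  paste0("{", paste(x, collapse = ", "), "}")
}

species_line <- mma_list(species_names)
ones_line <- mma_list(rep(1, length(var_names)), quote = FALSE)
vars_line <- mma_list(var_names)
# Nested form: wrap each name in its own braces, then wrap the whole list
nested_line <- mma_list(vapply(var_names, mma_list, character(1)), quote = FALSE)

cat(species_line, ones_line, vars_line, nested_line, sep = "\n")
```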
### MPR
Run the `analysis_script.R` script in `R` after modifying the file paths for your data files. Once it has finished, modify the file paths in `Beta-R-extraction-script.R` and run it.
### HPR
Run the `Pollock_<dataset>.R` script in `R` after modifying the file paths for your data files. Once it has finished, modify the file paths in `data_extraction.R` and run it.
### LPR
Run the `boralJSDM.R` script in `R` after modifying the file paths for your data files. Once it has finished, modify the file paths in `boral_data_extraction.R` and run it.
### DPR
Run the `ClarkJSDM.R` script in `R` after modifying the file paths for your data files. Once it has finished, modify the file paths in `data_extraction_Clark.R` and run it.
### HLR-NS
Run the `model_definitions.m` file in `MATLAB` after setting `basefolder` to the project directory file path, `datafolder` to the `data` folder inside `basefolder`, and `example` to the name of your dataset. Then run `A2_HMSC.m`, `A3_show_estimates.m`, `A4_generate_predictions.m` (after setting `example` to your dataset name), and `A5_<dataset>.m`, in that order.
Then run `Ovaskainen 2016 NS Data Extraction.R` in `R` to extract the outputs in the required formats.
### HLR-S
Run the `model_definitions.m` file in `MATLAB` after setting `basefolder` to the project directory file path, `datafolder` to the `data` folder inside `basefolder`, and `example` to the name of your dataset. Then run `A2_HMSC.m`, `A3_show_estimates.m`, `A4_generate_predictions.m` (after setting `example` to your dataset name), and `A5_<dataset>.m`, in that order.
Then run `Ovaskainen 2016 Data Extraction.R` in `R` to extract the outputs in the required formats.
## Sampler efficiency
The sampler efficiency comparison replicates the above model fitting on the frog dataset but with the same MCMC regime across all models. Use the scripts in this folder instead of the separate ones in the main analysis folders.
## Plotting
Take all of the `.csv` files generated by each model's data extraction script and copy them into the plotting directory. There should be a copy in the folder for each dataset *and* a copy in the `All Species` folder. Then run the relevant plotting scripts after changing the respective file paths.
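
The copying step can be scripted rather than done by hand. The sketch below runs in a temporary directory with a made-up folder layout and a stand-in `.csv` file. Apart from `All Species`, every folder and file name here is an assumption to replace with your own paths.

```r
# Hypothetical folder layout; replace with your own paths
base <- file.path(tempdir(), "plotting_demo")
extract_dir <- file.path(base, "model_output")
dataset_dir <- file.path(base, "Plotting", "frogs")
all_dir <- file.path(base, "Plotting", "All Species")
for (d in c(extract_dir, dataset_dir, all_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for a file produced by a data extraction script
write.csv(data.frame(x = 1), file.path(extract_dir, "Beta_demo.csv"),
          row.names = FALSE)

# Copy every .csv into both the dataset folder and the All Species folder
csv_files <- list.files(extract_dir, pattern = "\\.csv$", full.names = TRUE)
copied_dataset <- file.copy(csv_files, dataset_dir, overwrite = TRUE)
copied_all <- file.copy(csv_files, all_dir, overwrite = TRUE)
```

`file.copy()` returns one logical per file, so `all(copied_dataset)` and `all(copied_all)` give a quick check that nothing was skipped.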