ch 2,3,4
brownag committed Jan 18, 2025
1 parent 753c23b commit c681192
Showing 3 changed files with 130 additions and 58 deletions.
89 changes: 51 additions & 38 deletions Part1/02-data.Rmd
@@ -799,7 +799,9 @@ Anecdote about using `dplyr` for very simple base R things, and deprecation/pack

### Missing Values

"Missing" values are encoded in R as `NA`. `table()` has a `useNA` argument that affects whether `NA` values are counted. Use `is.na()` to return a logical value that identifies missing data.
"Missing" values are encoded in R as `NA`. `table()` has a `useNA` argument that affects whether `NA` values are counted.

Use `is.na()` to return a logical value that identifies missing data.

```{r, eval=TRUE}
table(pedons$taxsubgrp, useNA = "ifany")
@@ -1015,7 +1017,48 @@ Tools for totally ordered indexed observations -- especially irregular time seri

## Exercise 2: O Horizon Thickness

This exercise will use a synthetic dataset of O horizon thicknesses modeled after https://acsess.onlinelibrary.wiley.com/doi/epdf/10.2136/sh1993.1.0022
This exercise uses a synthetic transect generated with R code. The 10 hypothetical pedons along the transect have all been described using a standardized set of horizon designations: Oi, Oe, A, C, Cg. The thicknesses of layers, the depth to gleyed matrix, and soil organic carbon content vary among the pedons.

We will start by examining this dataset graphically, and then use some functions defined for the _SoilProfileCollection_ to extract information from the data.

```{r, eval=TRUE, echo=FALSE}
load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/ohz.rda"))
plotSPC(ohz)
```

The function `thicknessOf()` can be used to calculate the thickness of horizons matching a regular expression pattern. For example, to calculate the O horizon thickness we can match the pattern `"O"`:

```{r, eval=TRUE}
ohz_thickness <- thicknessOf(ohz, pattern = "O", prefix = "O_")
ohz_thickness
```

We will use the `site()` LEFT JOIN method to merge the data into the _SoilProfileCollection_.

```{r, eval=TRUE}
site(ohz) <- ohz_thickness
```

Now, let's plot the profiles in order of cumulative O horizon thickness. We draw lines at 20 and 40 cm to indicate the thickness thresholds for the histic epipedon and the Histosols soil order:

```{r, eval=TRUE}
plotSPC(ohz, plot.order = order(ohz$O_thickness))
abline(h = c(20, 40), lty = 2)
```

Using the `ohz` object we just inspected, answer the following questions using techniques from the previous sections. Check your logic by comparing your results to the plot above.

1. Check if any soil organic carbon values are `NA`. Make a profile sketch (`plotSPC()`) using organic carbon percentage (`"soc"`) to color the profiles. Then calculate a new column using the function `is.na()` and use that to color the profiles. Which profiles have missing data? In which horizons?

2. Tabulate unique horizon designations (`ohz$name`). How many of each horizon designation are present?

3. Use regular expressions to count the number of Cg horizons in the collection.

4. Filter the `ohz` object using `aqp::subset()` to select pedons where `O_thickness` is greater than `20` and less than `40`. This is the histic epipedon thickness requirement. How many pedons have a histic epipedon?

5. Filter the `ohz` object using `aqp::subset()` to select pedons where `O_thickness` is greater than `40`. This is the Histosols order thickness requirement. How many pedons are Histosols?

Submit an .R script containing code used to answer 1 through 5 to your mentor.

## `fetchNASIS()` data checks

@@ -1043,31 +1086,16 @@ NOTE: some records are missing rock fragment volume

### Inspecting Results

Here we inspect occurrence of andic soil properties in MT647.
We will analyze the occurrence of andic soil properties in pedons from MT647. We will download this "selected set" from the course website as an *.rda* file to save you the effort of crafting your selected set just for this example.

We will download this "selected set" from the course website as an *.rda* file to save you the effort of crafting your selected set just for this example.

```{r, eval=TRUE}
example.data.dir <- "C:/workspace2"
example.data.path <- file.path(example.data.dir, "mt647.rda")
if (!dir.exists(example.data.dir))
dir.create(example.data.dir, recursive = TRUE)
download.file("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt647.rda",
destfile = example.data.path)
```

Downloading and installing the above .rda is equivalent to doing NASIS query "\_PedonPC_Plus_DataDump_select" (MLRA04 Bozeman) for to fill your selected set for User Site ID: `%MT647%`, NASIS Site: `SSRO_Northwest`, NASIS Group: `NW-MIS Point Data`, followed by `mt647 <- fetchNASIS()`.
Downloading and loading the .rda is equivalent to running the NASIS query _SSRO_Northwest: \_PedonPC_Plus_DataDump_select_ for User Site ID: `%MT647%`, NASIS Site: `SSRO_Northwest`, and NASIS Group: `NW-MIS Point Data`. After populating your NASIS selected set, the `mt647` object is created in R with `mt647 <- fetchNASIS()`.

#### Load Example Data

To load the sample object data into R, just use `load()` and the path to the *.rda* file (`example.data.path` or `"C:/workspace2/mt647.rda"`)
To load the sample data object into R, just use `load()` with the path to the *.rda* file. In this case we are using a remote path, so we use `url()` to open the connection:

```{r eval=TRUE}
# load the sample data
example.data.dir <- "C:/workspace2"
load(file.path(example.data.dir, "mt647.rda"))
load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt647.rda"))
```

```{r, eval=TRUE}
@@ -1091,7 +1119,7 @@ get('bad.pedon.ids', envir = soilDB.env)
```

```{r, eval=TRUE}
# by default, rmHzErrors removes the "bad" illogical pedons
# rmHzErrors=TRUE removes the "bad" illogical pedons
any(mt647$upedonid %in% get('bad.pedon.ids', envir=soilDB.env))
```

@@ -1186,27 +1214,12 @@ There are a variety of calculated fields that are included in the default `fetch

Below is a summary of additional information that can be readily brought into R from your NASIS selected set via the `get_extended_data_from_NASIS_db()` function.

To download the sample `2015MT663%` data from the course page with R:

```{r, eval=TRUE}
example.data.dir <- "C:/workspace2"
example.data.path <- file.path(example.data.dir, "mt663.rda")
if (!dir.exists(example.data.dir))
dir.create(example.data.dir, recursive = TRUE)
download.file("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt663.rda",
destfile = example.data.path)
```

Before continuing, *imagine* opening the NASIS client and populating your selected set with `2015MT663%` using a query like "*NSSC Pangaea – POINT-Pedon/Site by User Pedon ID*".

Load the data like we did above.

```{r eval=TRUE}
# load the sample data
example.data.dir <- "C:/workspace2"
load(file.path(example.data.dir, "mt663.rda"))
load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt663.rda"))
```

```{r}
12 changes: 8 additions & 4 deletions Part1/03-eda-application.Rmd
@@ -317,7 +317,7 @@ diff(quantile(clay, p = c(0.25, 0.75)))

### Weighted Summaries

**Weighted Mean** - is the weighted arithmetic average. In the ordinary arithmetic average, each data point is assigned the same weight, whereas in a weighted mean the contribution of each data point to the final average can vary.
**Weighted Mean** - is the weighted arithmetic average. In the ordinary arithmetic average, each data point is assigned the same weight, whereas in a weighted average the contribution of each data point to the mean can vary.

The function `weighted.mean()` takes two major inputs: `x`, a vector of data values, and `w`, a vector of weights. Like other statistics, you can remove missing values with `na.rm=TRUE`.

@@ -329,13 +333,17 @@ h$hzthk <- h$hzdepb - h$hzdept
weighted.mean(h$claytotest, h$hzthk, na.rm = TRUE)
```

The above example is a little contrived since it is applied to all profiles and all horizons: so we only get one value out across `r length(loafercreek)` profiles and `r nrow(loafercreek)` horizons.
The above example is a little contrived since it is applied to all profiles and all horizons. We only get one value out across `r length(loafercreek)` profiles and `r nrow(loafercreek)` horizons.

More commonly use the weighted mean to depth-weighted averages for individual pedons or components and mapunit averages. Depth-weighted averages commonly use horizon thicknesses as weights. Mapunit averages commonly use component percentages as weights. In other cases the weights may be derived from expert knowledge or some other source when we want to scale the contribution of particular observations or variables to a final result.
More commonly, we use the weighted mean for depth-weighted averages of individual pedons or components, and for mapunit averages.

Depth-weighted averages commonly use horizon thicknesses as weights. Mapunit averages commonly use component percentages as weights.

When we want to scale the contribution of particular observations or variables to a final result, weights may be derived from expert knowledge or some other data source.
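
For example, a mapunit-average property can be calculated by weighting component values by their component percentages. Below is a minimal sketch using hypothetical component names and values (not from any real mapunit):

```{r, eval=TRUE}
# hypothetical components of a single mapunit (illustrative values only)
comp <- data.frame(
  compname  = c("Alpha", "Beta", "Gamma"),
  comppct_r = c(60, 25, 15),  # component percentage
  clay_r    = c(22, 27, 35)   # representative clay content (%)
)

# mapunit-average clay content, weighted by component percentage
weighted.mean(comp$clay_r, comp$comppct_r, na.rm = TRUE)
```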

#### Profile-specific Weighted Calculations

A more practical application of `weighted.mean()` is Particle Size Control Section weighted average properties. For this we need to use the horizon thickness _within the control section_ as weights.
A more practical application of `weighted.mean()` is Particle Size Control Section (PSCS) weighted average properties. For this we need to use the horizon thickness _within the control section_ as weights.

The aqp function [`trunc()`](https://ncss-tech.github.io/aqp/reference/glom.html) allows you to remove portions of profiles that are outside specified depth intervals (such as the PSCS boundaries). [`profileApply()`](https://ncss-tech.github.io/aqp/reference/profileApply.html) allows for iterating over all profiles in a _SoilProfileCollection_ to call some function on each one.
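
A minimal sketch of this workflow, assuming the PSCS boundaries are stored in the site-level columns `psctopdepth` and `pscbotdepth` (as returned by `fetchNASIS()`), might look like this:

```{r, eval=FALSE}
library(aqp)
library(soilDB)
data("loafercreek")

# for each profile: keep only horizon material within the control
# section, then compute a thickness-weighted mean clay content
pscs.clay <- profileApply(loafercreek, function(p) {
  p2 <- trunc(p, p$psctopdepth, p$pscbotdepth)
  h <- horizons(p2)
  weighted.mean(h$claytotest, h$hzdepb - h$hzdept, na.rm = TRUE)
})

head(pscs.clay)
```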

87 changes: 71 additions & 16 deletions Part1/04-spatial.Rmd
@@ -799,7 +799,7 @@ Rasters can represent both continuous and categorical (factor) variables. Raster

The categories can be viewed using the `levels()` and `terra::cats()` functions. You can have multiple category columns, and the "active" category can be set with `terra::activeCat()`.

You can use the `terra::classify()` function to assign integer values to each pixel that can be the basis for you categories. Then, you can set the categories associated.
You can use the `terra::classify()` function to assign integer values to each pixel that can be the basis for your categories. Then, you can set the category labels associated with each integer value.

For example, we classify the terra sample elevation dataset into high and low elevation areas. We supply a reclassification matrix of values with three columns. The first column is the "low" end of the class range, the second column is the high end of the class range. The third column contains the new values to assign.

@@ -953,7 +953,7 @@ See [this article](https://kokoalberti.com/articles/geotiff-compression-optimiza
In this exercise you will create a continuous and categorical slope gradient maps from a digital elevation model.

1. Use the sample Tahoe-area LiDAR elevation dataset from the gdalUtilities package or your own dataset as input. If you use your own data, you may want to make a smaller extent with `terra::crop()`.
- `tahoe <- terra::rast(system.file("extdata", "tahoe_lidar_bareearth.tif", package = "gdalUtilities"))`
- `tahoe <- terra::rast(system.file("extdata", "tahoe_lidar_bareearth.tif", package = "gdalUtilities"))`

2. Run `terra::terrain()` to create a slope map with `unit="radians"`

@@ -963,11 +963,11 @@ In this exercise you will create a continuous and categorical slope gradient map

5. Use `terra::classify()` to create a categorical map of slope classes. Use the following breaks and assign the integer values `1` through `5` from lowest to highest.

- 0 to 3%
- 3 to 15%
- 15 to 30%
- 30 to 60%
- \>60%
- 0 to 3%
- 3 to 15%
- 15 to 30%
- 30 to 60%
- \>60%

6. Use `levels()` to set the categories of your raster.

@@ -1361,41 +1361,48 @@ dplyr::group_by(m, mlra, name) %>%
tapply(m$value, list(m$mlra, m$name), mean, na.rm = TRUE)
```

### Example: Faster with `exactextractr`
### Zonal Statistics with `exactextractr`

This example shows how to determine the distribution of Frost-Free Days across a soil series extent.

The data are extracted from the raster data source very rapidly using the `exactextractr` package.
First, we load some sample data. We use `soilDB::seriesExtent()` to get some extent polygons for two series of interest. Then we load some PRISM-derived Frost Free Day estimates.

```{r fig.width=6, fig.height=5, results='hide', eval=TRUE}
```{r results='hide', eval=TRUE}
library(sf)
library(soilDB)
library(terra)
library(lattice)
library(exactextractr)
# 5-10 seconds to download Series Extent Explorer data
series <- c('holland', 'san joaquin')
# make SpatialPolygonsDataFrame
s <- do.call('rbind', lapply(series, seriesExtent))
# load pointer to PRISM data
# load PRISM data
r <- rast('C:/workspace2/chapter-4/FFD.tif')
# transform extent to CRS of raster with sf
s <- st_transform(st_as_sf(s), crs = st_crs(r))
# inspect
r
# transform extent to CRS of raster
s <- st_transform(s, crs = st_crs(r))
# inspect
s
```

#### Directly Returning Extracted Values

Data are extracted from the raster data source very rapidly using the `exactextractr` package.

```{r, fig.width=6, fig.height=5, eval=TRUE}
# use `st_union(s)` to create a MULTIPOINT/MULTILINESTRING/MULTIPOLYGON from single geometries
# use `sf::st_cast(s, 'POLYGON')` to create other types
system.time({ ex <- exactextractr::exact_extract(r, s) })
# ex is a list(), with data.frame [value, coverage_fraction]
# for each polygon in s (we have one MULTIPOLYGON per series)
head(ex[[1]])
# combine all list elements `ex` into single data.frame `ex.all`
# - use do.call('rbind', ...) to stack data.frames row-wise
@@ -1417,6 +1424,54 @@ densityplot(~ value | group, data = ex.all,
)
```

#### Predefined Summary Operations

In the previous example we extracted all values and their coverage fractions into memory so we could make a graph with them in R. This operation does not scale as well to very large rasters where all values would not fit in memory.

[`exactextractr::exact_extract()`](https://search.r-project.org/CRAN/refmans/exactextractr/html/exact_extract.html) has multiple built-in summary statistics we can use. These summary statistics can be processed very efficiently because not all pixels need to be loaded into memory at once. The available methods include weighted variants that account for pixel coverage fraction. You can specify summary operations using the `fun` argument.
For example, `fun="quantile"` calculates quantiles of cell values, weighted by coverage fraction. We used two MULTIPOLYGON geometries corresponding to two series extents, so we get two sets of 3 quantiles for the Frost Free Days (FFD) grid.

```{r, eval=TRUE}
ex2 <- exactextractr::exact_extract(
r,
s,
fun = "quantile",
quantiles = c(0.05, 0.5, 0.95),
full_colnames = TRUE,
append_cols = "series"
)
ex2
```

The list of summary operations available for the `exact_extract()` `fun` argument includes: `"min"`, `"max"`, `"count"`, `"sum"`, `"mean"`, `"median"`, `"quantile"`, `"mode"`, `"majority"`, `"minority"`, `"variety"`, `"variance"`, `"stdev"`, `"coefficient_of_variation"`, `"weighted_mean"`, `"weighted_sum"`, `"weighted_stdev"`, `"weighted_variance"`, `"frac"`, and `"weighted_frac"`.
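
Several of these operations can be requested at once by passing a character vector to `fun`. A minimal sketch, reusing the `r` and `s` objects from above:

```{r, eval=FALSE}
# coverage-fraction-weighted mean and standard deviation of FFD
# for each series extent polygon
exactextractr::exact_extract(r, s, fun = c("mean", "stdev"), append_cols = "series")
```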

Of interest beyond the typical summary statistics are the compositional summaries `"frac"` and `"weighted_frac"`. These two methods calculate the fractional composition of unique raster values within each input feature.

For example, imagine an interpretation for crop suitability. One requirement of this hypothetical crop is a growing season length greater than 250 days. We will estimate the growing season length using Frost Free Days. First we create a classified raster based on our criteria, then we summarize the raster data using the polygon boundaries and `fun="frac"`:

```{r, eval=TRUE}
# calculate a binary raster
# 0: not suitable
# 1: suitable
r2 <- r > 250
levels(r2) <- data.frame(values = 0:1,
suitability = c("Not suitable", "Suitable"))
plot(r2)
ex3 <- exactextractr::exact_extract(
r2,
s,
fun = "frac",
full_colnames = TRUE,
append_cols = "series"
)
ex3
```

From this output we can see that only about `r round(ex3[1, 3] * 100)`% of the area within the Holland series extent polygon has more than 250 Frost Free Days, whereas almost all of the San Joaquin series extent polygon would meet the growing season requirement.

### Example: Summarizing MLRA Raster Data with `lattice` graphics

Lattice graphics are useful for summarizing grouped comparisons.
