Skip to content

Summary statistics in simulations

Alex Chubaty edited this page Jan 28, 2016 · 4 revisions

Summary statistics in simulations

Often, we want to calculate some summary statistics from an object, such as a the mean value of a map, and plot that as a graph, updated at some interval, so that we can see how that statistic behaves over the simulation.

What's the best way to do this in a SpaDES simulation?

There are two main ways to achieve this for a simulation, both of which are implemented at the module level and differ only in whether you are the author/developer of the module in question.

In the examples below, we will be using a raster of forest age and we will calculate the mean forest age annually in our simulation. As a reminder, the code below shows how to calculate the mean forest age in regular R code (i.e., not directly in a SpaDES simulation).

library(ggplot2)
library(raster)
library(SpaDES)

# create a dummy raster of forest age (based on the init from the randomLandscapes sample module)
inMemory <- TRUE
nx <- 100
ny <- 100
template <- raster(nrows = ny, ncols = nx, xmn = -nx/2, xmx = nx/2, ymn = -ny/2, ymx = ny/2)
speedup <- max(1, nx/5e2)

forestAge <- gaussMap(template, scale = 10, var = 0.1, speedup = speedup, inMemory = inMemory)
forestAge[] <- round(getValues(forestAge),1)*20

Plot(forestAge, new = TRUE)
cellStats(forestAge, "mean")

Note, we are using cellStats() from the raster package, which can efficiently calculate several built-in statistics. Although custom statistics functions can be used with cellStats(), extra care must be taken to make these efficient with large rasters.

Summary statistics for a module that you developed

If you are the author/developer of a module, simply do the following:

  1. Add a new module parameter and output to the module metadata.

    Add a new parameter, summaryInterval, which defines how often the summary statistics will be calculated. Remember, that this interval is based on the time unit of the module, so for a module with timeunit = "year" summary events can be scheduled annually by setting summaryInterval to 1.

    Add a new entry to the outputObjects data.frame in the module metadata at the top of the module code file. This is the object that will store the summary values over the course of the simulation. You can explicitly set this object's name or allow the name to be passed by the user as a global parameter in a simulation.

    defineModule(sim, list(
      # other module metadata not shown
      parameters = rbind(
        # other parameters not shown
        defineParameter("summaryInterval", "integer", 1, NA, NA, "time interval between summarize events")
      ),
      outputObjects = data.frame(
        objectName = "meanForestAge", objectClass = "numeric", other = NA_character_,
        stringsAsFactors = FALSE)
    ))
  2. Add a 'summarize' event.

    Define a new event type and make sure it is scheduled during 'init'. The 'summarize' event will call the new module function moduleNameSummarize() which we will define below. Optionally, the 'summarize' event can reschedule itself (see below) using the new module parameter summaryInterval.

    doEvent.moduleName <- function(sim, eventTime, eventType, debug = FALSE) {
      if (eventType == "init") {
        # do stuff for this event
        sim <- sim$moduleNameInit(sim)
    
        # schedule the next events
        # other event types such as plot and save omitted here
        sim <- scheduleEvent(sim, start(sim), "moduleName", "summarize", .last())
      } else if (eventType == "summarize" ) {
        # do stuff for this event
        sim <- sim$moduleNameSummarize(sim)
    
        # schedule the next event
        sim <- scheduleEvent(sim, time(sim) + params(sim)$moduleName$summaryInterval,
                             "moduleName", "summarize", .last())
      }
      # other event types omitted
      return(invisible(sim))
    }

    Note: the exact name of the 'summarize' event isn't important, as long as it meaningfully conveys what is happening during the event. Because we want to make sure that all other events for a particular time are run before the summary statistic is calculated, we are setting the 'summarize' event priority high using .last() (note that 'save' and 'plot' priorities are typically set even higher using .last()+1 or .last()+2).

  3. Add an new event function that calculates the statistic(s) of interest.

    The 'init' event should also be updated to a) initialize the object which will store the values calculated during the 'summarize' event, and b) create an index object to be used internally to keep track of which element of the statistic vector to update each interval. It's important to pre-allocate the statistic vector to improve code performance.

    ## event functions
    
    moduleNameInit <- function(sim) {
      # all other pieces of this function are omitted
      
      nSamples <- ceiling((end(sim) - start(sim)) / params(sim)$moduleName$summaryInterval)
      sim$meanForestAge <- rep(NA_real_, nSamples)
      sim$meanForestAgeIndex <- 1L
      
      return(sim)
    }    
    
    # other event functions omitted
    
    moduleNameSummarize <- function(sim) {
      sim$meanForestAge[sim$meanForestAgeIndex] <- cellStats(sim$forestAge, "mean")
      sim$meanForestAgeIndex <- sim$meanForestAgeIndex + 1L
    
      # you could Plot the summary here or create a new 'summaryPlot' event
      df <- data.frame(year = 1:length(sim$meanForestAge), meanAge = sim$meanForestAge)
      q <- ggplot(df, aes(x = year, y = meanAge)) +
             geom_point()
      Plot(q)
    }
  4. Update your module's reqdPkgs field in the metadata if you are using additional packages.

Summary statistics for a module not developed by you

In many cases, you will be interested in running simulations using a module developed by someone else. Although you could modify this module per above, adding in the relevant summary outputs, a better approach is to create a new module whose sole purpose it is to summarize the outputs of the other module.

A carefully built summary module can then be reused with multiple other modules, simply by specifying which simulation objects should be summarized.