01-TimeSeries.Rmd

# Where to start {#where}

R has been used by statisticians and data scientists for years, but it is rapidly becoming an essential tool in business intelligence as well. Creating SPC charts can take hours in Excel or Tableau (and can be quite error-prone), but they can be created in seconds with a line or two of R code.   

## R packages

This book uses the following R packages:  

<br>  

```{r loadpackages, message=FALSE, warning=FALSE, eval=FALSE}
library(ggplot2)     # for general plotting
library(lubridate)   # for easier date/time casting
library(forecast)    # for plotting and forecasts
library(qicharts2)   # for simple run charts and control charts
library(seasonal)    # for seasonal adjustment calculations
library(ggseas)      # for on-the-fly seasonal adjustment plotting
library(ggExtra)     # for making line+histogram marginal plots
library(gridExtra)   # for creating multi-graph plots
```

We'll use fake data throughout this book; below are some data sets that we'll use in this chapter and in a few places later:  

<br>  

```{r make_data}
# Create fake process data
set.seed(250)
df = data.frame(Subgroup = seq(as.Date("2006-01-01"), by = "month", length.out = 120),
                Value = 18 + rnorm(120))

# Create a time series (`ts`) object from the df data
# (We'll use this later in the chapter)
df_ts = ts(df$Value, start = c(2006,01), frequency = 12)

# Create fake process data with an upward trend
set.seed(81)
n = 36
x = seq_len(n)   # month index, 1 to n
mb = data.frame(Subgroup = seq(as.Date("2006-01-01"), by = "month", length.out = n), 
            Value = 10000 + (x * 1.25) + rnorm(n, 0, 5))

# Create a time series object from the mb data
mb_ts = ts(mb$Value, start = c(2006,01), frequency = 12)
```


## Start with basic EDA {#histoline}

***Before anything else***, plot your data as a line chart and a histogram (adding a density overlay provides a more "objective" sense of the distribution).  

<br>  

```{r plot_it_first_line, fig.height=3}
# Line plot with loess smoother for assessing trend
p1 = ggplot(df, aes(x = Subgroup, y = Value)) + 
  geom_smooth() + 
  geom_line() 

# Histogram with density overlay
p2 = ggplot(df, aes(Value)) + 
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, color = "gray95") +
  geom_density(color = "blue")

grid.arrange(p1, p2, widths = c(0.65, 0.35))
```

In these plots, consider:  

- The shape of the distribution: symmetrical/skewed, uniform/peaked/multimodal, whether changes in binwidth show patterning, etc.     
- Whether you see any trending, cycles, or suggestions of autocorrelation.     
- Whether there are any obvious outliers or inliers---basically, any points deviating from the expected pattern.     
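
One way to check the bin width point is to count the same values at several bin widths and see whether the overall shape holds. The sketch below is illustrative only: `demo_values`, `bin_counts`, and the three bin widths are assumptions made for this example, not data or settings used elsewhere in this book.

<br>  

```{r binwidth_check}
# Illustrative values only (hypothetical, not the book's `df` data)
set.seed(42)
demo_values = 18 + rnorm(120)

# Count values per bin for a given bin width
# Bins are [k * binwidth, (k + 1) * binwidth); empty interior bins are kept
bin_counts = function(values, binwidth) {
  bin_index = floor(values / binwidth)        # which bin each value falls in
  tabulate(bin_index - min(bin_index) + 1)    # counts from the lowest bin up
}

# Compare the shape of the distribution at three bin widths
demo_widths = c(0.25, 0.5, 1)
demo_counts = lapply(demo_widths, function(w) bin_counts(demo_values, w))
names(demo_counts) = paste0("binwidth_", demo_widths)
demo_counts
```

If the counts look roughly bell-shaped at every width, the shape is probably real; if peaks appear and disappear as the width changes, treat them with caution.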

## Testing assumptions {#testassumptions}

### Trending

You can test whether a process is trending first by eye: does it look like it's trending over a large span of the time series? Then it probably is.  

We don't see that in the above example; in fact, the series is very close to flat, a stable process in spite of the noise. For comparison, below is an example of assessing the trend on data that is actually trending.  

<br>  

```{r trendtest}
# Plot trending data
ggplot(mb, aes(x = Subgroup, y = Value)) + 
  geom_smooth() + 
  geom_line(color = "gray70") +
  geom_point() 
```

The Mann-Kendall trend test is often used as well. It is a non-parametric test of whether the series contains a monotonic trend, linear or not.    

<br>  

```{r mktest}
# Use the trend package's Mann-Kendall trend test
trend::mk.test(mb_ts)
```
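
To make the idea behind the test concrete, here is a minimal base R sketch of the textbook calculation. It assumes no tied values, and the helper names (`mk_s`, `mk_z`) and the short series `mk_demo` are hypothetical, made up for this example. It is meant to show what is being counted, not to reproduce the `trend` package's output exactly.

<br>  

```{r mk_by_hand}
# Mann-Kendall S statistic: later-minus-earlier signs summed over all pairs
mk_s = function(values) {
  # pair_diff[i, j] = values[j] - values[i]
  pair_diff = outer(values, values, function(earlier, later) later - earlier)
  # Keep only pairs where point i comes before point j (upper triangle)
  sum(sign(pair_diff[upper.tri(pair_diff)]))
}

# Normal-approximation z score with continuity correction (assumes no ties)
mk_z = function(values) {
  n_obs = length(values)
  s = mk_s(values)
  var_s = n_obs * (n_obs - 1) * (2 * n_obs + 5) / 18
  (s - sign(s)) / sqrt(var_s)
}

# Hypothetical short series with an upward drift (not the book's `mb` data)
mk_demo = c(10001, 10004, 10002, 10007, 10006, 10010, 10009, 10013)

mk_s(mk_demo)                     # S: positive when later values tend to be higher
mk_z(mk_demo)                     # z score
2 * pnorm(-abs(mk_z(mk_demo)))    # two-sided p-value from the normal approximation

# Kendall's tau between the values and their time order measures the same thing
cor.test(mk_demo, seq_along(mk_demo), method = "kendall")
```

Because only the sign of each pairwise difference is used, the size of any single jump does not matter, which is why the test picks up monotonic trends whether or not they are linear.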

Because trends can be an indication of special cause variation in a stable process, standard control limits don't make sense around long-trending data, and calculation of center lines and control limits will be incorrect. **Thus, any SPC tests for special causes other than trending will *also* be invalid over long-trending data.** Use a run chart with a median slope instead, e.g., via quantile regression (as seen in [Chapter 6](#runtrend)).  
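
As a rough illustration of why flat limits mislead on trending data, the sketch below computes individuals-chart limits (mean plus or minus 2.66 times the average moving range) for a trending series and lists the points that fall outside them. The helper `i_chart_limits` and the objects `trend_demo`, `limits_demo`, and `outside_demo` are hypothetical names for this example; the series is generated the same way as `mb` above.

<br>  

```{r trend_limits_demo}
# Individuals-chart limits: mean +/- 2.66 * average moving range
i_chart_limits = function(values) {
  center = mean(values)
  avg_mr = mean(abs(diff(values)))
  c(lcl = center - 2.66 * avg_mr, center = center, ucl = center + 2.66 * avg_mr)
}

# Hypothetical trending series, generated the same way as `mb` above
set.seed(81)
trend_demo = 10000 + (1:36) * 1.25 + rnorm(36, 0, 5)

limits_demo = i_chart_limits(trend_demo)
limits_demo

# Flag each point's position relative to the flat limits
outside_demo = data.frame(
  month = 1:36,
  below_lcl = trend_demo < limits_demo["lcl"],
  above_ucl = trend_demo > limits_demo["ucl"]
)

# Show only the months that fall outside the limits, if any
outside_demo[outside_demo$below_lcl | outside_demo$above_ucl, ]
```

The center line is a single mean, so early values tend to sit below it and late values above it. Any points flagged here reflect the trend itself rather than a separate special cause.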

### Independence and autocorrelation

For either run charts or control charts, the data points must be independent for the guidelines to be effective. The first test of that is conceptual---do you expect that one value in this series will influence a subsequent value? For example, the incidence of some hospital-acquired infections can be the result of previous infections. Suppose one happens at the end of March and another happens at the start of April in the same unit, caused by the same organism---you might suspect that the monthly values would not be independent. 

After considering the context, a second way to assess independence is to calculate the autocorrelation function (ACF) of the time series. Autocorrelation values over 0.50 generally indicate problems, as do patterns in the autocorrelation function (described in [Chapter 8](#timedep)). However, *any* significant autocorrelation should be weighed carefully against the cost of potential false positive or false negative signals, because autocorrelation undermines the run chart and control chart interpretation guidelines.
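
The autocorrelation values can also be inspected as numbers rather than read off a plot. The sketch below uses base R's `acf()` with plotting turned off, and compares each lag against the 0.50 rule of thumb and against the approximate 95% bounds that ACF plots commonly draw for an independent series. The objects `acf_demo`, `acf_values`, and `acf_bound` are hypothetical names; the series is generated the same way as `df` above.

<br>  

```{r acf_by_number}
# Hypothetical independent series, generated the same way as `df` above
set.seed(250)
acf_demo = ts(18 + rnorm(120), start = c(2006, 1), frequency = 12)

# Autocorrelation at lags 1 to 24, without drawing the plot
# (the first element is lag 0, which is always 1, so drop it)
acf_values = acf(acf_demo, lag.max = 24, plot = FALSE)$acf[-1]

# Approximate 95% bounds for an independent series of this length
acf_bound = qnorm(0.975) / sqrt(length(acf_demo))
acf_bound

# Lags outside the approximate bounds, and lags above the 0.50 rule of thumb
which(abs(acf_values) > acf_bound)
which(abs(acf_values) > 0.5)
```

With 24 lags and 95% bounds, one or two lags can land just outside the bounds by chance alone, so isolated small exceedances matter less than large values or a clear pattern across lags.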

For control charts, autocorrelated data will result in control limits that are too narrow. Data with seasonality (predictable up-and-down patterns) or cycles (irregular up-and-down patterns) will have control limits that are too wide. There are diagnostic plots and patterns that help identify each, but the best test is "what does it look like?" If the series seems to be going up and down and the control limits don't, the limits are probably wrong.
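
One way to see where the narrow limits come from: an individuals chart estimates process variation from the average moving range, and neighboring points in a positively autocorrelated series sit close together, so that estimate tends to understate the overall spread. The sketch below assumes a simulated AR(1) series with coefficient 0.8; `ar_demo`, `sigma_mr`, and `sigma_overall` are hypothetical names used only here.

<br>  

```{r ar1_limits_demo}
# Hypothetical positively autocorrelated series: AR(1) with coefficient 0.8
set.seed(10)
ar_demo = as.numeric(arima.sim(model = list(ar = 0.8), n = 200))

# Sigma as an individuals chart estimates it: average moving range / 1.128
sigma_mr = mean(abs(diff(ar_demo))) / 1.128

# Overall standard deviation of the same series
sigma_overall = sd(ar_demo)

# A ratio well below 1 means the chart's limits would be too narrow
c(sigma_mr = sigma_mr, sigma_overall = sigma_overall,
  ratio = sigma_mr / sigma_overall)
```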

The `forecast` package's `ggtsdisplay` function provides a view of the time series along with its ACF and spectrum (plotted against frequency, which is the reciprocal of the time period). Significant autocorrelation is present if any bars extend beyond the blue dashed lines in the ACF plot (bottom left). Cycles or seasonality are present if you see a clear peak (or peaks) in the spectrum plot (bottom right).  

<br>  

```{r ts_eda}
# Plot series, ACF plot, and spectral plot
ggtsdisplay(df_ts, plot.type = "spectrum") 
```

These plots show no autocorrelation and no seasonal or cyclical patterns in the data: there are no obvious patterns and no bars that cross the blue lines in the ACF plot (bottom left), and there are no peaks in the spectral density plot (bottom right). See [Chapter 8](#timedep) for what these plots can look like when you have time-dependent or otherwise autocorrelated data.
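
For contrast, the sketch below shows how a peak in the spectrum translates back into a period, using a simulated monthly series with a built-in 12-month cycle. It uses a raw periodogram from base R's `spec.pgram()`, which is an assumption made for simplicity and may differ from the smoothed estimate that `ggtsdisplay` draws. `season_demo`, `pgram_demo`, and `peak_freq` are hypothetical names for this example.

<br>  

```{r spectrum_peak_demo}
# Hypothetical monthly series with a 12-month cycle (not the book's data)
set.seed(5)
season_demo = ts(18 + sin(2 * pi * (1:120) / 12) + rnorm(120, 0, 0.5),
                 start = c(2006, 1), frequency = 12)

# Raw periodogram, without drawing the plot
pgram_demo = spec.pgram(season_demo, taper = 0, plot = FALSE)

# Frequency with the most power; for a monthly `ts` this is in cycles per year
peak_freq = pgram_demo$freq[which.max(pgram_demo$spec)]

# Period is the reciprocal of frequency: years per cycle, then months per cycle
c(peak_freq = peak_freq,
  period_years = 1 / peak_freq,
  period_months = 12 / peak_freq)
```

A peak near one cycle per year corresponds to a period of about 12 months, which is what an annual seasonal pattern in monthly data would produce.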

When you do have such data, you cannot use standard SPC tools. Generalized additive models (GAMs or GAMMs) can be useful alternatives; see [Chapter 13](#useful) for some good initial references.  

<br>

***  

```{r warn_pic, echo=FALSE, fig.align="center"}
knitr::include_graphics("images/tipicon.png")
```

**Understanding your data is a fundamental prerequisite of SPC work. Do *not* move on to SPC work until you have explored your data using the techniques demonstrated above and fully understand whether the data are suitable for SPC tools.**

***