---
title: "Quantile forecasting with ensembles and combinations"
author: Rob J Hyndman
branding: false
bibliography: refs.bib
output: MonashEBSTemplates::memo
numbersections: false
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE, warning = FALSE, message = FALSE)
library(fpp3)
library(distributional)
set.seed(2020 - 08 - 08)
if (file.exists("cafe.rds")) {
cafe <- readRDS("cafe.rds")
} else {
cafe <- readabs::read_abs(series_id = "A3349870V") %>%
select(date, value) %>%
mutate(date = yearmonth(date)) %>%
as_tsibble(index = date) %>%
filter(
date >= yearmonth("2006 Jan"),
date <= yearmonth("2019 Dec")
)
saveRDS(cafe, "cafe.rds")
}
```
# Forecasting using possible futures
One way to think about forecasting is that we are describing the possible futures that might occur.
Suppose we are interested in forecasting the total sales in Australian cafés and we train an ETS model and an ARIMA model [@fpp3] on the data to the end of 2018. Then we can simulate sample paths from these models to obtain many possible "futures". Figure \@ref(fig:samples) shows the last four years of training data and 5 futures generated from each of the two fitted models.
```{r samples, echo=FALSE, fig.cap="Future sample paths obtained using an ARIMA model and an ETS model for the Australian monthly café turnover.", fig.height=4, fig.width=7}
train <- cafe %>%
filter(year(date) <= 2018)
fit <- train %>%
model(
ETS = ETS(value),
ARIMA = ARIMA(value ~ pdq(d = 1) + PDQ(D = 1)),
SNAIVE = SNAIVE(value)
)
future <- fit %>%
select(-SNAIVE) %>%
generate(times = 5, h = "1 year")
train %>%
filter(year(date) >= 2015) %>%
autoplot(value) +
  geom_line(
    data = future %>% mutate(modrep = paste0(.model, .rep)),
    aes(y = .sim, col = .model, group = modrep)
  ) +
labs(x = "Month", y = "Turnover (A$million)") +
guides(colour = guide_legend("Model"))
```
If we repeat this procedure thousands of times for each model, we can obtain a very clear picture of the probability distribution for each future time period. The means of these sample paths are the traditional point forecasts. Traditional 95% prediction intervals are equivalent to finding the middle 95% of the futures at each forecast horizon.
Simulated future sample paths also allow us to answer many more interesting questions. For example, we may wish to find prediction intervals for the total turnover for the next 12 months. This is surprisingly difficult to handle analytically but trivial using simulations --- we just need to add up the turnover for each of the simulated sample paths, and then compute the relevant percentiles. We might also want to forecast the maximum turnover in any month over the next year. Again, that is a difficult problem analytically, but very easy using simulations. I expect that simulating future sample paths will play an increasingly important role in forecasting practice because it makes difficult problems relatively easy, and allows us to explore what the future might be like in ways that would otherwise be almost impossible.
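For example, the following sketch (not part of the original analysis; the simulation count is illustrative) computes an approximate 95% interval for total turnover over the forecast year from simulated ETS sample paths.
```{r totals, echo=TRUE, dependson='samples'}
# A sketch, not part of the original analysis: sum each simulated sample path
# over the 12-month horizon, then take percentiles of the totals. Replacing
# sum() with max() would instead forecast the maximum monthly turnover.
fit %>%
  select(ETS) %>%
  generate(times = 1000, h = "1 year") %>%
  as_tibble() %>%
  group_by(.rep) %>%
  summarise(total = sum(.sim)) %>%
  summarise(
    lower = quantile(total, 0.025),
    upper = quantile(total, 0.975)
  )
```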
Using simulations in forecasting requires a generative statistical model. This is straightforward with an ARIMA or ETS model, but more difficult if something like a neural network or random forest has been used.
# Quantile forecasting
Almost everyone needs probabilistic forecasts whether they realise it or not. Without some kind of probabilistic forecast or other measure of uncertainty, a point forecast is largely useless as there is no way of knowing how wrong it is likely to be. A simple version of a probabilistic forecast is a prediction interval, which is intended to cover the true value with a specified probability. Another type of probabilistic forecast is the notion of "safety stock", which is the additional stock to be ordered above the point forecast in order to meet demand with a specified probability.
A more sophisticated way of producing probabilistic forecasts is to generate quantile forecasts. For example, a 90% quantile forecast is a value which should exceed the true observation 90% of the time, and be less than the true value 10% of the time. Median forecasts are equivalent to 50% quantile forecasts. Prediction intervals are often constructed in this way --- an 80% prediction interval can be based on the 10% and 90% quantile forecasts. Safety stock can also be computed from quantile forecasts --- set the stock order to be the 95% quantile to ensure the probability of being out of stock is 5%.
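As an illustration (not part of the original analysis, and treating café turnover as a stand-in for demand), a safety-stock style order level can be read directly off simulated sample paths:
```{r safetystock, echo=TRUE, dependson='samples'}
# A sketch, not part of the original analysis: an order-up-to level for the
# first forecast month, set at the 95% quantile of simulated ETS sample paths,
# so that demand exceeds the order with probability of about 5%.
fit %>%
  select(ETS) %>%
  generate(times = 1000, h = 1) %>%
  as_tibble() %>%
  summarise(order_level = quantile(.sim, 0.95))
```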
Any statistical forecasting method can be used to produce quantile forecasts by simulation. We simply need to compute the quantiles at each time from the simulated sample paths. Figure \@ref(fig:quantiles) shows the deciles for the ETS forecasts (i.e., the 10th, 20th, \dots, 90th percentiles).
```{r quantiles, dependson='samples', fig.cap="Blue: Deciles for the ETS forecasts for the Australian monthly café turnover. Black: Observed values.", fig.height=4, fig.width=6}
qf <- fit %>%
select(ETS) %>%
generate(times = 1000, h = "1 year") %>%
as_tibble() %>%
group_by(date) %>%
  summarise(
    qs = quantile(.sim, seq(from = 0.1, to = 0.9, by = 0.1)),
    prob = seq(from = 0.1, to = 0.9, by = 0.1)
  )
qf %>%
ggplot(aes(x = date)) +
geom_line(aes(y = qs, group = prob), col = "blue", alpha = 0.5) +
geom_line(aes(y = value), data = cafe %>% filter(year(date) == 2019)) +
geom_line(aes(y = value), data = train %>% filter(year(date) >= 2015)) +
labs(x = "Month", y = "Turnover (A$million)")
```
# Evaluating quantile forecasts
Most businesses that do forecasting will be familiar with computing accuracy measures for point forecasts, such as MAPE or RMSE values. With quantile forecasts, we need to use some alternative measures.
Quantile scores provide a measure of accuracy for each quantile of interest. Suppose we are interested in the quantile forecast with probability $p$ at future time $t$, and let this be denoted by $f_{p,t}$. That is, we expect the observation at time $t$ to be less than $f_{p,t}$ with probability $p$. For example, an estimate of the 95th percentile would be $f_{0.95,t}$. If $y_{t}$ denotes the observation at time $t$, then the quantile score is
$$
Q_{p,t} = \begin{cases}
2(1 - p) \big(f_{p,t} - y_{t}\big), & \text{if $y_{t} < f_{p,t}$}\\
2p \big(y_{t} - f_{p,t}\big), & \text{if $y_{t} \ge f_{p,t}$} \end{cases}
$$
This is sometimes called the "pinball loss function" because a graph of it resembles the trajectory of a ball on a pinball table. The multiplier of 2 is often omitted, but including it makes the interpretation a little easier. A low value of $Q_{p,t}$ indicates a better estimate of the quantile.
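As a small sketch (the helper function `quantile_score()` is ours, introduced here for illustration), the score can be computed directly:
```{r qscore, echo=TRUE}
# A sketch: the quantile (pinball) score as defined above, vectorised over
# probabilities. With p = 0.5 it reduces to the absolute error |y - f|.
quantile_score <- function(y, f, p) {
  ifelse(y < f, 2 * (1 - p) * (f - y), 2 * p * (y - f))
}
```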
```{r qp, dependson='quantiles'}
fcast <- qf %>%
filter(prob == 0.9, date == yearmonth("2019 Dec")) %>%
pull(qs)
actual <- cafe %>%
filter(date == yearmonth("2019 Dec")) %>%
pull(value)
```
In Figure \@ref(fig:quantiles), the 90% quantile forecast for December 2019 is $f_{0.9,t} = `r round(fcast)`$ and the observed value is $y_t = `r round(actual)`$. Then
$Q_{0.9,t} = 2(1-0.9) (`r round(fcast)` - `r round(actual)`) = `r round(2*(1-0.9) *(fcast - actual))`$.
The quantile score can be interpreted like an absolute error. In fact, when $p=0.5$, the quantile score $Q_{0.5,t}$ is the same as the absolute error. For other values of $p$, the "error" $(y_t - f_{p,t})$ is weighted to take account of how likely it is to be positive or negative. If $p>0.5$, $Q_{p,t}$ gives a heavier penalty when the observation is greater than the estimated quantile than when the observation is less than the estimated quantile. The reverse is true for $p<0.5$.
Often we are interested in the whole forecast distribution (not just a few quantiles), and then we can average the quantile scores over all values of $p$. This gives what is known as the "Continuous Ranked Probability Score" or CRPS [@Gneiting2014-je].
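To make the connection concrete, here is a sketch (with an illustrative probability grid and simulation count, not part of the original analysis) that approximates the CRPS for a single observation by averaging quantile scores, using the `quantile_score()` helper defined earlier:
```{r crpsapprox, echo=TRUE, dependson=c('samples', 'qscore')}
# A sketch: approximate the CRPS for December 2019 under the ETS model by
# averaging the quantile scores over a fine grid of probabilities.
p_grid <- seq(from = 0.01, to = 0.99, by = 0.01)
sims <- fit %>%
  select(ETS) %>%
  generate(times = 1000, h = "1 year") %>%
  as_tibble() %>%
  filter(date == yearmonth("2019 Dec")) %>%
  pull(.sim)
y <- cafe %>%
  filter(date == yearmonth("2019 Dec")) %>%
  pull(value)
mean(quantile_score(y, quantile(sims, p_grid), p_grid))
```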
In the Australian café example, we can compute the CRPS values over the 12 months of 2019 for each of the ARIMA and ETS models. To make it more interpretable, we can also compute the CRPS for a simple seasonal naive model, and then we can calculate the "skill score" equal to the percentage improvement for ARIMA and ETS over seasonal naive.
```{r crps, dependson='samples'}
fcasts <- fit %>%
forecast(h = "1 year")
crps <- fcasts %>%
accuracy(cafe, measures = list(CRPS = CRPS))
snaive_crps <- crps %>%
filter(.model == "SNAIVE") %>%
pull(CRPS)
crps <- crps %>%
mutate(skillscore = 100 * (1 - CRPS / snaive_crps))
crps %>%
  # Sort numerically before the scores are formatted as text
  arrange(skillscore) %>%
  mutate(
    CRPS = sprintf("%.1f", CRPS),
    skillscore = sprintf("%.1f", skillscore)
  ) %>%
  select(-.type) %>%
  rename(
    Model = .model,
    `Skill score` = skillscore
  ) %>%
  knitr::kable(booktabs = TRUE, align = c("l", "r", "r"))
```
Here, ETS is providing the best quantile forecasts with a skill score of `r sprintf("%.1f", crps %>% filter(.model=="ETS") %>% pull(skillscore))`.
# Ensemble forecasting
Ensemble forecasting involves using multiple models and combining the future sample paths to produce the final forecast. If a weighted ensemble is needed, we can make the number of simulations from each model correspond to the required weight, as in the sketch at the end of this section.
Ensemble forecasting has been used in weather forecasting for many years, but is not so widespread in other domains. The logic behind ensemble forecasting is that no model is perfect, and the data did not come from a model. As George Box put it, "all models are wrong, but some are useful" [@box1976science]. Ensembles allow the good features of various models to be included, while reducing the impact of any specific model. They also allow the uncertainty associated with selecting a model to be incorporated into the quantile forecasts.
For the Australian café data, we can combine 10,000 simulated sample paths from each of the ETS and ARIMA models, and compute the resulting quantile forecasts from the combined 20,000 sample paths.
```{r ensemble, dependson=c('samples', 'crps')}
ensemble <- fit %>%
select(-SNAIVE) %>%
generate(times = 10000, h = "1 year") %>%
summarise(
value = dist_sample(list(.sim)),
.mean = mean(value)
) %>%
mutate(
.model = "ENSEMBLE"
) %>%
as_fable(distribution = value, response = "value")
ensemble %>%
accuracy(cafe, measures = list(CRPS = CRPS)) %>%
mutate(
skillscore = 100 * (1 - CRPS / snaive_crps),
CRPS = sprintf("%.1f", CRPS),
skillscore = sprintf("%.1f", skillscore)
) %>%
select(-.type) %>%
rename(
Model = .model,
`Skill score` = skillscore
) %>%
knitr::kable(booktabs = TRUE, align = c("l", "r", "r"))
```
The ensemble forecasts are slightly better than either the ETS or ARIMA forecasts in this case. When the component models use very different information, the benefit of using ensemble forecasts is greater.
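Returning to the weighting idea mentioned at the start of this section, here is a minimal sketch (illustrative weights and simulation counts, not part of the original analysis) of a 2:1 weighted ensemble of ETS and ARIMA, obtained by simulating twice as many sample paths from ETS:
```{r weighted, echo=TRUE, dependson='samples'}
# A sketch, not part of the original analysis: pool sample paths in a 2:1
# ratio so that the pooled quantiles reflect the desired model weights.
weighted <- bind_rows(
  fit %>% select(ETS) %>% generate(times = 2000, h = "1 year") %>% as_tibble(),
  fit %>% select(ARIMA) %>% generate(times = 1000, h = "1 year") %>% as_tibble()
)
weighted %>%
  group_by(date) %>%
  summarise(
    qs = quantile(.sim, c(0.1, 0.5, 0.9)),
    prob = c(0.1, 0.5, 0.9)
  )
```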
# Combination forecasting
Combination forecasting is a related idea that is more widely used in the general forecasting community. This involves taking a weighted average of the forecasts produced from the component models. Often a simple average is used. For more than 50 years we have known that combination forecasting improves forecast accuracy [@Bates1969-dp;@Clemen1989-fz]. One of the reasons for this is that the combination decreases the variance of the forecasts [@Hibon2005-cv] by reducing the uncertainty associated with selecting a particular model.
Combinations are almost always used to produce point forecasts, not probabilistic forecasts. A weighted average of several component forecasts gives a point forecast that is identical to taking the mean of the sample paths from the corresponding weighted ensemble.
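As a quick sketch of this equivalence (not part of the original analysis), we can compare the combination point forecasts with the means of the pooled sample paths computed in the previous section; the two should agree up to simulation error:
```{r meancheck, echo=TRUE, dependson=c('samples', 'ensemble')}
# A sketch: the point forecasts from the equally weighted combination should
# match the means of the pooled ETS/ARIMA sample paths, up to simulation error.
left_join(
  ensemble %>% as_tibble() %>% select(date, ensemble_mean = .mean),
  fit %>%
    mutate(COMBINATION = (ETS + ARIMA) / 2) %>%
    forecast(h = "1 year") %>%
    filter(.model == "COMBINATION") %>%
    as_tibble() %>%
    select(date, combination_mean = .mean),
  by = "date"
)
```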
However, the idea can be used more generally to obtain quantile forecasts as well. Quantiles cannot simply be averaged, so we need to take account of the correlations between the forecast errors from the component models when producing quantile forecasts. This is implemented in the `fable` package for R. For the Australian café data, this gives the following result.
```{r combinations, dependson=c('samples', 'crps')}
fit %>%
mutate(COMBINATION = (ETS + ARIMA) / 2) %>%
forecast(h = "1 year") %>%
filter(.model == "COMBINATION") %>%
accuracy(cafe, measures = list(CRPS = CRPS)) %>%
mutate(
skillscore = 100 * (1 - CRPS / snaive_crps),
CRPS = sprintf("%.1f", CRPS),
skillscore = sprintf("%.1f", skillscore)
) %>%
select(-.type) %>%
rename(
Model = .model,
`Skill score` = skillscore
) %>%
knitr::kable(booktabs = TRUE, align = c("l", "r", "r"))
```
A further improvement has been obtained by taking account of the correlations between the ETS and ARIMA forecast errors, rather than simply pooling the sample paths as in ensemble forecasting.
# Conclusions
I have described several tools that are likely to be increasingly used in business forecasting.
* Simulated future sample paths allow us to study how the future might evolve, and allow us to answer more complicated forecasting questions than is possible with analytical methods.
* Quantile forecasts can be produced from these simulated future sample paths and provide a way of quantifying the forecast distributions.
* Quantile scores allow us to evaluate quantile forecasts. Averaging quantile scores gives the CRPS which allows us to evaluate the whole forecast distribution.
* Forecast ensembles combine information from multiple models and often provide a better estimate of future uncertainty than any individual model.
* Forecast combinations are similar to ensembles but also take account of the relationships between the component models. The best forecasts often come from combining models in this way.
# Supplements
All the forecasts and calculations produced in this chapter were obtained with the `fable` package for R. The code used is available at https://github.com/robjhyndman/quantile_ensembles.