ZZ_TSA_Fall2023_G1_Midterm.qmd

---
title: "ZZ_TSA_G1_2023Fall_Midterm"
editor: source
params:
  print_sol: false
toc: true
toc-location: left
toc-depth: 6
self-contained: true
---

```{r}
library(fpp3)
library(patchwork)
```

## 1. Import the .csv data and format it as a tsibble (**0 points**)

```{r, include=FALSE}
path <- "ZZ_Datasets/spain_unemployment_format.csv"

unemp <- readr::read_csv(path)

# Convert column "date" to a yearmonth object
unemp <- unemp %>% 
      mutate(date = yearmonth(date)) %>% 
      as_tsibble()

unemp
```

```{r, eval=FALSE}
unemp <- readr::read_csv(file.choose())

# Convert column "date" to a yearmonth object
unemp <- unemp %>% 
      mutate(date = yearmonth(date)) %>% 
      as_tsibble()

unemp
```

## 2. Graphical analysis:

### 2.1 Generate a timeplot of the data, adjusting its grid adequately to signal at least the end of every year.

```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITIONAL CODE SNIPPETS
```

```{r, include=params$print_sol}
unemp %>% 
  autoplot() +
  
  labs(y = "Unemployed people in Spain", 
       x = "Time (Years)"
       ) +

  scale_x_yearmonth(date_breaks = "1 year",
                    date_minor_breaks = "5 years") +
  
  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
```

### 2.2 Approximate the trend of the graph using an appropriate moving average. Then depict the timeplot again, with both the original time series and the resulting estimate of the trend. Adjust the x-grid so that the end of every year is clear.

Requirements:

1. **Compute the moving average as a classical decomposition would**.
2. Store the result in a new variable within `unemp` called `trend_class`.
3. Plot the time series in one color and the trend in another color.

```{r, include=params$print_sol}
unemp <- 
  unemp %>% 
  mutate(
    `12-MA` = slider::slide_dbl(unemp, mean,
                .before = 5, .after = 6, .complete = TRUE),
    trend_class = slider::slide_dbl(`12-MA`, mean,
                .before = 1, .after = 0, .complete = TRUE)
  )

# Generate plot
unemp %>% 
  
  autoplot(unemp, colour = "gray") +
  
  scale_x_yearmonth(date_breaks = "1 year",
                    date_minor_breaks = "5 years") +
  
  geom_line(aes(y = trend_class), colour = "#D55E00") +

  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
```

### 2.3 Remove `trend_class` from the time series, that is, compute the **detrended time series**. Store it in a new variable within `unemp` called `detrended_class`

**NOTE**: if you failed to compute the trend component in the previous question, you may compute the classical decomposition using `classical_decomposition()` and use the trend resulting from it to detrend the time series.

Requirements:

1. Store the result in a new variable called `detrended_class` within `unemp`
2. Depict the detrended component on a time-plot with a grid that signals the end of every year.

```{r, include=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS.
```


```{r, include=params$print_sol}
unemp <- 
  unemp %>% 
  mutate(
    detrended_class = unemp - trend_class
  )

unemp %>% 
  
  autoplot(detrended_class) +

  labs(y = "Unemployed people in Spain - detrendedn series", 
       x = "Time (Years)"
       ) +

  scale_x_yearmonth(date_breaks = "1 year",
                    date_minor_breaks = "5 years") +
  
  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
```

### 2.4 Compute the ACF plots indicated below. Also answer the question below:

1. ACF plot of the original time series
2. ACF plot of the detrended time series

Q1: WHAT IS THE DIFFERENCE BETWEEN BOTH PATTERNS? 
Q2: CAN YOU CONCLUDE SOMETHING ABOUT
    THE TIME SERIES FROM THESE GRAPHS?

```{r, include=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS.
```

------

**YOUR ANSWER TO BOTH Q1 AND Q2 GOES HERE - 60 WORDS MAX**

```{r, include=params$print_sol, eval=FALSE}
In the total time series, the effect of the trend is clearly dominant. We only
see the strong and slowly decreasing autocorrelation resulting from the trend.

In the detrended time series, only the seasonal and remainder are present. Here
we can clearly spot yearly seasonality (spikes at 12, 24, 36...)
```

------

```{r, include=params$print_sol}
unemp %>% 
  ACF(unemp, lag_max = 50) %>% 
  autoplot()

unemp %>% 
  ACF(detrended_class, lag_max = 50) %>% 
  autoplot()
```

# 3. STL Decomposition

Perform an STL decomposition. Specifically, create two stl decompositions:

1. STL decomposition with default arguments. Name the dataframe containing the components `stl_default`.
2. STL decomposition with adjusted parameters. Adjust the parameters you deem necessary to improve the decomposition and name the dataframe containing the components `stl_adjust`.

```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```

```{r, include=params$print_sol}
stl_default <- 
  unemp %>% 
    model(
      stl = STL(unemp)
    ) %>% 
    components()

stl_adjust <- 
  unemp %>% 
    model(
      stl = STL(unemp ~ trend(window = 7) + season(window = 7))
    ) %>% 
    components()

stl_default %>% autoplot()
stl_adjust %>% autoplot()
```

# 4. Fitting two models

## 4.1 Fitting the models

Fit the models below to the variable `unemp` (that is, to the original time series). Specifically, fit the following models:

1. `decomposition_model()` using the **drift model** for the *seasonally adjusted series* and **seasonal naïve model** for the *seasonal component*. Call this model `dcmp_drift`.
  * For the **stl decomposition, adjust the parameters as you did in exercise 3.**
2. `decomposition_model()` using a **simple exponential smoothing** model for the *seasonally adjusted series* and a **seasonal naïve model** for the *seasonal component*. Call this model `dcmp_stl`.
  * For the **stl decomposition, adjust the parameters as you did in exercise 3.**
  
Store the result of fitting the models in a variable called `fit`

```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```

```{r, include=params$print_sol}
fit <- 
  unemp %>% 
  model(
    dcmp_drift = decomposition_model(
                        # Decomposition scheme used
                        STL(unemp ~ trend(window = 7) + season(window = 7)),
                        
                        # Model for the seas adjusted component
                        RW(season_adjust ~ drift()),
                        
                        # Model for the seasonal component
                        SNAIVE(season_year)
                  ),
    dcmp_ses = decomposition_model(
                            # Decomposition scheme used
                            STL(unemp ~ trend(window = 7) + season(window = 7)),
                            
                            # Model for the seas adjusted component
                            ETS(season_adjust ~ error("A") + trend("N") + season("N")),
                            
                            # Model for the seasonal component
                            SNAIVE(season_year)
                      )
  )
```

## 4.2 Performing forecasts

Perform 12 step ahead forecasts (1 year) using both models. Store the results in a variable called `fc`.

Then depict the forecasts along with the original time series and the corresponding confidence intervals. Create one figure per model.

```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```

```{r, include=params$print_sol}
fc <- 
  fit %>% forecast(h=12)

fc %>% 
  filter(.model == "dcmp_drift") %>% 
  
  autoplot(unemp) +
  
  scale_x_yearmonth(date_breaks = "1 year",
                    date_minor_breaks = "5 years") +

  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))

fc %>% 
  filter(.model == "dcmp_ses") %>% 
  
  autoplot(unemp) +
  
  scale_x_yearmonth(date_breaks = "1 year",
                    date_minor_breaks = "5 years") +

  # Flip x-labels by 90 degrees
  theme(axis.text.x = element_text(angle = 90))
```

# 4.3 Analyze the residuals of the `dcmp_drift` model.

Analyze **only normality and heteroskedasticity of the residuals** of the model.

**NOTE**: since you have two models in `fit`, to use the `gg_tsresiduals()` command you need to select only one of them first. If you have named them as I indicated, the following code should do the trick. Add additional graphs and tests you deem necessary.

```{r, eval=FALSE}
fit %>% 
  select(dcmp_drift) %>% # Select only dcmp_drift
  gg_tsresiduals()
```

```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```

------

YOUR ANSWER GOES HERE. MAX 60 WORDS.

------

```{r, include=params$print_sol}
fit %>% 
  select(dcmp_drift) %>% # Select only dcmp_drift
  gg_tsresiduals()

model_vals <- 
  fit %>% 
  augment() %>% 
  filter(.model == "dcmp_drift") # Filter to retain only "dcmp_drift" residuals

# Generate qq_plot
p1 <- ggplot(model_vals, aes(sample = .innov))
p1 <- p1 + stat_qq() + stat_qq_line()

# Generate box-plot
p2 <- ggplot(data = model_vals, aes(y = .innov)) +
      geom_boxplot(fill="light blue", alpha = 0.7) +
      stat_summary(aes(x=0), fun="mean", colour= "red") # Include the mean

p1 + p2

# HETERSKEDASTICITY - looking at the time plot produced by gg_tsresiduals()
# From January 2020 onwards, in the viccinity of the COVID crisis, the heteroskedasticity
# of the residuals increases.

# AUTOCORRELATION
# Clearly the residuals are not white noise. The model does not appropriately
# capture the time structure of the data.

# BIAS
mean(model_vals$.innov, na.rm = TRUE)
# The bias of the residuals is around 44. Considering that we are predicting things
# of the order of magnitude 1E6 (number of unemployed people), we may consider this
# negligible. Nonetheless we could always remove the bias from the forecasts.


# OVERALL THIS IS NOT A GOOD MODEL FOR THE DATA AND WE HAVE FITTED THE MODEL TO
# THE ENTIRETY OF THE TIME SERIES WHEN IN ACTUAL FACT IT IS THE LAST PERIOD, THE
# COVID CRISIS, THAT IS MOST RELEVANT TO PERFORM FORECASTS. WE ARE USING PAST DATA
# TO PREDICT FUTURE DATA AND THE STRUCTURE OF THE TIME SERIES HAS RADICALLY CHANGED
# PREDICTIONS UNTIL SUFFICIENT DURING-COVID AND POST-COVID DATA IS AVAILABLE 
# ARE NOT RELIABLE.

# WHEN SUFFICIENT DURING-COVID AND POST-COVID DATA IS AVAILABLE, ONLY THIS DATA
# SHOULD PROBABLY BE CONSIDERED, OR PERHAPS THE DURING-COVID TIME SHOULD BE REMOVED.
```

# 5. STL decomposition analysis

This points builds on the result of point 3. Consider the two decompositions `stl_default` and `stl_adjust` fitted in point 3.

1. Plot the ACF of the remainder of both decompositions.
2. Compute a single number that measures the amount of autocorrelation left in the remainder of each decomposition.

  * For `stl_default` call this single metric `corr_metric_default`
  * For `stl_adjust` call this single metric `corr_metric_adjust`
  * Check that `corr_metric_adjust` < `corr_metric_default`.
  * Compute `corr_metric_default / corr_metric_adjust` and give an interpretation.

```{r, input=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS
```

```{r, include=params$print_sol}
acfs_default <- stl_default %>% ACF(remainder)
acfs_adjust <- stl_adjust %>% ACF(remainder)

acfs_default %>% autoplot()
acfs_adjust %>% autoplot()

corr_metric_default <- sum(acfs_default$acf[1:10]^2)
corr_metric_adjust <- sum(acfs_adjust$acf[1:10]^2)

corr_metric_adjust < corr_metric_default

corr_metric_default / corr_metric_adjust
```

```{r, include=params$print_sol, eval=FALSE}
We can see that corr_metric_default / corr_metric_adjust is approximately 4.7.

Roughly speaking, this means that we have reduced by a factor of four the
autocorrelation in the remainder of the decomposition, which is good since this
structure is now captured by the combination of the trend and seasonal components
instead.
```