-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathZZ_TSA_Fall2023_G1_Midterm.qmd
397 lines (287 loc) · 11.9 KB
/
ZZ_TSA_Fall2023_G1_Midterm.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
---
title: "ZZ_TSA_G1_2023Fall_Midterm"
editor: source
params:
print_sol: false
toc: true
toc-location: left
toc-depth: 6
self-contained: true
---
```{r}
library(fpp3)
library(patchwork)
```
## 1. Import the .csv data and format it as a tsibble (**0 points**)
```{r, include=FALSE}
path <- "ZZ_Datasets/spain_unemployment_format.csv"
unemp <- readr::read_csv(path)
# Convert column "date" to a yearmonth object
unemp <- unemp %>%
mutate(date = yearmonth(date)) %>%
as_tsibble()
unemp
```
```{r, eval=FALSE}
unemp <- readr::read_csv(file.choose())
# Convert column "date" to a yearmonth object
unemp <- unemp %>%
mutate(date = yearmonth(date)) %>%
as_tsibble()
unemp
```
## 2. Graphical analysis:
### 2.1 Generate a timeplot of the data, adjusting its grid adequately to signal at least the end of every year.
```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITIONAL CODE SNIPPETS
```
```{r, include=params$print_sol}
unemp %>%
autoplot() +
labs(y = "Unemployed people in Spain",
x = "Time (Years)"
) +
scale_x_yearmonth(date_breaks = "1 year",
date_minor_breaks = "5 years") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
### 2.2 Approximate the trend of the graph using an appropriate moving average. Then depict the timeplot again, with both the original time series and the resulting estimate of the trend. Adjust the x-grid so that the end of every year is clear.
Requirements:
1. **Compute the moving average as a classical decomposition would**.
2. Store the result in a new variable within `unemp` called `trend_class`.
3. Plot the time series in one color and the trend in another color.
```{r, include=params$print_sol}
unemp <-
unemp %>%
mutate(
`12-MA` = slider::slide_dbl(unemp, mean,
.before = 5, .after = 6, .complete = TRUE),
trend_class = slider::slide_dbl(`12-MA`, mean,
.before = 1, .after = 0, .complete = TRUE)
)
# Generate plot
unemp %>%
autoplot(unemp, colour = "gray") +
scale_x_yearmonth(date_breaks = "1 year",
date_minor_breaks = "5 years") +
geom_line(aes(y = trend_class), colour = "#D55E00") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
### 2.3 Remove `trend_class` from the time series, that is, compute the **detrended time series**. Store it in a new variable within `unemp` called `detrended_class`
**NOTE**: if you failed to compute the trend component in the previous question, you may compute the classical decomposition using `classical_decomposition()` and use the trend resulting from it to detrend the time series.
Requirements:
1. Store the result in a new variable called `detrended_class` within `unemp`
2. Depict the detrended component on a time-plot with a grid that signals the end of every year.
```{r, include=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS.
```
```{r, include=params$print_sol}
unemp <-
unemp %>%
mutate(
detrended_class = unemp - trend_class
)
unemp %>%
autoplot(detrended_class) +
labs(y = "Unemployed people in Spain - detrendedn series",
x = "Time (Years)"
) +
scale_x_yearmonth(date_breaks = "1 year",
date_minor_breaks = "5 years") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
### 2.4 Compute the ACF plots indicated below. Also answer the question below:
1. ACF plot of the original time series
2. ACF plot of the detrended time series
Q1: WHAT IS THE DIFFERENCE BETWEEN BOTH PATTERNS?
Q2: CAN YOU CONCLUDE SOMETHING ABOUT
THE TIME SERIES FROM THESE GRAPHS?
```{r, include=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS.
```
------
**YOUR ANSWER TO BOTH Q1 AND Q2 GOES HERE - 60 WORDS MAX**
```{r, include=params$print_sol, eval=FALSE}
In the total time series, the effect of the trend is clearly dominant. We only
see the strong and slowly decreasing autocorrelation resulting from the trend.
In the detrended time series, only the seasonal and remainder are present. Here
we can clearly spot yearly seasonality (spikes at 12, 24, 36...)
```
------
```{r, include=params$print_sol}
unemp %>%
ACF(unemp, lag_max = 50) %>%
autoplot()
unemp %>%
ACF(detrended_class, lag_max = 50) %>%
autoplot()
```
# 3. STL Decomposition
Perform an STL decomposition. Specifically, create two stl decompositions:
1. STL decomposition with default arguments. Name the dataframe containing the components `stl_default`.
2. STL decomposition with adjusted parameters. Adjust the parameters you deem necessary to improve the decomposition and name the dataframe containing the components `stl_adjust`.
```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```
```{r, include=params$print_sol}
stl_default <-
unemp %>%
model(
stl = STL(unemp)
) %>%
components()
stl_adjust <-
unemp %>%
model(
stl = STL(unemp ~ trend(window = 7) + season(window = 7))
) %>%
components()
stl_default %>% autoplot()
stl_adjust %>% autoplot()
```
# 4. Fitting two models
## 4.1 Fitting the models
Fit the models below to the variable `unemp` (that is, to the original time series). Specifically, fit the following models:
1. `decomposition_model()` using the **drift model** for the *seasonally adjusted series* and **seasonal naïve model** for the *seasonal component*. Call this model `dcmp_drift`.
* For the **stl decomposition, adjust the parameters as you did in exercise 3.**
2. `decomposition_model()` using a **simple exponential smoothing** model for the *seasonally adjusted series* and a **seasonal naïve model** for the *seasonal component*. Call this model `dcmp_stl`.
* For the **stl decomposition, adjust the parameters as you did in exercise 3.**
Store the result of fitting the models in a variable called `fit`
```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```
```{r, include=params$print_sol}
fit <-
unemp %>%
model(
dcmp_drift = decomposition_model(
# Decomposition scheme used
STL(unemp ~ trend(window = 7) + season(window = 7)),
# Model for the seas adjusted component
RW(season_adjust ~ drift()),
# Model for the seasonal component
SNAIVE(season_year)
),
dcmp_ses = decomposition_model(
# Decomposition scheme used
STL(unemp ~ trend(window = 7) + season(window = 7)),
# Model for the seas adjusted component
ETS(season_adjust ~ error("A") + trend("N") + season("N")),
# Model for the seasonal component
SNAIVE(season_year)
)
)
```
## 4.2 Performing forecasts
Perform 12 step ahead forecasts (1 year) using both models. Store the results in a variable called `fc`.
Then depict the forecasts along with the original time series and the corresponding confidence intervals. Create one figure per model.
```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```
```{r, include=params$print_sol}
fc <-
fit %>% forecast(h=12)
fc %>%
filter(.model == "dcmp_drift") %>%
autoplot(unemp) +
scale_x_yearmonth(date_breaks = "1 year",
date_minor_breaks = "5 years") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
fc %>%
filter(.model == "dcmp_ses") %>%
autoplot(unemp) +
scale_x_yearmonth(date_breaks = "1 year",
date_minor_breaks = "5 years") +
# Flip x-labels by 90 degrees
theme(axis.text.x = element_text(angle = 90))
```
# 4.3 Analyze the residuals of the `dcmp_drift` model.
Analyze **only normality and heteroskedasticity of the residuals** of the model.
**NOTE**: since you have two models in `fit`, to use the `gg_tsresiduals()` command you need to select only one of them first. If you have named them as I indicated, the following code should do the trick. Add additional graphs and tests you deem necessary.
```{r, eval=FALSE}
fit %>%
select(dcmp_drift) %>% # Select only dcmp_drift
gg_tsresiduals()
```
```{r, include=!params$print_sol}
# YOUR CODE GOES HERE
# DO NOT ADD ADDITINAL CODE SNIPPETS, ALL THE CODE SHALL BE CONTAINED HERE
```
------
YOUR ANSWER GOES HERE. MAX 60 WORDS.
------
```{r, include=params$print_sol}
fit %>%
select(dcmp_drift) %>% # Select only dcmp_drift
gg_tsresiduals()
model_vals <-
fit %>%
augment() %>%
filter(.model == "dcmp_drift") # Filter to retain only "dcmp_drift" residuals
# Generate qq_plot
p1 <- ggplot(model_vals, aes(sample = .innov))
p1 <- p1 + stat_qq() + stat_qq_line()
# Generate box-plot
p2 <- ggplot(data = model_vals, aes(y = .innov)) +
geom_boxplot(fill="light blue", alpha = 0.7) +
stat_summary(aes(x=0), fun="mean", colour= "red") # Include the mean
p1 + p2
# HETERSKEDASTICITY - looking at the time plot produced by gg_tsresiduals()
# From January 2020 onwards, in the viccinity of the COVID crisis, the heteroskedasticity
# of the residuals increases.
# AUTOCORRELATION
# Clearly the residuals are not white noise. The model does not appropriately
# capture the time structure of the data.
# BIAS
mean(model_vals$.innov, na.rm = TRUE)
# The bias of the residuals is around 44. Considering that we are predicting things
# of the order of magnitude 1E6 (number of unemployed people), we may consider this
# negligible. Nonetheless we could always remove the bias from the forecasts.
# OVERALL THIS IS NOT A GOOD MODEL FOR THE DATA AND WE HAVE FITTED THE MODEL TO
# THE ENTIRETY OF THE TIME SERIES WHEN IN ACTUAL FACT IT IS THE LAST PERIOD, THE
# COVID CRISIS, THAT IS MOST RELEVANT TO PERFORM FORECASTS. WE ARE USING PAST DATA
# TO PREDICT FUTURE DATA AND THE STRUCTURE OF THE TIME SERIES HAS RADICALLY CHANGED
# PREDICTIONS UNTIL SUFFICIENT DURING-COVID AND POST-COVID DATA IS AVAILABLE
# ARE NOT RELIABLE.
# WHEN SUFFICIENT DURING-COVID AND POST-COVID DATA IS AVAILABLE, ONLY THIS DATA
# SHOULD PROBABLY BE CONSIDERED, OR PERHAPS THE DURING-COVID TIME SHOULD BE REMOVED.
```
# 5. STL decomposition analysis
This points builds on the result of point 3. Consider the two decompositions `stl_default` and `stl_adjust` fitted in point 3.
1. Plot the ACF of the remainder of both decompositions.
2. Compute a single number that measures the amount of autocorrelation left in the remainder of each decomposition.
* For `stl_default` call this single metric `corr_metric_default`
* For `stl_adjust` call this single metric `corr_metric_adjust`
* Check that `corr_metric_adjust` < `corr_metric_default`.
* Compute `corr_metric_default / corr_metric_adjust` and give an interpretation.
```{r, input=!params$print_sol}
## YOUR CODE GOES HERE
## DO NOT ADD ADDITIONAL CODE SNIPPETS
```
```{r, include=params$print_sol}
acfs_default <- stl_default %>% ACF(remainder)
acfs_adjust <- stl_adjust %>% ACF(remainder)
acfs_default %>% autoplot()
acfs_adjust %>% autoplot()
corr_metric_default <- sum(acfs_default$acf[1:10]^2)
corr_metric_adjust <- sum(acfs_adjust$acf[1:10]^2)
corr_metric_adjust < corr_metric_default
corr_metric_default / corr_metric_adjust
```
```{r, include=params$print_sol, eval=FALSE}
We can see that corr_metric_default / corr_metric_adjust is approximately 4.7.
Roughly speaking, this means that we have reduced by a factor of four the
autocorrelation in the remainder of the decomposition, which is good since this
structure is now captured by the combination of the trend and seasonal components
instead.
```