---
format:
html:
number-depth: 3
css: summary-format.css
---
# Understanding model performance
## Reducible and irreducible error
The goal when analyzing data is to find a function that, based on some Predictors plus some random noise, explains the Response variable.
$$
Y = f(X) + \epsilon
$$
**$\epsilon$** represents the **random error** and corresponds to the **irreducible error**, as it cannot be predicted using the Predictors in regression models. It has a mean of 0 unless we are missing some relevant Predictors.
In classification models, the **irreducible error** is represented by the **Bayes Error Rate**.
$$
1 - E\left(
\underset{j}{\max} Pr(Y = j|X)
\right)
$$
An error is **reducible** if we can improve the accuracy of $\hat{f}$ by using a more appropriate statistical learning technique to estimate $f$.
The challenge in achieving that goal is that, at the beginning, we don't know how much of the error corresponds to each type.
$$
\begin{split}
E(Y-\hat{Y})^2 & = E[f(X) + \epsilon - \hat{f}(X)]^2 \\
& = \underbrace{[f(X)- \hat{f}(X)]^2}_\text{Reducible} +
\underbrace{Var(\epsilon)}_\text{Irreducible}
\end{split}
$$
The reducible error can also be split into two parts:
- **Variance** refers to the amount by which $\hat{f}$ would change if we estimate it using a different **training data set**. If a method has high variance then small changes in the training data can result in large changes of $\hat{f}$.
- **Squared bias** refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model, such as a linear model. *Bias* is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
$$
E(y_{0} - \hat{f}(x_{0}))^2 =
Var(\hat{f}(x_{0})) +
[Bias(\hat{f}(x_{0}))]^2 +
Var(\epsilon)
$$
::: {.callout-note}
Our challenge lies in finding a method for which both the variance and the squared bias are low.
:::
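As a rough illustration of this trade-off, the sketch below simulates the decomposition: it refits a rigid linear model and a flexible smoothing spline on many simulated training sets and compares the squared bias and variance of their predictions at a single point. The true $f$, the noise level, and the sample sizes are hypothetical choices made only for the illustration.
```{r}
#| code-fold: true
# Simulated bias-variance illustration (hypothetical f, noise and sample sizes)
set.seed(123)
f  <- function(x) sin(2 * pi * x)   # "true" f, unknown in practice
x0 <- 0.25                          # point where we evaluate predictions
n_sims <- 200                       # number of simulated training sets

pred_linear <- pred_flexible <- numeric(n_sims)
for (i in seq_len(n_sims)) {
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = 0.3)   # irreducible error: Var(eps) = 0.09
  pred_linear[i]   <- predict(lm(y ~ x), newdata = data.frame(x = x0))
  pred_flexible[i] <- predict(smooth.spline(x, y, df = 20), x = x0)$y
}

# Squared bias and variance of each method at x0: typically the rigid model
# shows the higher bias and the flexible model the higher variance
c(bias2_linear   = (mean(pred_linear) - f(x0))^2,
  var_linear     = var(pred_linear),
  bias2_flexible = (mean(pred_flexible) - f(x0))^2,
  var_flexible   = var(pred_flexible))
```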
## Types of models
- **Parametric methods**
1. Make an assumption about the functional form. For example, assuming linearity.
2. Estimate a small number of parameters based on training data.
3. Are easy to interpret.
4. Tend to outperform non-parametric approaches when there is a small number of observations per predictor.
- **Non-parametric methods**
1. Don't make an assumption about the functional form, to accurately fit a wider range of possible shapes for $f$.
2. Need a large number of observations in order to obtain an accurate estimate for $f$.
3. The data analyst must select a level of smoothness (degrees of freedom).
![](img/01-accuracy-vs-interpretability.png){fig-align="center"}
## Evaluating Model Performance
Evaluating the performance of a model is a crucial step in the machine learning process. This involves splitting the available data into two distinct parts:
- **Training Data**: This is the data that we use to "train" or "fit" the model. The model learns from this data, which includes both the input features and the corresponding target values.
- **Test Data**: This is the data that we use to evaluate how well our model generalizes to new, unseen data. It's important to note that this data is not used during the training phase.
Comparing the performance metrics on **both datasets** allows us to assess whether our model is **too flexible** (overfitting) or **too simple** (underfitting) for the given task.
To mitigate these issues, we can use techniques such as **regularization** and **hyperparameter tuning**. These techniques help improve the model's ability to generalize, leading to more robust and reliable predictions on unseen data.
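As a minimal sketch of this workflow, assuming a simple linear model on the built-in `mtcars` data and an arbitrary 70/30 split:
```{r}
#| code-fold: true
# Hypothetical 70/30 train/test split on the built-in mtcars data
set.seed(2024)
train_idx <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Fit on the training data only, then evaluate on both sets
fit <- lm(mpg ~ wt + hp, data = train)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A test RMSE much larger than the training RMSE suggests overfitting
c(train_rmse = rmse(train$mpg, predict(fit, train)),
  test_rmse  = rmse(test$mpg,  predict(fit, test)))
```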
### Regression models
- **Ground Truth vs. Predictions Plot**: In this plot we represent the *predictions on the x-axis* and the *outcome on the y-axis* in a **scatter plot** with a reference line of **slope = 1**. If the model predicted perfectly, all the points would lie along this line. If you see regions where the points are entirely above or below the line, it shows that **the errors are correlated with the value of the outcome** (a code sketch follows the figure). Some possible problems could be:
- Not having all the important variables in your model
- Needing an algorithm that can find more complex relationships in the data
![](img/69-ground-truth-vs-predictions-plot.png){fig-align="center"}
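A minimal `ggplot2` sketch of this kind of plot, using a hypothetical linear model on the built-in `mtcars` data:
```{r}
#| code-fold: true
library(ggplot2)

# Hypothetical model: predicting fuel efficiency from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot_df <- data.frame(prediction = predict(fit, mtcars),
                      outcome    = mtcars$mpg)

# Predictions on the x-axis, outcome on the y-axis,
# with a reference line of slope 1 (perfect predictions)
ggplot(plot_df, aes(x = prediction, y = outcome)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, color = "blue") +
  ggtitle("Ground truth vs. predictions")
```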
- **Residual Plots**: It plots the residuals ($y_i - \hat{y}_i$) against the predictions ($\hat{y}_i$). In a model with no systematic errors, the residuals will be evenly distributed between positive and negative, but when there are systematic errors, there will be **clusters of all-positive or all-negative residuals** (a sketch follows the figure).
![](img/70-residual-plots.png){fig-align="center"}
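A corresponding sketch of the residual plot, refitting the same hypothetical model on `mtcars`:
```{r}
#| code-fold: true
library(ggplot2)

# Residuals (y - y_hat) plotted against the predictions
fit <- lm(mpg ~ wt + hp, data = mtcars)
resid_df <- data.frame(prediction = predict(fit, mtcars),
                       residual   = mtcars$mpg - predict(fit, mtcars))

# Clusters of all-positive or all-negative residuals indicate systematic errors
ggplot(resid_df, aes(x = prediction, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "blue") +
  ggtitle("Residuals vs. predictions")
```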
- **Gain curve plot**: It is useful when **sorting the instances** is more important than predicting the exact outcome value, as happens when **predicting probabilities**, where:
- The **x-axis** shows the **fraction of total instances** as sorted by the model.
- The **y-axis** shows the fraction of total accumulated outcome magnitude.
- The **diagonal line** represents the gain curve if the instances are **sorted randomly**.
- The **wizard curve** represents what a **perfect model** would trace out.
- The **blue curve** is what **our model** traces out.
- A **relative Gini coefficient** close to one shows that the model correctly sorts high-outcome instances from low-outcome ones.
```{r}
#| code-fold: true
# Creating a data.frame
set.seed(34903490)
y = abs(rnorm(20)) + 0.1
x = abs(y + 0.5*rnorm(20))
frm = data.frame(model = x, value = y)

# Getting the predicted top 25% most valuable points
# as sorted by the model
gainx = 0.25

# Creating a function to calculate the label for the annotated point
labelfun = function(gx, gy) {
  pctx = gx * 100
  pcty = gy * 100
  paste("The predicted top ", pctx, "% most valuable points by the model\n",
        "are ", pcty, "% of total actual value", sep = '')
}

WVPlots::GainCurvePlotWithNotation(
  frm,
  xvar = "model",
  truthVar = "value",
  title = "Example Gain Curve with annotation",
  gainx = gainx,
  labelfun = labelfun
)
```
- **Test Root Mean Squared Error (RMSE)**: It takes the square root of the MSE metric so that the error is in the same units as the response variable. **Objective: minimize**
To know whether the **RMSE of our model** is high or low, we can compare it with the **standard deviation of the outcome** (`sd(y)`): if the *RMSE is smaller than the standard deviation*, the *model tends to estimate better than simply taking the average* (see the sketch after the formula).
$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$
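A minimal sketch of that comparison, again with a hypothetical model on `mtcars`:
```{r}
#| code-fold: true
# Compare the model's RMSE against the standard deviation of the outcome
fit  <- lm(mpg ~ wt + hp, data = mtcars)
rmse <- sqrt(mean((mtcars$mpg - predict(fit, mtcars))^2))

# RMSE below sd(y) means the model beats simply predicting the mean
c(rmse = rmse, sd_y = sd(mtcars$mpg))
```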
- **Test Root Mean Squared Relative Error (RMSRE)**: It takes into account the magnitude of each outcome, so a given difference is important for low values but matters little for high values. This error tends to decrease after log-transforming the outcome variable. **Objective: minimize**
$$
RMSRE = \sqrt{\frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2}
$$
- **Mean squared error (MSE)**: The squared component results in larger errors having larger penalties. **Objective: minimize**
$$
MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
![](img/01-Training-vs-Test-Error.png){fig-align="center"}
- $R^2$: This is a popular metric that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). But it has many limitations, and you should not place too much emphasis on this metric. **Objective: maximize**
$$
R^2 = 1 - \frac{\overbrace{\sum_{i = 1}^n(y_i - \hat{y}_i)^2}^\text{residual sum of squares}}
{\underbrace{\sum_{i = 1}^n(y_i - \overline{y})^2}_\text{total sum of squares}}
$$
- **Root mean squared logarithmic error (RMSLE)**: When your response variable has a wide range of values, large response values with large errors can dominate the MSE/RMSE metric. RMSLE minimizes this impact so that small response values with large errors can have just as meaningful of an impact as large response values with large errors. **Objective: minimize**
$$
\text{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (\log{(y_i + 1)} - \log{(\hat{y}_i + 1)})^2}
$$
- **Mean absolute error (MAE)**: Similar to MSE but rather than squaring, it just takes the mean absolute difference between the actual and predicted values. **Objective: minimize**
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
$$
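The formulas above translate almost directly into R; a minimal sketch with a small hypothetical vector of actual and predicted values:
```{r}
#| code-fold: true
# Hypothetical actual and predicted values (all positive, so RMSLE is defined)
actual    <- c(3.2, 4.8, 7.5, 10.1, 15.3)
predicted <- c(2.9, 5.5, 7.0, 11.2, 13.8)

mse   <- mean((actual - predicted)^2)
rmse  <- sqrt(mse)
mae   <- mean(abs(actual - predicted))
rmsle <- sqrt(mean((log(actual + 1) - log(predicted + 1))^2))
r2    <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(MSE = mse, RMSE = rmse, MAE = mae, RMSLE = rmsle, R2 = r2)
```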
- **Deviance**: If the response variable distribution is Gaussian, then it will be approximately equal to MSE. When not, it usually gives a more useful estimate of error. It is often used with *classification models* and compares a saturated model (i.e. fully featured model) to an unsaturated model (i.e. intercept only or average) to provide the degree to which a model explains the variation in a set of data. **Objective: minimize**
<br>
### Classification models
- **Error (misclassification) rate**: It represents the overall error. **Objective: minimize**
$$
I(y_{0} \neq \hat{y}_{0}) =
\begin{cases}
1 & \text{If } y_{0} \neq \hat{y}_{0} \\
0 & \text{If } y_{0} = \hat{y}_{0}
\end{cases}
$$
$$
\text{Ave}(I(y_{0} \neq \hat{y}_{0}))
$$
- **Mean per class error**: This is the average error rate for each class. If your classes are balanced this will be identical to misclassification. **Objective: minimize**
$$
\begin{split}
\text{Ave}(& \text{Ave}(I(y_{0} \neq \hat{y}_{0}))_1, \\
& \text{Ave}(I(y_{0} \neq \hat{y}_{0}))_2, \\
& \dots, \\
& \text{Ave}(I(y_{0} \neq \hat{y}_{0}))_\text{n-class})
\end{split}
$$
- **Mean squared error (MSE)**: Computes the distance from 1 to the probability assigned by the model to the correct category ($\hat{p}$). The squared component results in large differences in probabilities for the true class having larger penalties. **Objective: minimize**
$$
MSE = \frac{1}{n} \sum_{i=1}^n (1 - \hat{p}_i)^2
$$
- **pseudo-**$R^2$**:** Measure of the "deviance explained" by any GLM. **Objective: maximize**.
$$
\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}}
$$
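In R, both quantities are stored on any fitted `glm` object, so a minimal sketch (with a hypothetical logistic regression on `mtcars`) is:
```{r}
#| code-fold: true
# Hypothetical logistic regression: transmission type predicted from weight
m <- glm(am ~ wt, data = mtcars, family = binomial)

# pseudo-R^2 = 1 - deviance / null deviance
1 - m$deviance / m$null.deviance
```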
- **Cross-entropy (aka Log Loss or Deviance)**: Similar to MSE, but it incorporates the log of the predicted probability multiplied by the true class, so it disproportionately punishes predictions where we assign a small probability to the true class (*having high confidence in the wrong answer is really bad*). **Objective: minimize**.
- **Gini index**: Mainly used with tree-based methods and commonly referred to as a *measure of purity* where a small value indicates that *a node contains predominantly observations from a single class*. **Objective: minimize**
- **Confusion Matrix**: Compares actual categorical levels (or events) to the predicted categorical levels
![](img/14-confution-matrix.png){fig-align="center"}
Some metrics related to the confusion matrix that need to be **maximized** are listed below (a code sketch follows the list):
- **Accuracy**: Overall, how often is the classifier correct? Opposite of misclassification above. $\frac{\text{TP} + \text{TN}}{N + P}$.
- **Precision**: For the *number of predictions* that we made, how many were correct? $\frac{\text{TP}}{\text{TP} + \text{FP}}$.
- **Sensitivity (aka recall)**: For the *events* that occurred, how many did we predict? $\frac{\text{TP}}{\text{TP} + \text{FN}}$.
- **Specificity**: How accurately does the classifier classify actual negative events? $\frac{\text{TN}}{\text{TN} + \text{FP}}$.
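A minimal sketch computing these four metrics from hypothetical confusion-matrix counts:
```{r}
#| code-fold: true
# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; FN <- 5; TN <- 45

c(accuracy    = (TP + TN) / (TP + TN + FP + FN),
  precision   = TP / (TP + FP),
  sensitivity = TP / (TP + FN),   # aka recall
  specificity = TN / (TN + FP))
```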
- **Area under the curve (AUC)**: A good binary classifier will have high precision and sensitivity. To capture this balance, we often use a *ROC (receiver operating characteristic) curve*, which plots the false positive rate along the x-axis and the true positive rate along the y-axis. A line that is diagonal from the lower-left corner to the upper-right corner represents a random guess. The higher the curve is toward the upper-left corner, the better. AUC computes the area under this curve (a base-R sketch follows the figure).
![](img/15-ROC-curve.png){fig-align="center"}
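A minimal base-R sketch that traces an ROC curve and approximates the AUC for hypothetical scores and labels (no extra package assumed):
```{r}
#| code-fold: true
# Hypothetical scores and true labels (1 = event, 0 = no event)
set.seed(42)
labels <- rbinom(200, 1, 0.4)
scores <- labels + rnorm(200)    # scores loosely correlated with the labels

# True and false positive rates over all score thresholds
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# ROC curve with the diagonal "random guess" reference line
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)

# AUC approximated with the trapezoidal rule
x <- c(0, fpr, 1); y <- c(0, tpr, 1)
sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```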
You can find more metrics in the following table.
![](img/14-confution-matrix-metrics.png){fig-align="center"}