---
format:
html:
number-depth: 3
css: summary-format.css
---
# Generative Models for Classification
Instead of trying to predict the **posterior probability** $Pr(Y=k|X=x)$ directly, these models estimate the distribution of the predictors $X$ separately within each response class $Y$, $f_{k}(X) = Pr(X|Y=k)$. They then use **Bayes' Theorem** together with the **overall or prior probability** $\pi_{k}$ (the probability that a randomly chosen observation comes from the $k$th class) to flip these around into estimates of $Pr(Y=k|X=x)$, thereby approximating the ***Bayes Classifier***, which has the lowest possible ***total error rate***.
$$
p_{k}(x) = Pr(Y = k | X = x) = \frac{\pi_{k} f_{k}(x)} {\sum_{l=1}^{K} \pi_{l} f_{l}(x)}
$$
Estimating the *prior probability* can be as easy as calculating $\hat{\pi}_{k} = n_{k}/ n$ for each $Y$ class, assuming that the training data is representative of the population. Estimating the density function of $X$ within each class, $f_{k}$, is more challenging, so models need to make additional simplifying assumptions to estimate it.
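As a minimal sketch of the prior estimates, using the `Smarket` data from **ISLR** that also appears in the coding examples below, the empirical class proportions can be computed directly:
```{r}
library(dplyr)
library(ISLR)

# Empirical priors: the proportion of training observations in each class
Smarket %>%
  count(Direction) %>%
  mutate(prior = n / sum(n))
```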
## Linear Discriminant Analysis (LDA)
This model assumes that:
- The density function of $X$ for each $Y$ class, $f_{k}$, follows a **Normal (Gaussian) distribution** within each class. Even so, the model is often remarkably robust to violations of this assumption, for example when some predictors are Boolean.
- $X$ has a **different mean** in each $Y$ class: $\mu_{1} \neq \dots \neq \mu_{K}$.
- $X$ has a **common variance** across all $Y$ classes: $\sigma_{1}^2 = \dots = \sigma_{K}^2$.
To understand how the model calculates its parameters, let's see the **discriminant function** when the number of predictors is $p=1$ and the number of $Y$ classes is $K=2$.
$$
\begin{split}
\delta_{k}(x) & = \log{ \left( p_{k}(x) \right)} + c(x) \\
& = \log{(\pi_{k})}
- \frac{\mu_{k}^2}{2\sigma^2}
+ x \cdot \frac{\mu_{k}}{\sigma^2}
\end{split}
$$
Here $c(x)$ collects the terms that are the same for every class $k$, so they can be dropped when comparing classes.
In this function it is clear that a class $k$ is favoured when $x$ lies close to its mean $\mu_{k}$, and the influence of the class means grows as the common variance $\sigma^2$ decreases. It is also important to take into consideration the effect of $\log{(\pi_{k})}$: the class proportions also influence the result.
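To make this concrete, here is a minimal by-hand sketch (an illustration, not the estimator used by the fitted models later in this chapter) that computes $\hat{\pi}_{k}$, $\hat{\mu}_{k}$, the pooled $\hat{\sigma}^2$ and the resulting $\hat{\delta}_{k}(x)$ for the single predictor `Lag1` of the `Smarket` data at an arbitrary value of $x$:
```{r}
library(dplyr)
library(ISLR)

# Per-class estimates for a single predictor (Lag1) and the response Direction
params <- Smarket %>%
  group_by(Direction) %>%
  summarise(n_k     = n(),
            pi_hat  = n() / nrow(Smarket),
            mu_hat  = mean(Lag1),
            var_hat = var(Lag1))

# Pooled (common) variance across classes, as LDA assumes
sigma2_hat <- sum((params$n_k - 1) * params$var_hat) /
  (nrow(Smarket) - nrow(params))

x <- 0.5  # an arbitrary value of the predictor

# Discriminant scores: the class with the largest score is the prediction
params %>%
  mutate(delta = log(pi_hat) - mu_hat^2 / (2 * sigma2_hat) + x * mu_hat / sigma2_hat)
```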
If we want to extend the model to work with $p > 1$ predictors, we assume that $X = (X_{1}, \dots, X_{p})$ follows a **multivariate Gaussian distribution** with a class-specific mean vector and a common covariance matrix, which implies that:
- Each individual predictor follows a one-dimensional normal distribution.
- There may be some correlation between each pair of predictors.
As a result, the *discriminant function* becomes:
$$
\begin{split}
\delta_{k}(x) & = \log{\pi_{k}} - \frac{1}{2} \mu_{k}^T \Sigma^{-1} \mu_{k} \\
& \quad + x^T \Sigma^{-1} \mu_{k}
\end{split}
$$
- Where:
  - $x$ is a vector containing the current value of each of the $p$ predictors.
  - $\mu_{k}$ is the vector of predictor means for class $k$.
  - $\Sigma$ is the $p \times p$ covariance matrix $\text{Cov}(X)$, assumed common to all classes.
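As a rough illustration of the matrix form (again a by-hand sketch under these assumptions, not the **MASS** implementation used later), we can estimate $\hat{\mu}_{k}$, $\hat{\pi}_{k}$ and a pooled $\hat{\Sigma}$ for `Lag1` and `Lag2` in `Smarket` and evaluate $\hat{\delta}_{k}(x)$ at one observation:
```{r}
library(ISLR)

X <- as.matrix(Smarket[, c("Lag1", "Lag2")])
y <- Smarket$Direction
K <- levels(y)

# Class means, prior proportions, and the pooled covariance matrix
mus <- lapply(K, function(k) colMeans(X[y == k, ]))
pis <- sapply(K, function(k) mean(y == k))
Sigma <- Reduce(`+`, lapply(K, function(k) {
  Xk <- scale(X[y == k, ], scale = FALSE)  # centre within the class
  t(Xk) %*% Xk
})) / (nrow(X) - length(K))
Sigma_inv <- solve(Sigma)

x0 <- c(Lag1 = 0.5, Lag2 = -0.2)  # an arbitrary new observation

# delta_k(x) = log(pi_k) - (1/2) mu_k' Sigma^{-1} mu_k + x' Sigma^{-1} mu_k
delta <- sapply(seq_along(K), function(k) {
  log(pis[k]) -
    0.5 * drop(t(mus[[k]]) %*% Sigma_inv %*% mus[[k]]) +
    drop(t(x0) %*% Sigma_inv %*% mus[[k]])
})
setNames(delta, K)
```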
The model can also be extended to handle $K > 2$ classes. After defining the $K$th class as the baseline, the *discriminant function* can be written in terms of the log-odds against that baseline:
$$
\begin{split}
\delta_{k}(x) & = \log{ \left(
\frac{Pr(Y = k|X=x)}
{Pr(Y=K|X=x)}
\right)} \\
& = \log{ \left( \frac{\pi_{k}}{\pi_{K}} \right)}
- \frac{1}{2} (\mu_{k} + \mu_{K})^T \Sigma^{-1} (\mu_{k} - \mu_{K}) \\
& \quad + x^{T} \Sigma^{-1} (\mu_{k} - \mu_{K})
\end{split}
$$
### Coding example
To perform **LDA** we just need to create the model specification by loading the **discrim** package and using the **MASS** engine. Here we train on the `Smarket` years before 2005 and evaluate on 2005.
```{r}
library(tidymodels)
library(ISLR)
library(discrim)

# Train on the years before 2005, test on 2005
Smarket_train <- Smarket %>%
  filter(Year != 2005)

Smarket_test <- Smarket %>%
  filter(Year == 2005)

# LDA specification using the MASS engine
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")

# Fit on the training set and attach predictions for the test set
SmarketLdaPredictions <- lda_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketLdaPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketLdaPredictions, truth = Direction, estimate = .pred_class)
```
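Since `augment()` also returns the estimated posterior probabilities (columns `.pred_Down` and `.pred_Up` for this data), we can sketch a threshold-free evaluation too; the `event_level = "second"` argument assumes the factor levels are ordered `Down`, `Up` as in **ISLR**:
```{r}
# Area under the ROC curve for the "Up" class
roc_auc(SmarketLdaPredictions, truth = Direction, .pred_Up, event_level = "second")

# Plot the ROC curve
roc_curve(SmarketLdaPredictions, truth = Direction, .pred_Up, event_level = "second") %>%
  autoplot()
```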
## Quadratic Discriminant Analysis (QDA)
Like LDA, the QDA classifier plugs estimates of the parameters into Bayes' theorem in order to perform prediction, and it assumes that:
- The observations from each class are drawn from a Gaussian distribution
- Each class has its own **covariance matrix**, $X \sim N(\mu_{k}, \Sigma_{k})$
Under these assumptions, the Bayes classifier assigns an observation $X = x$ to the class for which $\delta_{k}(x)$ is largest.
$$
\begin{split}
\delta_{k}(x) = & \quad \log{\pi_{k}}
- \frac{1}{2} \log{|\Sigma_{k}|}
- \frac{1}{2} \mu_{k}^T \Sigma_{k}^{-1}\mu_{k} \\
& + x^T \Sigma_{k}^{-1} \mu_{k} \\
& - \frac{1}{2} x^T \Sigma_{k}^{-1} x
\end{split}
$$
In consequence, QDA is more flexible than LDA and has the potential to be more accurate in settings where interactions among the predictors are important in discriminating between classes or when we need non-linear decision boundaries.
The model can also be extended to handle $K > 2$ classes. After defining the $K$th class as the baseline, the log-odds against that baseline take the following form:
$$
\log{ \left( \frac{Pr(Y = k|X=x)}{Pr(Y=K|X=x)} \right)} =
a_k + \sum_{j=1}^{p}b_{kj}x_{j} +
\sum_{j=1}^{p} \sum_{l=1}^{p} c_{kjl} x_{j}x_{l}
$$
Where $a_k$, $b_{kj}$ and $c_{kjl}$ are functions of $\pi_{k}$, $\pi_{K}$, $\mu_{k}$, $\mu_{K}$, $\Sigma_{k}$ and $\Sigma_{K}$.
### Coding example
To perform **QDA** we just need to create a new model specification with `discrim_quad()`, again using the **MASS** engine from the **discrim** package.
```{r}
# QDA specification using the MASS engine
qda_spec <- discrim_quad() %>%
  set_mode("classification") %>%
  set_engine("MASS")

# Fit on the training set and attach predictions for the test set
SmarketQdaPredictions <- qda_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketQdaPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketQdaPredictions, truth = Direction, estimate = .pred_class)
```
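To put the two discriminant models side by side, a quick sketch reusing the prediction tibbles created above stacks their test-set accuracies:
```{r}
# Test-set accuracy of both discriminant models, side by side
bind_rows(
  LDA = accuracy(SmarketLdaPredictions, truth = Direction, estimate = .pred_class),
  QDA = accuracy(SmarketQdaPredictions, truth = Direction, estimate = .pred_class),
  .id = "model"
)
```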
## Naive Bayes
To estimate $f_{k}(X)$, this model assumes that *within the $k$th class, the $p$ predictors are independent* (correlation = 0), and as a consequence:
$$
f_{k}(x) = f_{k1}(x_{1}) \times f_{k2}(x_{2}) \times \dots \times f_{kp}(x_{p})
$$
Even though the assumption might not be true, the model often leads to pretty decent results, especially in settings where **_n_ is not large enough relative to _p_** for us to effectively estimate the joint distribution of the predictors within each class. It has been used to classify **text data**, for example to predict whether an **email is spam or not**.
To estimate the one-dimensional density function $f_{kj}$ using the training data, we have the following options (the first and third are illustrated in the sketch after this list):
- We can assume that $X_{j}|Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$.
- We can estimate the distribution by defining bins and creating a histogram.
- We can estimate the distribution by using a kernel density estimator.
- If $X_{j}$ is **qualitative**, we can count the proportion of training observations for the $j$th predictor corresponding to each class.
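As a hedged sketch of the first and third options with the **klaR** engine (the same engine used in the coding example below), the `usekernel` engine argument switches between per-class Gaussian densities and kernel density estimates:
```{r}
library(tidymodels)
library(discrim)

# Gaussian densities within each class (usekernel = FALSE)
nb_gaussian_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = FALSE)

# Kernel density estimates within each class (usekernel = TRUE)
nb_kernel_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = TRUE)
```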
The model can also be extended to handle $K > 2$ classes. After defining the $K$th class as the baseline, the log-odds against that baseline take the following form:
$$
\log{ \left( \frac{Pr(Y = k|X=x)}{Pr(Y=K|X=x)} \right)} =
\log{ \left(
\frac{\pi_{k}}
{\pi_{K}}
\right)}
+
\log{ \left(
\frac{\prod_{j=1}^{p} f_{kj}(x_{j}) }
{\prod_{j=1}^{p} f_{Kj}(x_{j}) }
\right)}
$$
### The infrequent problem
The method has the problem that if your training set does not contain any example of a particular event for a given class, it estimates the probability of that event as 0.
![](img/33-naive-bayes-infrequent-problem.png)
The solution to this problem involves adding a small number, usually '1', to each event and outcome combination to eliminate this veto power. This is called the **Laplace correction** or *Laplace estimator*. After adding this correction, each Venn diagram now has at least a small bit of overlap; there is no longer any joint probability of zero.
![](img/34-naive-bayes-Laplace-correction.png)
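In the **tidymodels** specification used later, this correction can be requested through the `Laplace` argument of `parsnip::naive_Bayes()`; a minimal sketch (assuming **tidymodels** and **discrim** are already loaded, as in the earlier chunks):
```{r}
# Naive Bayes specification with a Laplace correction of 1
nb_laplace_spec <- naive_Bayes(Laplace = 1) %>%
  set_mode("classification") %>%
  set_engine("klaR")
```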
### Pre-processing
This method works better with categories, so if your data has *numeric variables*, try to **bin** them into categories (a small sketch follows this list), for example by:
- Turning an `Age` variable into 'child' or 'adult' categories
- Turning geographic coordinates into geographic regions like 'West' or 'East'
- Turning test scores into four groups by percentile
- Turning the hour of day into 'morning', 'afternoon' and 'evening'
- Turning temperature into 'cold', 'warm' and 'hot'
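A minimal binning sketch using base R's `cut()` on a hypothetical data frame (the `age` and `hour` variables below are illustrative and not part of `Smarket`):
```{r}
library(dplyr)

# Hypothetical data frame with numeric predictors to be binned
people <- tibble(age = c(8, 35, 62), hour = c(7, 14, 20))

people %>%
  mutate(
    age_group  = cut(age,  breaks = c(0, 17, Inf), labels = c("child", "adult")),
    day_period = cut(hour, breaks = c(0, 11, 17, 24),
                     labels = c("morning", "afternoon", "evening"))
  )
```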
As this method works really well when we have few examples and many predictors, we can transform **text documents** into a *Document Term Matrix (DTM)* using a **bag-of-words model** with packages like `tidytext` or `tm`.
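A minimal bag-of-words sketch with `tidytext`, using a tiny made-up pair of documents (the document IDs and text are illustrative; casting to a DTM relies on the `tm` package being installed):
```{r}
library(dplyr)
library(tidytext)

docs <- tibble(
  doc_id = c("email1", "email2"),
  text   = c("win money now", "meeting at noon")
)

# One row per (document, word) with counts, then cast to a Document-Term Matrix
docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_dtm(document = doc_id, term = word, value = n)
```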
### Coding example
To perform **Naive Bayes** we just need to create the model specification by loading the **discrim** package and using the **klaR** engine. We can apply the *Laplace correction* by setting `Laplace = 1` in the `parsnip::naive_Bayes()` function.
```{r}
# Naive Bayes specification using the klaR engine with Gaussian densities
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = FALSE)

# Fit on the training set and attach predictions for the test set
SmarketNbPredictions <- nb_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketNbPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketNbPredictions, truth = Direction, estimate = .pred_class)
```