# Linear Classifiers
$\mathcal F = \{\text{all linear classifiers}\}$, $\mathcal X = R^d$, $\mathcal Y = \{-1,1\}$ or $\{1,\cdots,K\}$. We focus mainly on the binary case.
Let $\beta \in R^d$ and $\beta_0 \in R$. Define the hyperplane $\mathcal H_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 = 0\}$ and the half-spaces $\mathcal H^+_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 \geq 0\}$ and $\mathcal H^-_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 < 0\}$. The associated linear classifier is
$$
\mathfrak f_{\beta, \beta_0}(x):=\begin{cases}1 & \text{if } <\beta,x> + \beta_0 \geq 0 ~(\Leftrightarrow x \in \mathcal H^+_{\beta,\beta_0})\\-1 & \text{if } <\beta,x> + \beta_0 < 0 ~(\Leftrightarrow x \in \mathcal H^-_{\beta,\beta_0})\end{cases}
$$
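As a small illustration, here is a minimal R sketch of this decision rule; the function name `linear_classifier` and the example vectors are made up for illustration, not taken from any package.
```{r, eval=FALSE}
# Minimal sketch of the linear classifier f_{beta, beta_0} (illustrative names)
linear_classifier <- function(x, beta, beta0) {
  if (sum(beta * x) + beta0 >= 0) 1 else -1   # sign of <beta, x> + beta_0
}

linear_classifier(c(1, 2), beta = c(0.5, -1), beta0 = 1)  # <beta,x> + beta_0 = -0.5, so returns -1
```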
- Question: How do we find a linear classifier from observations $(x_i, y_i)$, $i=1,\cdots,n$?
A: There are three main ways to tune $\beta, \beta_0$ from data: LDA (linear discriminant analysis) or QDA (quadratic discriminant analysis), logistic regression, and perceptrons/SVMs.
## LDA or QDA
Let $\mathcal Y = \{1, \cdots, K\}$ and \((X,Y) \sim \rho\), where \(P(Y=k) = \omega_k\) and \(X \mid Y=k \sim N(\mu_k, \Sigma_k)\) with conditional density \(\rho_{X|Y}(\cdot \mid Y=k)\). We then have:
$$
\begin{aligned}
\mathfrak f_{Bayes}(x) &= \underset{k=1,\cdots, K}{\operatorname{argmax}} P(Y=k | X=x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} \frac{\rho_{X|Y}(x | Y=k) \cdot \omega_k}{\rho_X(x)}\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} [\rho_{X|Y}(x | Y=k) \cdot \omega_k]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} [log\Big (\rho_{X|Y}(x | Y=k)\Big ) + log(\omega_k)]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmin}} [-log\Big (\frac{1}{(2\pi)^{d/2} (det(\Sigma_k))^{1/2}} \Big ) + \frac{1}{2} <\Sigma_k^{-1} (x-\mu_k), (x-\mu_k)> - log(\omega_k)]\\
&:= \underset{k=1,\cdots, K}{\operatorname{argmin}} \delta_k(x)
\end{aligned}
$$
Observation: \(\delta_k(x)\) is quadratic and convex in \(x\).
**QDA**: What if we only have \((x_i, y_i)\), \(i=1,\cdots,n\)? We use the observations to estimate \(\mu_k, \omega_k, \Sigma_k\), \(k=1,\cdots, K\).
:::{.example}
$$
\hat \mu_k := \frac{1}{N_k}\sum_{i ~s.t. ~y_i=k} x_i, \quad \text{where } N_k := \#\{i ~s.t. ~y_i=k\}
$$
$$
(\hat \Sigma_k)_{lm} := (\frac{1}{N_k} \sum_{i ~s.t. ~y_i=k} x_{il}x_{im} - \hat \mu_{kl} \hat \mu_{km}), \text{ where } l=1,\cdots,d, ~m=1,\cdots,d.
$$
$$
\hat \omega_k = \frac{N_k}{n}
$$
$$
\hat \delta_k(x) := \delta_k(x) \text{ with } \mu_k, \Sigma_k, \omega_k \text{ replaced by } \hat \mu_k, \hat \Sigma_k, \hat \omega_k
$$
:::
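As a hedged illustration, the following R sketch computes these plug-in estimates on simulated two-class Gaussian data and classifies a new point by $\underset{k}{\operatorname{argmin}} ~ \hat \delta_k(x)$. All object names are invented for this example; in practice one would typically call `MASS::qda` instead.
```{r, eval=FALSE}
set.seed(1)
# Simulated data: two Gaussian classes in R^2
n <- 200
y <- sample(1:2, n, replace = TRUE)
x <- t(sapply(y, function(k) rnorm(2, mean = c(0, 3)[k])))

# Plug-in estimates omega_k, mu_k, Sigma_k for each class k
# (cov() divides by N_k - 1 instead of N_k; the difference is negligible here)
qda_fit <- lapply(1:2, function(k) {
  xk <- x[y == k, , drop = FALSE]
  list(omega = nrow(xk) / n,
       mu    = colMeans(xk),
       Sigma = cov(xk))
})

# delta_k(x) up to the constant (d/2) log(2*pi), which is the same for every k
delta <- function(x0, p) {
  0.5 * log(det(p$Sigma)) +
    0.5 * t(x0 - p$mu) %*% solve(p$Sigma) %*% (x0 - p$mu) -
    log(p$omega)
}

# QDA prediction for a new point: argmin_k delta_k(x0)
x0 <- c(0.2, 2.5)
which.min(sapply(qda_fit, function(p) delta(x0, p)))
```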
**LDA**: What if we had assumed \(\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K = \Sigma\)?
$$
\begin{aligned}
\mathfrak f(x) &= \underset{k=1,\cdots, K}{\operatorname{argmin}} [\frac{1}{2} <\Sigma^{-1}x, x> - <\Sigma^{-1}x, \mu_k> + \frac{1}{2} <\Sigma^{-1}\mu_k, \mu_k> - log(\omega_k)]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmin}} [-<\Sigma^{-1}x, \mu_k> + \frac{1}{2} <\Sigma^{-1}\mu_k, \mu_k> - log(\omega_k)]\\
&:= \underset{k=1,\cdots, K}{\operatorname{argmin}} l_k(x), ~(l_k(x) \text{ is linear in } x)
\end{aligned}
$$
The quadratic term $\frac{1}{2} <\Sigma^{-1}x, x>$ is the same for every $k$, which is why it drops out in the second line. We can estimate \(\mu_k, \omega_k\) by \(\hat \mu_k, \hat \omega_k\) as before, and estimate the common \(\Sigma\) from the full data set (e.g. by the pooled covariance of the class-centered observations).
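A minimal sketch in R, using the `MASS` package on the built-in `iris` data (`lda()` fits exactly this model with a pooled covariance estimate, and `qda()` fits the unpooled model of the previous section):
```{r, eval=FALSE}
library(MASS)  # provides lda() and qda()

# LDA: estimates omega_k, mu_k and a single pooled Sigma, then classifies
# each observation by the linear scores l_k(x)
fit_lda <- lda(Species ~ ., data = iris)
head(predict(fit_lda)$class)

# QDA: one covariance matrix Sigma_k per class (quadratic decision boundaries)
fit_qda <- qda(Species ~ ., data = iris)
head(predict(fit_qda)$class)
```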
## Logistic Regression
Let $\mathcal Y = \{1, \cdots, K\}$, \(\vec \beta_k \in R^d\), \(\beta_{0k} \in R\) for \(k=1,\cdots,K-1\), and suppose \((X,Y)\) satisfies \(P(Y=k | X=x) = \frac{exp(<x,\vec \beta_k> + \beta_{0k})}{1 + \sum_{l=1}^{K-1} exp(<x,\vec \beta_l> + \beta_{0l})}\) for \(k=1,\cdots,K-1\), and \(P(Y=K | X=x) = \frac{1}{1 + \sum_{l=1}^{K-1} exp(<x,\vec \beta_l> + \beta_{0l})}\). Let \(\varphi_k(x) := exp(<x,\vec \beta_k> + \beta_{0k})\) for \(k=1,\cdots,K-1\) and \(\varphi_K(x):=1\). Then we have:
$$
\begin{aligned}
\mathfrak f_{Bayes} (x) &= \underset{k=1,\cdots, K}{\operatorname{argmax}} P(Y=k | X=x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} \varphi_k(x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} log(\varphi_k(x))
\end{aligned}
$$
What if we have only observed \((x_i, y_i)\), \(i=1,\cdots,n\)? We use the observations to estimate the parameters.
:::{.example #mle name="MLE"}
Given the data, find the best parameters (the ones maximizing the likelihood of the observations), i.e.
$$
\{(\vec \beta_k^*, \beta_{0k}^*)\}_{k=1,\cdots,K-1} = \underset{\{(\vec \beta_k, \beta_{0k})\}_{k=1,\cdots,K-1}}{\operatorname{argmax}} \prod_{i=1}^n P(Y = y_i | X = x_i)
$$
:::
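As a hedged sketch, this maximum likelihood problem is solved numerically; in R, `glm(..., family = binomial)` handles the binary case and `nnet::multinom` the multiclass case (illustrated on the built-in `iris` data):
```{r, eval=FALSE}
# Binary case (K = 2): logistic regression fitted by maximum likelihood
iris2 <- subset(iris, Species != "setosa")
iris2$Species <- droplevels(iris2$Species)
fit_bin <- glm(Species ~ Sepal.Length + Sepal.Width, data = iris2, family = binomial)
coef(fit_bin)  # beta_0 and beta

# Multiclass case (K = 3): nnet::multinom maximizes the same likelihood
library(nnet)
fit_multi <- multinom(Species ~ ., data = iris)
head(predict(fit_multi))
```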
## Perceptrons and SVMs
Let $\mathcal Y = \{-1, 1\}$ and let \((x_i, y_i)_{i=1,\cdots,n}\) be the observations. For parameters \((\vec \beta, \beta_0)\), recall the classifier
$$
\mathfrak f_{\vec \beta, \beta_0}(x):=\begin{cases}1 & <\vec \beta,x> + \beta_0 \geq 0 ~(\Leftrightarrow x \in \mathcal H^+_{\vec \beta,\beta_0})\\-1 & <\vec \beta,x> + \beta_0 < 0 ~(\Leftrightarrow x \in \mathcal H^-_{\vec \beta,\beta_0})\end{cases}
$$
and define the total distance of the misclassified points to the hyperplane,
$$
\sigma(\vec \beta, \beta_0) := \sum_{i \in \mathcal M_{\vec \beta, \beta_0}} dist(x_i, \mathcal H_{\vec \beta, \beta_0}),
$$
where \(\mathcal M_{\vec \beta, \beta_0} = \{i \text{ s.t. } \mathfrak f_{\vec \beta, \beta_0} (x_i) \neq y_i\}\) is the set of misclassified observations and \(dist(x_i, \mathcal H_{\vec \beta, \beta_0}) = \frac{|<\vec \beta, x_i> + \beta_0|}{\|\vec \beta\|}\).
### Perceptrons
Let \((\vec \beta^*, \beta_0^*) := \underset{(\vec \beta, \beta_{0})}{\operatorname{argmin}} \sigma(\vec \beta, \beta_0)\); the perceptron classifier is then \(\mathfrak f_{\vec \beta^*, \beta_0^*}\). The perceptron problem may have many solutions, but some separating hyperplanes appear more robust than others (a sketch of the classical perceptron algorithm follows the example below):
:::{.example #perceptron}
```{r message=FALSE, warning=FALSE, echo=FALSE}
library(tidyverse)
library(patchwork)

# Two linearly separable clusters of points (classes -1 and 1)
example <- data.frame(x = rep(1:4, each = 3),
                      y = rep(1:3, 4),
                      cols = rep(c(-1, 1), each = 6))

# Hyperplane 1: a tilted separating line passing close to both classes
p1 <- example %>% ggplot() +
  geom_point(aes(x, y, color = factor(cols))) +
  geom_abline(intercept = -4.3, slope = 2.5, color = "black") +
  labs(x = "", y = "", color = "class") +
  ggtitle("hyperplane 1")

# Hyperplane 2: a vertical separating line halfway between the classes
p2 <- example %>% ggplot() +
  geom_point(aes(x, y, color = factor(cols))) +
  geom_vline(aes(xintercept = 2.5), color = "black") +
  labs(x = "", y = "", color = "class") +
  ggtitle("hyperplane 2")

p1 + p2 + plot_layout(guides = "collect")
```
Hyperplane 2 seems to be more robust.
:::
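As a minimal sketch, the classical perceptron algorithm cycles through the misclassified points and nudges \((\vec \beta, \beta_0)\) toward each of them, which amounts to stochastic gradient steps on \(\sigma(\vec \beta, \beta_0)\) up to the normalization by \(\|\vec \beta\|\). The function name and parameters below are invented for illustration.
```{r, eval=FALSE}
# Minimal perceptron training loop (illustrative sketch)
perceptron <- function(X, y, lr = 1, max_epochs = 100) {
  beta <- rep(0, ncol(X)); beta0 <- 0
  for (epoch in 1:max_epochs) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      pred <- if (sum(beta * X[i, ]) + beta0 >= 0) 1 else -1
      if (pred != y[i]) {                  # update only on misclassified points
        beta  <- beta  + lr * y[i] * X[i, ]
        beta0 <- beta0 + lr * y[i]
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break               # stop once every point is classified correctly
  }
  list(beta = beta, beta0 = beta0)
}

# Applied to the toy data from the plot above
X <- as.matrix(example[, c("x", "y")])
fit <- perceptron(X, example$cols)
fit$beta; fit$beta0
```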
### SVMs
See next chapter.