# Linear Classifiers
$\mathcal F = \{\text{all linear classifiers}\}$, $\mathcal X = R^d$, $\mathcal Y = \{-1,1\}$ or $\{1,\cdots,K\}$. We focus mainly on the binary case.
Let $\beta \in R^d$ and $\beta_0 \in R$. Define the hyperplane $\mathcal H_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 = 0\}$ and the half-spaces $\mathcal H^+_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 \geq 0\}$ and $\mathcal H^-_{\beta, \beta_0} = \{x\in R^d: <\beta, x> + \beta_0 < 0\}$. The associated linear classifier is
$$
\mathfrak f_{\beta, \beta_0}(x):=\begin{cases}1 & \text{if } <\beta,x> + \beta_0 \geq 0 ~(\Leftrightarrow x \in \mathcal H^+_{\beta,\beta_0})\\-1 & \text{if } <\beta,x> + \beta_0 < 0 ~(\Leftrightarrow x \in \mathcal H^-_{\beta,\beta_0})\end{cases}
$$
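As a small illustration, here is a minimal R sketch of this decision rule; the function name `linear_classifier` and the example vectors are made up for illustration, not taken from any package.
```{r, eval=FALSE}
# Minimal sketch of the linear classifier f_{beta, beta_0} (illustrative names)
linear_classifier <- function(x, beta, beta0) {
  if (sum(beta * x) + beta0 >= 0) 1 else -1   # sign of <beta, x> + beta_0
}

linear_classifier(c(1, 2), beta = c(0.5, -1), beta0 = 1)  # <beta,x> + beta_0 = -0.5, so returns -1
```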
- Question: How do we find a linear classifier from observations $(x_i, y_i)$, $i=1,\cdots,n$?
A: There are three main ways to tune $\beta, \beta_0$ from data: LDA (linear discriminant analysis) or QDA (quadratic discriminant analysis), logistic regression, and perceptrons/SVMs.
## LDA or QDA
Let $\mathcal Y = \{1, \cdots, K\}$ and \((X,Y) \sim \rho\), where \(P(Y=k) = \omega_k\) and \(X \mid Y=k \sim N(\mu_k, \Sigma_k)\) with conditional density \(\rho_{X|Y}(\cdot \mid Y=k)\). We then have:
$$
\begin{aligned}
\mathfrak f_{Bayes}(x) &= \underset{k=1,\cdots, K}{\operatorname{argmax}} P(Y=k | X=x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} \frac{\rho_{X|Y}(x | Y=k) \cdot \omega_k}{\rho_X(x)}\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} [\rho_{X|Y}(x | Y=k) \cdot \omega_k]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} [log\Big (\rho_{X|Y}(x | Y=k)\Big ) + log(\omega_k)]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmin}} [-log\Big (\frac{1}{(2\pi)^{d/2} (det(\Sigma_k))^{1/2}} \Big ) + \frac{1}{2} <\Sigma_k^{-1} (x-\mu_k), (x-\mu_k)> - log(\omega_k)]\\
&:= \underset{k=1,\cdots, K}{\operatorname{argmin}} \delta_k(x)
\end{aligned}
$$
Observation: \(\delta_k(x)\) is quadratic and convex in \(x\).
**QDA**: What if we only have \((x_i, y_i)\), \(i=1,\cdots,n\)? We use the observations to estimate \(\mu_k, \omega_k, \Sigma_k\), \(k=1,\cdots, K\).
:::{.example}
$$
\hat \mu_k := \frac{1}{N_k}\sum_{i ~s.t. ~y_i=k} x_i, \quad \text{where } N_k := \#\{i ~s.t. ~y_i=k\}
$$
$$
(\hat \Sigma_k)_{lm} := (\frac{1}{N_k} \sum_{i ~s.t. ~y_i=k} x_{il}x_{im} - \hat \mu_{kl} \hat \mu_{km}), \text{ where } l=1,\cdots,d, ~m=1,\cdots,d.
$$
$$
\hat \omega_k = \frac{N_k}{n}
$$
$$
\hat \delta_k(x) := \delta_k(x) \text{ with } \mu_k, \Sigma_k, \omega_k \text{ replaced by } \hat \mu_k, \hat \Sigma_k, \hat \omega_k
$$
:::
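As a hedged illustration, the following R sketch computes these plug-in estimates on simulated two-class Gaussian data and classifies a new point by $\underset{k}{\operatorname{argmin}} ~ \hat \delta_k(x)$. All object names are invented for this example; in practice one would typically call `MASS::qda` instead.
```{r, eval=FALSE}
set.seed(1)
# Simulated data: two Gaussian classes in R^2
n <- 200
y <- sample(1:2, n, replace = TRUE)
x <- t(sapply(y, function(k) rnorm(2, mean = c(0, 3)[k])))

# Plug-in estimates omega_k, mu_k, Sigma_k for each class k
# (cov() divides by N_k - 1 instead of N_k; the difference is negligible here)
qda_fit <- lapply(1:2, function(k) {
  xk <- x[y == k, , drop = FALSE]
  list(omega = nrow(xk) / n,
       mu    = colMeans(xk),
       Sigma = cov(xk))
})

# delta_k(x) up to the constant (d/2) log(2*pi), which is the same for every k
delta <- function(x0, p) {
  0.5 * log(det(p$Sigma)) +
    0.5 * t(x0 - p$mu) %*% solve(p$Sigma) %*% (x0 - p$mu) -
    log(p$omega)
}

# QDA prediction for a new point: argmin_k delta_k(x0)
x0 <- c(0.2, 2.5)
which.min(sapply(qda_fit, function(p) delta(x0, p)))
```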
**LDA**: What if we had assumed \(\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K = \Sigma\)?
$$
\begin{aligned}
\mathfrak f(x) &= \underset{k=1,\cdots, K}{\operatorname{argmin}} [\frac{1}{2} <\Sigma^{-1}x, x> - <\Sigma^{-1}x, \mu_k> + \frac{1}{2} <\Sigma^{-1}\mu_k, \mu_k> - log(\omega_k)]\\
&= \underset{k=1,\cdots, K}{\operatorname{argmin}} [-<\Sigma^{-1}x, \mu_k> + \frac{1}{2} <\Sigma^{-1}\mu_k, \mu_k> - log(\omega_k)]\\
&:= \underset{k=1,\cdots, K}{\operatorname{argmin}} l_k(x), ~(l_k(x) \text{ is linear in } x)
\end{aligned}
$$
The quadratic term $\frac{1}{2} <\Sigma^{-1}x, x>$ is the same for every $k$, which is why it drops out in the second line. We can estimate \(\mu_k, \omega_k\) by \(\hat \mu_k, \hat \omega_k\) as before, and estimate the common \(\Sigma\) from the full data set (e.g. by the pooled covariance of the class-centered observations).
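A minimal sketch in R, using the `MASS` package on the built-in `iris` data (`lda()` fits exactly this model with a pooled covariance estimate, and `qda()` fits the unpooled model of the previous section):
```{r, eval=FALSE}
library(MASS)  # provides lda() and qda()

# LDA: estimates omega_k, mu_k and a single pooled Sigma, then classifies
# each observation by the linear scores l_k(x)
fit_lda <- lda(Species ~ ., data = iris)
head(predict(fit_lda)$class)

# QDA: one covariance matrix Sigma_k per class (quadratic decision boundaries)
fit_qda <- qda(Species ~ ., data = iris)
head(predict(fit_qda)$class)
```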
## Logistic Regression
Let $\mathcal Y = \{1, \cdots, K\}$, \(\vec \beta_k \in R^d\), \(\beta_{0k} \in R\) for \(k=1,\cdots,K-1\), and suppose \((X,Y)\) satisfies \(P(Y=k | X=x) = \frac{exp(<x,\vec \beta_k> + \beta_{0k})}{1 + \sum_{l=1}^{K-1} exp(<x,\vec \beta_l> + \beta_{0l})}\) for \(k=1,\cdots,K-1\), and \(P(Y=K | X=x) = \frac{1}{1 + \sum_{l=1}^{K-1} exp(<x,\vec \beta_l> + \beta_{0l})}\). Let \(\varphi_k(x) := exp(<x,\vec \beta_k> + \beta_{0k})\) for \(k=1,\cdots,K-1\) and \(\varphi_K(x):=1\). Then we have:
$$
\begin{aligned}
\mathfrak f_{Bayes} (x) &= \underset{k=1,\cdots, K}{\operatorname{argmax}} P(Y=k | X=x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} \varphi_k(x)\\
&= \underset{k=1,\cdots, K}{\operatorname{argmax}} log(\varphi_k(x))
\end{aligned}
$$
What if we have only observed \((x_i, y_i)\), \(i=1,\cdots,n\)? We use the observations to estimate the parameters.
:::{.example #mle name="MLE"}
Given the data, find the best parameters (the ones maximizing the likelihood of the observations), i.e.
$$
\{(\vec \beta_k^*, \beta_{0k}^*)\}_{k=1,\cdots,K-1} = \underset{\{(\vec \beta_k, \beta_{0k})\}_{k=1,\cdots,K-1}}{\operatorname{argmax}} \prod_{i=1}^n P(Y = y_i | X = x_i)
$$
:::
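As a hedged sketch, this maximum likelihood problem is solved numerically; in R, `glm(..., family = binomial)` handles the binary case and `nnet::multinom` the multiclass case (illustrated on the built-in `iris` data):
```{r, eval=FALSE}
# Binary case (K = 2): logistic regression fitted by maximum likelihood
iris2 <- subset(iris, Species != "setosa")
iris2$Species <- droplevels(iris2$Species)
fit_bin <- glm(Species ~ Sepal.Length + Sepal.Width, data = iris2, family = binomial)
coef(fit_bin)  # beta_0 and beta

# Multiclass case (K = 3): nnet::multinom maximizes the same likelihood
library(nnet)
fit_multi <- multinom(Species ~ ., data = iris)
head(predict(fit_multi))
```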
## Perceptrons and SVMs
Let $\mathcal Y = \{-1, 1\}$ and let \((x_i, y_i)_{i=1,\cdots,n}\) be the observations. For parameters \((\vec \beta, \beta_0)\), recall the classifier
$$
\mathfrak f_{\vec \beta, \beta_0}(x):=\begin{cases}1 & <\vec \beta,x> + \beta_0 \geq 0 ~(\Leftrightarrow x \in \mathcal H^+_{\vec \beta,\beta_0})\\-1 & <\vec \beta,x> + \beta_0 < 0 ~(\Leftrightarrow x \in \mathcal H^-_{\vec \beta,\beta_0})\end{cases}
$$
and define the total distance of the misclassified points to the hyperplane,
$$
\sigma(\vec \beta, \beta_0) := \sum_{i \in \mathcal M_{\vec \beta, \beta_0}} dist(x_i, \mathcal H_{\vec \beta, \beta_0}),
$$
where \(\mathcal M_{\vec \beta, \beta_0} = \{i \text{ s.t. } \mathfrak f_{\vec \beta, \beta_0} (x_i) \neq y_i\}\) is the set of misclassified observations and \(dist(x_i, \mathcal H_{\vec \beta, \beta_0}) = \frac{|<\vec \beta, x_i> + \beta_0|}{\|\vec \beta\|}\).
### Perceptrons
Let \((\vec \beta^*, \beta_0^*) := \underset{(\vec \beta, \beta_{0})}{\operatorname{argmin}} \sigma(\vec \beta, \beta_0)\); the perceptron classifier is then \(\mathfrak f_{\vec \beta^*, \beta_0^*}\). The perceptron problem may have many solutions, but some separating hyperplanes appear more robust than others (a sketch of the classical perceptron algorithm follows the example below):
:::{.example #perceptron}
```{r message=FALSE, warning=FALSE, echo=FALSE}
library(tidyverse)
library(patchwork)

# Two linearly separable clusters of points (classes -1 and 1)
example <- data.frame(x = rep(1:4, each = 3),
                      y = rep(1:3, 4),
                      cols = rep(c(-1, 1), each = 6))

# Hyperplane 1: a tilted separating line passing close to both classes
p1 <- example %>% ggplot() +
  geom_point(aes(x, y, color = factor(cols))) +
  geom_abline(intercept = -4.3, slope = 2.5, color = "black") +
  labs(x = "", y = "", color = "class") +
  ggtitle("hyperplane 1")

# Hyperplane 2: a vertical separating line halfway between the classes
p2 <- example %>% ggplot() +
  geom_point(aes(x, y, color = factor(cols))) +
  geom_vline(aes(xintercept = 2.5), color = "black") +
  labs(x = "", y = "", color = "class") +
  ggtitle("hyperplane 2")

p1 + p2 + plot_layout(guides = "collect")
```
Hyperplane 2 seems to be more robust.
:::
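As a minimal sketch, the classical perceptron algorithm cycles through the misclassified points and nudges \((\vec \beta, \beta_0)\) toward each of them, which amounts to stochastic gradient steps on \(\sigma(\vec \beta, \beta_0)\) up to the normalization by \(\|\vec \beta\|\). The function name and parameters below are invented for illustration.
```{r, eval=FALSE}
# Minimal perceptron training loop (illustrative sketch)
perceptron <- function(X, y, lr = 1, max_epochs = 100) {
  beta <- rep(0, ncol(X)); beta0 <- 0
  for (epoch in 1:max_epochs) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      pred <- if (sum(beta * X[i, ]) + beta0 >= 0) 1 else -1
      if (pred != y[i]) {                  # update only on misclassified points
        beta  <- beta  + lr * y[i] * X[i, ]
        beta0 <- beta0 + lr * y[i]
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break               # stop once every point is classified correctly
  }
  list(beta = beta, beta0 = beta0)
}

# Applied to the toy data from the plot above
X <- as.matrix(example[, c("x", "y")])
fit <- perceptron(X, example$cols)
fit$beta; fit$beta0
```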
### SVMs
See next chapter.