# (PART\*) Models {-}
# Linear Regression
We start our demonstrations with a standard regression model estimated via maximum
likelihood or a least squares loss. Also included are examples using QR
decomposition and the normal equations. This can serve as an entry point for those
starting out in the wider world of computational statistics, as maximum
likelihood is the fundamental estimation approach in most of applied statistics
and is also a key aspect of the Bayesian approach. Least squares loss is not
confined to the standard regression setting either, but is widely used in more
predictive/'algorithmic' approaches, e.g. in machine learning and elsewhere. You can find a [Python version of the code in the supplemental section](#python-linreg).
## Data Setup
We will simulate some data so that the true parameter values are known, which makes the results easier to check and manipulate.
```{r lm-setup}
library(tidyverse)
set.seed(123) # ensures replication
# predictors and target
N = 100 # sample size
k = 2 # number of desired predictors
X = matrix(rnorm(N * k), ncol = k)
y = -.5 + .2*X[, 1] + .1*X[, 2] + rnorm(N, sd = .5) # increasing N will get estimated values closer to these
dfXy = data.frame(X, y)
```
## Functions
A maximum likelihood approach.
```{r lm_ml}
lm_ml <- function(par, X, y) {
  # par: parameters to be estimated
  # X:   predictor matrix with intercept column
  # y:   target

  # setup
  beta   = par[-1]    # coefficients
  sigma2 = par[1]     # error variance
  sigma  = sqrt(sigma2)
  N      = nrow(X)

  # linear predictor
  LP = X %*% beta     # linear predictor
  mu = LP             # identity link in the glm sense

  # calculate likelihood
  L = dnorm(y, mean = mu, sd = sigma, log = TRUE)  # log likelihood
  # L = -.5*N*log(sigma2) - .5*(1/sigma2)*crossprod(y - mu)  # alternate log likelihood form

  -sum(L)  # optim by default is minimization, and we want to maximize the likelihood
           # (see also fnscale in optim.control)
}
```
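As a quick sanity check (an addition here, not part of the original demo), we can evaluate the function at roughly the true values used in the simulation; it should return a finite negative log likelihood, and larger values at parameters farther from the truth.
```{r lm_ml-check}
# illustrative check only: negative log likelihood at approximately the
# true parameter values (sigma2 = .5^2, coefficients = -.5, .2, .1)
lm_ml(
  par = c(sigma2 = .25, intercept = -.5, b1 = .2, b2 = .1),
  X   = cbind(1, X),   # add the intercept column just for this check
  y   = y
)
```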
An approach via least squares loss function.
```{r lm_ls}
lm_ls <- function(par, X, y) {
  # par: parameters to be estimated
  # X:   predictor matrix with intercept column
  # y:   target

  # setup
  beta = par          # coefficients

  # linear predictor
  LP = X %*% beta     # linear predictor
  mu = LP             # identity link

  # calculate least squares loss function
  crossprod(y - mu)   # sum of squared errors, returned as the objective value
}
```
## Estimation
Setup for use with <span class="func" style = "">optim</span>.
```{r lm-model-matrix}
X = cbind(1, X)
```
Initial values. Note that we'd normally want to handle `sigma2` differently, since
it's bounded below by zero, but we'll ignore that for this demonstration (a sketch of
one alternative follows the estimation below). Also, `sigma2` is not needed for the
LS approach, as the sum of squared errors itself serves as the objective function.
```{r lm-est}
init = c(1, rep(0, ncol(X)))
names(init) = c('sigma2', 'intercept', 'b1', 'b2')
fit_ML = optim(
par = init,
fn = lm_ml,
X = X,
y = y,
control = list(reltol = 1e-8)
)
fit_LS = optim(
par = init[-1],
fn = lm_ls,
X = X,
y = y,
control = list(reltol = 1e-8)
)
pars_ML = fit_ML$par
pars_LS = c(sigma2 = fit_LS$value / (N-k-1), fit_LS$par) # calculate sigma2 and add
```
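As noted above, `sigma2` is bounded below by zero. A minimal sketch of one alternative, not part of the original demo, is to optimize `log(sigma2)` instead and back-transform afterward, which keeps the variance positive without constraints.
```{r lm-est-log-sigma}
# hypothetical variant: estimate log(sigma2) so the variance stays positive
lm_ml_log <- function(par, X, y) {
  sigma = sqrt(exp(par[1]))   # back-transform to the standard deviation
  beta  = par[-1]
  mu    = X %*% beta

  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

init_log = c(log_sigma2 = 0, intercept = 0, b1 = 0, b2 = 0)

fit_ML_log = optim(
  par = init_log,
  fn  = lm_ml_log,
  X   = X,
  y   = y,
  control = list(reltol = 1e-8)
)

c(sigma2 = exp(unname(fit_ML_log$par[1])), fit_ML_log$par[-1])
```
The estimates should essentially match `pars_ML` from the unconstrained version.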
## Comparison
Compare to `lm`, which uses QR decomposition.
```{r lm}
fit_lm = lm(y ~ ., dfXy)
```
An example of the QR approach (not run); a quick runnable check follows the chunk.
```{r QR}
# QRX = qr(X)
# Q = qr.Q(QRX)
# R = qr.R(QRX)
# Bhat = solve(R) %*% crossprod(Q, y)
# alternate: qr.coef(QRX, y)
```
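To actually run the QR route, a quick check (an add-on here, using the same `X` and `y` as above) can compare the result to `coef(fit_lm)`.
```{r QR-check}
# illustrative check of the QR approach sketched above
QRX  = qr(X)
Q    = qr.Q(QRX)
R    = qr.R(QRX)

Bhat = solve(R, crossprod(Q, y))   # back-solve R %*% Bhat = Q'y

rbind(QR = drop(Bhat), lm = coef(fit_lm))
```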
```{r lm-compare, echo=FALSE}
rbind(
fit_ML = pars_ML,
fit_LS = pars_LS,
fit_lm = c(summary(fit_lm)$sigma^2, coef(fit_lm))
) %>%
kable_df()
```
The slight difference in the sigma estimate comes from dividing by N (maximum
likelihood) versus N - k - 1 (the traditional least squares approach). The difference
diminishes with increasing N, as both tend toward the square of whatever `sd` you
specify when creating the `y` target above.
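To see the divisor difference directly, here is a quick illustrative computation (an addition here, not in the original) using the ML coefficients from above.
```{r lm-sigma-divisors}
# residuals based on the ML coefficient estimates
res_ml = y - X %*% pars_ML[-1]

c(
  ml_divisor       = crossprod(res_ml) / N,           # divide by N (maximum likelihood)
  unbiased_divisor = crossprod(res_ml) / (N - k - 1)  # divide by N - k - 1 (as in lm)
)
```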
Compare to <span class="func" style = "">glm</span>, which by default assumes a Gaussian family with identity link,
so the results match those of `lm`.
```{r glm-compare}
fit_glm = glm(y ~ ., data = dfXy)
summary(fit_glm)
```
Via normal equations.
```{r lm-normal}
coefs = solve(t(X) %*% X) %*% t(X) %*% y # coefficients
```
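An equivalent and slightly more efficient way to write this uses `crossprod`, so the transpose isn't formed explicitly (a minor variation on the chunk above).
```{r lm-normal-crossprod}
# same normal equations, solved as (X'X) b = X'y
coefs_cp = solve(crossprod(X), crossprod(X, y))

cbind(coefs, coefs_cp)
```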
Compare.
```{r lm-compare2, results='hold'}
sqrt(crossprod(y - X %*% coefs) / (N - k - 1))
summary(fit_lm)$sigma
sqrt(fit_glm$deviance / fit_glm$df.residual)
c(sqrt(pars_ML[1]), sqrt(pars_LS[1]))
# rerun by adding 3-4 zeros to the N
```
## Python
The above is available as a Python demo in the [supplemental section](#python-linreg).
## Source
Original code available at https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/standard_lm.R