25-matrices-in-R.qmd

# Matrices in R

* With matrices, variables for each observation are stored in a row, resulting in a matrix with as many columns as variables. 

* In statistics we refer to values represented in the rows of the matrix as the *covariates* and in machine learning we refer to them as the *features*.

## Case study: MNIST {#sec-mnist}

An example comes from handwritten digits. The first step in handling mail received in the post office is sorting letters by zip code:

![](https://rafalab.dfci.harvard.edu/dsbook-part-2/ml/img/how-to-write-a-address-on-an-envelope-how-to-write-the-address-on-an-envelope-write-address-on-envelope-india-finishedenvelope-x69070.png)

These are the digitized images:

```{r digit-images-example, echo=FALSE, cache=TRUE, message=FALSE}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
tmp <- lapply( c(1,4,5), function(i){
    expand.grid(Row = 1:28, Column = 1:28) |>  
      mutate(id = i, label = mnist$train$label[i],  
             value = unlist(mnist$train$images[i,])) 
})
tmp <- Reduce(rbind, tmp)
tmp |> ggplot(aes(Row, Column, fill = value)) + 
    geom_raster(show.legend = FALSE) + 
    scale_y_reverse() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_grid(.~label)
```

* The images are converted into $28 \times 28 = 784$ pixels

* For each pixel, we obtain a grey scale intensity between 0 (white) and 255 (black). 


```{r example-images, echo=FALSE}
tmp |> ggplot(aes(Row, Column, fill = value)) + 
    geom_point(pch = 21) + 
    scale_y_reverse() +
    scale_fill_gradient(low = "white", high = "black") +
    facet_grid(.~label)
```

For each digitized image, indexed by $i$, we are provided 784 variables and a categorical outcome, or *label*, representing which digit among $0, 1, 2, 3, 4, 5, 6, 7 , 8,$ and $9$ the image is representing. Let's load the data using the **dslabs** package:

```{r}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
``` 

In these cases, the pixel intensities are saved in a matrix:

```{r}
class(mnist$train$images)
```

* This matrix represents 60,000 observations, each a digit. 

* For example, let's take a smaller subset:

```{r}
x <- mnist$train$images[1:300,] 
y <- mnist$train$labels[1:300]
```


## Mathematical notation {#sec-matrix-notation}

* When working with linear algebra in R we have three types of objects:

1. scalars, 
2. vectors, and 
3. matrices. 

* A scalar is just one number, for example $a = 1$. 

* Vectors are like the numeric vectors we define in R: 

```{r}
length(x[20,])
```

* Each feature is represented by a columns of `x`. For example, the first column contains the values for the first pixel of all 1,000 images:

```{r}
length(x[,1])
```

In matrix algebra, we use lower case bold letters to represent a vector of features/predictors/covariates:

$$ 
\mathbf{x} =
\begin{pmatrix}
x_1\\\
x_2\\\
\vdots\\\
x_n
\end{pmatrix}
$$

Similarly, we can use math notation to represent different features by adding an index:

$$ 
\mathbf{x}_1 = \begin{pmatrix}
x_{1,1}\\
\vdots\\
x_{n,1}
\end{pmatrix} \mbox{ and }
\mathbf{x}_2 = \begin{pmatrix}
x_{1,2}\\
\vdots\\
x_{n,2}
\end{pmatrix}
$$

* A matrix can be defined as a series of vectors of the same size joined together as columns:

```{r}
x_1 <- 1:5
x_2 <- 6:10
cbind(x_1, x_2)
```

* Mathematically, we represent them with bold upper case letters:

$$ 
\mathbf{X} = ( \mathbf{x}_1 \, \mathbf{x}_2 ) = \begin{pmatrix}
x_{1,1}&x_{1,2}\\
\vdots&\vdots\\
x_{n,1}&x_{n,2}
\end{pmatrix}
$$


* We can use this notation to denote an arbitrary number of predictors with the following $n\times p$ matrix, for example, with $n = 300$, and $p=784$:

$$
\mathbf{X} = 
\begin{pmatrix}
  x_{1,1}&x_{1,2}&\dots & x_{1,p} \\
  x_{2,1}&x_{2,2}&\dots & x_{2,p} \\
  \vdots & \vdots & \ddots & \vdots & \\
  x_{n,1}&x_{n,2}&\dots & x_{n,p} 
  \end{pmatrix}
$$

* The _dimension_ of a matrix is often an important characteristic needed to assure that certain operations can be performed. 

```{r}
dim(x)
```

::: {.callout-warning}
## Notation for rows versus columns

Bold lower case letter are also commonly used to represent rows, rather than columns, of a matrix. This can be confusing because $\mathbf{x}_1$ can represent either the first row or the first column. One way to distinguish then is using notation similar to computer code by using the colon $:$ to represent _all_. So $\mathbf{X}_{1,:}$ is a row, the first row and all the columns, and  $\mathbf{X}_{:,1}$ is a column, the first column and all the rows. Another approach is to distinguish by the index, with $i$ used for rows and $j$ used for columns. So $\mathbf{x}_i$ is the $i$th row and $\mathbf{x}_j$ is the $j$th column. With this approach it is important to clarify which dimension, row or column, is being represented. We use this last one in the next chapter.
:::


## Converting vectors to a matrices

* Vectors can be thought of as $n\times 1$ matrices. However, in R, a vector does not have dimensions:

```{r}
dim(x_1)
```

* Vectors are not matrices in R. However, we can easily convert then to a matrix: 

```{r}
dim(matrix(x_1))
```

* It is also possible to change the dimensions of the resulting matrix. 

* To see an example of how can this be useful, consider wanting to visualize the the rows pixel intensities in their original $28\times28$ grid.


```{r}
my_vector <- 1:15
mat <- matrix(my_vector, 5, 3)
mat
```

* We can fill by row by using the `byrow` argument:

```{r}
mat_t <- matrix(my_vector, 3, 5, byrow = TRUE)
mat_t
```

## Motivating questions

To motivate the use of matrices in R, we will pose five questions/challenges related to the handwritten digits data:

1\. Do some digits require more ink to write than others? We will study the distribution of the total pixel darkness and how it varies by digits.

2\. Are some pixels uninformative? We will study the variation of each pixel across digits and remove predictors (columns) associated with pixels that don't change much and thus can't provide much information for classification.

3\. Can we remove smudges? We will first, look at the distribution of all pixel values. Then we will use this to pick a cutoff to define unwritten space. Then, set anything below that cutoff to 0.

4\. Binarize the data. First, we will look at the distribution of all pixel values. We will then use this to pick a cutoff to distinguish between writing and no writing. Then, we will convert all entries into either 1 or 0.

5\. Standardize the digits. We will scale each of the predictors in each entry to have the same average and standard deviation.


:::{.callout-warning}
The `matrix` function recycles values in the vector **without warning** if the product of columns and rows does not match the length of the vector:

```{r}
matrix(1:3, 2, 5)
```
:::

* To put the pixel intensities of our, say, 3rd entry, which is a `r mnist$train$label[3]` into grid, we can use:

```{r}
grid <- matrix(x[3,], 28, 28)
```

* Confirm with plots

```{r, eval=FALSE}
image(1:28, 1:28, grid)
image(1:28, 1:28, grid[, 28:1])
```


## Row and column summaries


* The function `rowSums` takes a matrix as input and computes the desired values:

```{r}
sums <- rowSums(x)
```

* We can also compute the averages with `rowMeans` 

```{r}
avg <- rowMeans(x)
```

* For the first task, look at boxplots:

```{r boxplot-of-digit-averages}
boxplot(avg ~ y)
```

* Is this expected?

* We can compute the column sums and averages using the function `colSums` and `colMeans`, respectively.

The **matrixStats** package adds functions that performs operations on each row or column very efficiently, including the functions `rowSds` and `colSds`.

## `apply`


* The `apply` function lets you apply any function, not just `sum` or `mean`, to the rows or columns of a matrix. 

* The first argument is the matrix, the second is the dimension, 1 for rows, 2 for columns, and the third is the function. 

* So, for example, `rowMeans` can be written as:

```{r}
avgs <- apply(x, 1, mean)
```

* for the sds:

```{r}
sds <- apply(x, 2, sd)
```

* The trade off for this flexibility is that these operations are not as fast as dedicated functions such as `rowMeans`.

## Filtering columns based on summaries

* We now turn to task 2: studying the variation of each pixel and removing columns associated with pixels that don't change much and thus do not inform the classification. 


```{r}
library(matrixStats)
sds <- colSds(x)
```

* A quick look at the distribution of these values shows that some pixels have very low entry to entry variability:

```{r}
hist(sds, breaks = 30, main = "SDs")
```


* To see where the low variance pixels are:

```{r}
image(1:28, 1:28, matrix(sds, 28, 28)[, 28:1])
```


* We see that there is little variation in the corners.

* We could remove features that have no variation since these can't help us predict. We can extract columns from matrices using the following code:

```{r, eval=FALSE}
x[ ,c(351,352)]
```

and rows like this:

```{r, eval=FALSE}
x[c(2,3),]
```

* We can also use logical indexes to determine which columns or rows to keep. 

```{r}
new_x <- x[,colSds(x) > 60]
dim(new_x)
```

* Only the columns for which the standard deviation is above 60 are kept, which removes over half the predictors.

* Here we add an important warning related to subsetting matrices: if you select one column or one row, the result is no longer a matrix but a vector.

```{r}
class(x[, 1])
dim(x[,1 ])
```

* However, we can preserve the matrix class by using the argument `drop=FALSE`:

```{r}
class(x[, 1, drop = FALSE])
dim(x[, 1, drop = FALSE])
```

## Indexing with matrices

* We can turn matrices into vectors.

```{r}
mat <- matrix(1:15, 5, 3)
as.vector(mat)
```

To see a histogram of all our predictor data, we can use:


```{r}
hist(as.vector(x), breaks = 30, main = "Pixel intensities")
```

* We notice a clear dichotomy which is explained as parts of the image with ink and parts without. If we think that values below, say, 50 are smudges, we can quickly make them zero using:

```{r, eval=FALSE}
new_x <- x
new_x[new_x < 50] <- 0
```

* To see what this does, we look at a smaller matrix:

```{r}
mat <- matrix(1:15, 5, 3)
mat[mat < 9] <- 0
mat
```

* We can also use logical operations with matrix logical:

```{r}
mat <- matrix(1:15, 5, 3)
mat[mat > 6 & mat < 12] <- 0
mat
```

## Binarizing the data

The histogram above seems to suggest that this data is mostly binary. A pixel either has ink or does not. Using what we have learned, we can binarize the data using just matrix operations:

```{r}
bin_x <- x
bin_x[bin_x < 255/2] <- 0 
bin_x[bin_x > 255/2] <- 1
```

* We can also convert to a matrix of logicals and then coerce to numbers like this:

```{r}
bin_X <- (x > 255/2)*1
image(matrix(x[3,], 28, 28)[,28:1])
image(matrix(bin_X[3,], 28, 28)[,28:1])

```

## Vectorization for matrices

In R, if we subtract a vector from a matrix, the first element of the vector is subtracted from the first row, the second element from the second row, and so on. Using mathematical notation, we would write it as follows:

$$
 \begin{pmatrix}
  X_{1,1}&\dots & X_{1,p} \\
  X_{2,1}&\dots & X_{2,p} \\
   & \vdots & \\
  X_{n,1}&\dots & X_{n,p} 
  \end{pmatrix}
-
\begin{pmatrix}
a_1\\\
a_2\\\
\vdots\\\
a_n
\end{pmatrix}
=
\begin{pmatrix}
  X_{1,1}-a_1&\dots & X_{1,p} -a_1\\
  X_{2,1}-a_2&\dots & X_{2,p} -a_2\\
   & \vdots & \\
  X_{n,1}-a_n&\dots & X_{n,p} -a_n
  \end{pmatrix}
$$

* The same holds true for other arithmetic operations. This implies that we can scale each row of a matrix like this:

```{r, eval=FALSE}
(x - rowMeans(x)) / rowSds(x)
```

* If you want to scale each column, be careful since this approach does not work for columns. To perform a similar operation, we convert the columns to rows using the transpose `t`, proceed as above, and then transpose back:

```{r, eval=FALSE}
t(t(X) - colMeans(X))
```

* We can also use a function called `sweep` that works similarly to `apply`. It takes each entry of a vector and subtracts it from the corresponding row or column.

```{r, eval=FALSE}
X_mean_0 <- sweep(x, 2, colMeans(x))
```

* The function `sweep` actually has another argument that lets you define the arithmetic operation. So to divide by the standard deviation, we do the following:

```{r}
x_mean_0 <- sweep(x, 2, colMeans(x))
x_standardized <- sweep(x_mean_0, 2, colSds(x), FUN = "/")
```

* In R, if you add, subtract, nultiple or divide two matrices, the operation is done elementwise. For example, if two matrices are stored in `x` and `y`, then

```{r, eval=FALSE}
x*y
```

does not result in matrix multiplication. Instead, the entry in row $i$ and column $j$ of this product is the product of the entryin row $i$ and column $j$ of `x` and `y`, respectively.

## Exercises

(@) Create a 100 by 10 matrix of randomly generated normal numbers. Put the result in `x`.

```{r}
x <- matrix(rnorm(100*10), nrow = 100, ncol = 10)
```


(@) Apply the three R functions that give you the dimension of `x`, the number of rows of `x`, and the number of columns of `x`, respectively.

```{r}
dim(x)
nrow(x)
ncol(x)
```


(@) Add the scalar 1 to row 1, the scalar 2 to row 2, and so on, to the matrix `x`.

```{r}
#x <- sweep(x, 1, 1:nrow(x), FUN = "+")
x <- x + 1:nrow(x)
```


(@) Add the scalar 1 to column 1, the scalar 2 to column 2, and so on, to the matrix `x`. Hint: use `sweep` with `FUN = "+"`.

```{r}
x <- sweep(x, 2, 1:ncol(x), FUN = "+")
```


(@) Compute the average of each row of `x`.

```{r}
rowMeans(x)
```


(@) Compute the average of each column of `x`.

```{r}
colMeans(x)
```

(@) For each digit in the MNIST training data, compute the proportion of pixels that are in a *grey area*, defined as values between 50 and 205. Make boxplot by digit class. Hint: use logical operators and `rowMeans`.

```{r}
ga <- rowMeans(mnist$train$images > 50 & mnist$train$images < 205)
boxplot(ga ~ mnist$train$labels)
```