-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path25-matrices-in-R.qmd
511 lines (346 loc) · 13.6 KB
/
25-matrices-in-R.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
# Matrices in R
* With matrices, variables for each observation are stored in a row, resulting in a matrix with as many columns as variables.
* In statistics we refer to values represented in the rows of the matrix as the *covariates* and in machine learning we refer to them as the *features*.
## Case study: MNIST {#sec-mnist}
An example comes from handwritten digits. The first step in handling mail received in the post office is sorting letters by zip code:
![](https://rafalab.dfci.harvard.edu/dsbook-part-2/ml/img/how-to-write-a-address-on-an-envelope-how-to-write-the-address-on-an-envelope-write-address-on-envelope-india-finishedenvelope-x69070.png)
These are the digitized images:
```{r digit-images-example, echo=FALSE, cache=TRUE, message=FALSE}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
tmp <- lapply( c(1,4,5), function(i){
expand.grid(Row = 1:28, Column = 1:28) |>
mutate(id = i, label = mnist$train$label[i],
value = unlist(mnist$train$images[i,]))
})
tmp <- Reduce(rbind, tmp)
tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_raster(show.legend = FALSE) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label)
```
* The images are converted into $28 \times 28 = 784$ pixels
* For each pixel, we obtain a grey scale intensity between 0 (white) and 255 (black).
```{r example-images, echo=FALSE}
tmp |> ggplot(aes(Row, Column, fill = value)) +
geom_point(pch = 21) +
scale_y_reverse() +
scale_fill_gradient(low = "white", high = "black") +
facet_grid(.~label)
```
For each digitized image, indexed by $i$, we are provided 784 variables and a categorical outcome, or *label*, representing which digit among $0, 1, 2, 3, 4, 5, 6, 7 , 8,$ and $9$ the image is representing. Let's load the data using the **dslabs** package:
```{r}
library(tidyverse)
library(dslabs)
if (!exists("mnist")) mnist <- read_mnist()
```
In these cases, the pixel intensities are saved in a matrix:
```{r}
class(mnist$train$images)
```
* This matrix represents 60,000 observations, each a digit.
* For example, let's take a smaller subset:
```{r}
x <- mnist$train$images[1:300,]
y <- mnist$train$labels[1:300]
```
## Mathematical notation {#sec-matrix-notation}
* When working with linear algebra in R we have three types of objects:
1. scalars,
2. vectors, and
3. matrices.
* A scalar is just one number, for example $a = 1$.
* Vectors are like the numeric vectors we define in R:
```{r}
length(x[20,])
```
* Each feature is represented by a columns of `x`. For example, the first column contains the values for the first pixel of all 1,000 images:
```{r}
length(x[,1])
```
In matrix algebra, we use lower case bold letters to represent a vector of features/predictors/covariates:
$$
\mathbf{x} =
\begin{pmatrix}
x_1\\\
x_2\\\
\vdots\\\
x_n
\end{pmatrix}
$$
Similarly, we can use math notation to represent different features by adding an index:
$$
\mathbf{x}_1 = \begin{pmatrix}
x_{1,1}\\
\vdots\\
x_{n,1}
\end{pmatrix} \mbox{ and }
\mathbf{x}_2 = \begin{pmatrix}
x_{1,2}\\
\vdots\\
x_{n,2}
\end{pmatrix}
$$
* A matrix can be defined as a series of vectors of the same size joined together as columns:
```{r}
x_1 <- 1:5
x_2 <- 6:10
cbind(x_1, x_2)
```
* Mathematically, we represent them with bold upper case letters:
$$
\mathbf{X} = ( \mathbf{x}_1 \, \mathbf{x}_2 ) = \begin{pmatrix}
x_{1,1}&x_{1,2}\\
\vdots&\vdots\\
x_{n,1}&x_{n,2}
\end{pmatrix}
$$
* We can use this notation to denote an arbitrary number of predictors with the following $n\times p$ matrix, for example, with $n = 300$, and $p=784$:
$$
\mathbf{X} =
\begin{pmatrix}
x_{1,1}&x_{1,2}&\dots & x_{1,p} \\
x_{2,1}&x_{2,2}&\dots & x_{2,p} \\
\vdots & \vdots & \ddots & \vdots & \\
x_{n,1}&x_{n,2}&\dots & x_{n,p}
\end{pmatrix}
$$
* The _dimension_ of a matrix is often an important characteristic needed to assure that certain operations can be performed.
```{r}
dim(x)
```
::: {.callout-warning}
## Notation for rows versus columns
Bold lower case letter are also commonly used to represent rows, rather than columns, of a matrix. This can be confusing because $\mathbf{x}_1$ can represent either the first row or the first column. One way to distinguish then is using notation similar to computer code by using the colon $:$ to represent _all_. So $\mathbf{X}_{1,:}$ is a row, the first row and all the columns, and $\mathbf{X}_{:,1}$ is a column, the first column and all the rows. Another approach is to distinguish by the index, with $i$ used for rows and $j$ used for columns. So $\mathbf{x}_i$ is the $i$th row and $\mathbf{x}_j$ is the $j$th column. With this approach it is important to clarify which dimension, row or column, is being represented. We use this last one in the next chapter.
:::
## Converting vectors to a matrices
* Vectors can be thought of as $n\times 1$ matrices. However, in R, a vector does not have dimensions:
```{r}
dim(x_1)
```
* Vectors are not matrices in R. However, we can easily convert then to a matrix:
```{r}
dim(matrix(x_1))
```
* It is also possible to change the dimensions of the resulting matrix.
* To see an example of how can this be useful, consider wanting to visualize the the rows pixel intensities in their original $28\times28$ grid.
```{r}
my_vector <- 1:15
mat <- matrix(my_vector, 5, 3)
mat
```
* We can fill by row by using the `byrow` argument:
```{r}
mat_t <- matrix(my_vector, 3, 5, byrow = TRUE)
mat_t
```
## Motivating questions
To motivate the use of matrices in R, we will pose five questions/challenges related to the handwritten digits data:
1\. Do some digits require more ink to write than others? We will study the distribution of the total pixel darkness and how it varies by digits.
2\. Are some pixels uninformative? We will study the variation of each pixel across digits and remove predictors (columns) associated with pixels that don't change much and thus can't provide much information for classification.
3\. Can we remove smudges? We will first, look at the distribution of all pixel values. Then we will use this to pick a cutoff to define unwritten space. Then, set anything below that cutoff to 0.
4\. Binarize the data. First, we will look at the distribution of all pixel values. We will then use this to pick a cutoff to distinguish between writing and no writing. Then, we will convert all entries into either 1 or 0.
5\. Standardize the digits. We will scale each of the predictors in each entry to have the same average and standard deviation.
:::{.callout-warning}
The `matrix` function recycles values in the vector **without warning** if the product of columns and rows does not match the length of the vector:
```{r}
matrix(1:3, 2, 5)
```
:::
* To put the pixel intensities of our, say, 3rd entry, which is a `r mnist$train$label[3]` into grid, we can use:
```{r}
grid <- matrix(x[3,], 28, 28)
```
* Confirm with plots
```{r, eval=FALSE}
image(1:28, 1:28, grid)
image(1:28, 1:28, grid[, 28:1])
```
## Row and column summaries
* The function `rowSums` takes a matrix as input and computes the desired values:
```{r}
sums <- rowSums(x)
```
* We can also compute the averages with `rowMeans`
```{r}
avg <- rowMeans(x)
```
* For the first task, look at boxplots:
```{r boxplot-of-digit-averages}
boxplot(avg ~ y)
```
* Is this expected?
* We can compute the column sums and averages using the function `colSums` and `colMeans`, respectively.
The **matrixStats** package adds functions that performs operations on each row or column very efficiently, including the functions `rowSds` and `colSds`.
## `apply`
* The `apply` function lets you apply any function, not just `sum` or `mean`, to the rows or columns of a matrix.
* The first argument is the matrix, the second is the dimension, 1 for rows, 2 for columns, and the third is the function.
* So, for example, `rowMeans` can be written as:
```{r}
avgs <- apply(x, 1, mean)
```
* for the sds:
```{r}
sds <- apply(x, 2, sd)
```
* The trade off for this flexibility is that these operations are not as fast as dedicated functions such as `rowMeans`.
## Filtering columns based on summaries
* We now turn to task 2: studying the variation of each pixel and removing columns associated with pixels that don't change much and thus do not inform the classification.
```{r}
library(matrixStats)
sds <- colSds(x)
```
* A quick look at the distribution of these values shows that some pixels have very low entry to entry variability:
```{r}
hist(sds, breaks = 30, main = "SDs")
```
* To see where the low variance pixels are:
```{r}
image(1:28, 1:28, matrix(sds, 28, 28)[, 28:1])
```
* We see that there is little variation in the corners.
* We could remove features that have no variation since these can't help us predict. We can extract columns from matrices using the following code:
```{r, eval=FALSE}
x[ ,c(351,352)]
```
and rows like this:
```{r, eval=FALSE}
x[c(2,3),]
```
* We can also use logical indexes to determine which columns or rows to keep.
```{r}
new_x <- x[,colSds(x) > 60]
dim(new_x)
```
* Only the columns for which the standard deviation is above 60 are kept, which removes over half the predictors.
* Here we add an important warning related to subsetting matrices: if you select one column or one row, the result is no longer a matrix but a vector.
```{r}
class(x[, 1])
dim(x[,1 ])
```
* However, we can preserve the matrix class by using the argument `drop=FALSE`:
```{r}
class(x[, 1, drop = FALSE])
dim(x[, 1, drop = FALSE])
```
## Indexing with matrices
* We can turn matrices into vectors.
```{r}
mat <- matrix(1:15, 5, 3)
as.vector(mat)
```
To see a histogram of all our predictor data, we can use:
```{r}
hist(as.vector(x), breaks = 30, main = "Pixel intensities")
```
* We notice a clear dichotomy which is explained as parts of the image with ink and parts without. If we think that values below, say, 50 are smudges, we can quickly make them zero using:
```{r, eval=FALSE}
new_x <- x
new_x[new_x < 50] <- 0
```
* To see what this does, we look at a smaller matrix:
```{r}
mat <- matrix(1:15, 5, 3)
mat[mat < 9] <- 0
mat
```
* We can also use logical operations with matrix logical:
```{r}
mat <- matrix(1:15, 5, 3)
mat[mat > 6 & mat < 12] <- 0
mat
```
## Binarizing the data
The histogram above seems to suggest that this data is mostly binary. A pixel either has ink or does not. Using what we have learned, we can binarize the data using just matrix operations:
```{r}
bin_x <- x
bin_x[bin_x < 255/2] <- 0
bin_x[bin_x > 255/2] <- 1
```
* We can also convert to a matrix of logicals and then coerce to numbers like this:
```{r}
bin_X <- (x > 255/2)*1
image(matrix(x[3,], 28, 28)[,28:1])
image(matrix(bin_X[3,], 28, 28)[,28:1])
```
## Vectorization for matrices
In R, if we subtract a vector from a matrix, the first element of the vector is subtracted from the first row, the second element from the second row, and so on. Using mathematical notation, we would write it as follows:
$$
\begin{pmatrix}
X_{1,1}&\dots & X_{1,p} \\
X_{2,1}&\dots & X_{2,p} \\
& \vdots & \\
X_{n,1}&\dots & X_{n,p}
\end{pmatrix}
-
\begin{pmatrix}
a_1\\\
a_2\\\
\vdots\\\
a_n
\end{pmatrix}
=
\begin{pmatrix}
X_{1,1}-a_1&\dots & X_{1,p} -a_1\\
X_{2,1}-a_2&\dots & X_{2,p} -a_2\\
& \vdots & \\
X_{n,1}-a_n&\dots & X_{n,p} -a_n
\end{pmatrix}
$$
* The same holds true for other arithmetic operations. This implies that we can scale each row of a matrix like this:
```{r, eval=FALSE}
(x - rowMeans(x)) / rowSds(x)
```
* If you want to scale each column, be careful since this approach does not work for columns. To perform a similar operation, we convert the columns to rows using the transpose `t`, proceed as above, and then transpose back:
```{r, eval=FALSE}
t(t(X) - colMeans(X))
```
* We can also use a function called `sweep` that works similarly to `apply`. It takes each entry of a vector and subtracts it from the corresponding row or column.
```{r, eval=FALSE}
X_mean_0 <- sweep(x, 2, colMeans(x))
```
* The function `sweep` actually has another argument that lets you define the arithmetic operation. So to divide by the standard deviation, we do the following:
```{r}
x_mean_0 <- sweep(x, 2, colMeans(x))
x_standardized <- sweep(x_mean_0, 2, colSds(x), FUN = "/")
```
* In R, if you add, subtract, nultiple or divide two matrices, the operation is done elementwise. For example, if two matrices are stored in `x` and `y`, then
```{r, eval=FALSE}
x*y
```
does not result in matrix multiplication. Instead, the entry in row $i$ and column $j$ of this product is the product of the entryin row $i$ and column $j$ of `x` and `y`, respectively.
## Exercises
(@) Create a 100 by 10 matrix of randomly generated normal numbers. Put the result in `x`.
```{r}
x <- matrix(rnorm(100*10), nrow = 100, ncol = 10)
```
(@) Apply the three R functions that give you the dimension of `x`, the number of rows of `x`, and the number of columns of `x`, respectively.
```{r}
dim(x)
nrow(x)
ncol(x)
```
(@) Add the scalar 1 to row 1, the scalar 2 to row 2, and so on, to the matrix `x`.
```{r}
#x <- sweep(x, 1, 1:nrow(x), FUN = "+")
x <- x + 1:nrow(x)
```
(@) Add the scalar 1 to column 1, the scalar 2 to column 2, and so on, to the matrix `x`. Hint: use `sweep` with `FUN = "+"`.
```{r}
x <- sweep(x, 2, 1:ncol(x), FUN = "+")
```
(@) Compute the average of each row of `x`.
```{r}
rowMeans(x)
```
(@) Compute the average of each column of `x`.
```{r}
colMeans(x)
```
(@) For each digit in the MNIST training data, compute the proportion of pixels that are in a *grey area*, defined as values between 50 and 205. Make boxplot by digit class. Hint: use logical operators and `rowMeans`.
```{r}
ga <- rowMeans(mnist$train$images > 50 & mnist$train$images < 205)
boxplot(ga ~ mnist$train$labels)
```