-
Notifications
You must be signed in to change notification settings - Fork 4
/
Lecture06.Rmd
575 lines (416 loc) · 15 KB
/
Lecture06.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
---
title: "MBB 243 - Lecture 6"
author: "Ryan Morin"
date: "`r Sys.Date()`"
output:
powerpoint_presentation
---
```{r,eval=T,echo=F,message=F}
library(tidyverse)
```
## Learning Objectives
- Understand some of the basic applications of regular expressions in R
- Create sequences of integers using `:`
- Understand relationship between higher-level data structures
- Provide example use cases for data frames versus matrices
- Subset vectors and data structures using the `[` and `$` operator
- Subset elements/rows/columns using logical operations or vectors and reorder vectors
## Some new R functions
|function name|What it does|
|-|-|
|mean|Calculate the mean (average) of numbers|
|min|Return the smallest number in a vector|
|max|Return the largest number in a vector|
|sum|Add together all numbers in a vector|
|table|Count and summarize unique occurrences of each value in a vector|
|unlist|Convert a list into a vector, collapsing nested data if necessary|
|order|Return the indexes of a vector in the order of the values
|apply/lapply/sapply|Run the same function separately on each element in a list (or each row/column of a matrix or data frame)|
|grep/grepl/gsub|Use regex to find and/or substitute sequences based on pattern matching|
## Data structures in R
![](images/data_types_R.png)
## Review and level up: subsetting
- Accessing one element in a vector uses indexing or the names (for named vectors)
- Accessing/subsetting into a vector of any length (usually <= the original) can be done using a
- numeric vector of desired indexes
- character vector of desired names
- Can also use an equal length vector of logicals (TRUE and FALSE), where only indexes with TRUE in the corresponding position will be extracted
## Review and level up: subsetting
```{r,echo=T,eval=T}
things = c(c=333,d=4444,e=0,a=1,b=22)
names(things)
things[2]
# things["d"] #same as above
things[c(2,3)]
# things[c("d","e")] #same as above
some = c(FALSE,TRUE,FALSE,TRUE)
things[some] #same as things[c(2,4)]
```
## Enforcing a specific order
- The order of the indexes as specified forces the ordering of the extracted vector
- `order` provides a convenient way to sort data when used used with subsetting
```{r,echo=T,eval=T}
alphabet = c("d","c","b","a")
order(alphabet) # indexes based on alphabetical order
order(things) # indexes for ordered values
ao = order(names(things)) # alphabetical order of names
things[ao] # vector reordered based on names
```
## "Subsetting" with a longer vector
- You can extract the same element multiple times by specifying more than once in your vector
```{r}
short_vec = c(0,1,2)
long_vec = short_vec[c(1,2,3,3,2,1)]
long_vec #contains the elements from short_vec once for each time specified in short_vec
```
## The List
- Vectors in R must only contain values of one type (logical, numeric, character)
- Lists don't have this limitation and can even allow you to store vectors or other objects
```{r}
list_example <- list(FALSE, TRUE, 3.0, "four")
list_example
```
## List with names
```{r}
a_list <- list(title = "Numbers", numbers = 1:6, data = c(TRUE,FALSE))
# Here the : operator is used to make a range interpolating from 1 to 6
a_list
```
## Lists: reference by name or indexes
- `$` is used to refer to list elements by name
- Similar to how dictionaries work in Python (i.e. name/key:value)
- Numeric index can also be used with `[[]]`
```{r}
a_list$numbers #reference by name using $
a_list["numbers"] #gives you a list, probably not what you want
a_list[["numbers"]] #gives you a vector
a_list[[2]] #gives you that same vector
```
## From list to vector
- Sometimes you will need to convert a list or data frame row into a vector
- `unlist` facilitates this but use with caution because R will try to flatten lists and name nested elements for you
```{r}
a_vec = unlist(a_list)
a_vec[c(1:4)]
class(a_vec)
```
## From list to vector
- Sometimes you will need to convert a list or data frame row into a vector
- `unlist` facilitates this but use with caution because R will try to flatten lists and name nested elements for you
```{r}
a_vec = unlist(a_list)
a_vec
```
## Lists of numeric data
- Data "sets" may be structured conveniently in a list
- If everything is the same data type this can be a convenient way to represent data of variable sizes
```{r,echo=T,eval=T}
sample_1 = c(77,33,11)
sample_2 = c(1234,5,678,9)
sample_3 = c(11,1,0,3)
m = list("S1"=sample_1,
"S2"=sample_2,
"S3"=sample_3)
```
## For X do Y (but no looping!)
- What is the mean/sum/max value in every sample stored in our list?
- It's tempting to use a loop to perform mathematical or other operations on each nested data element
- R has better ways to do this
- For lists, we should always try to use `lapply` or `sapply`
- Syntax is identical, respectively return a list or vector
## Applying a function to a list
```{r,echo=T,eval=T}
sapply(m,mean) # calculate means, return vector
l = lapply(m,min) # calculate min, return list
l
unlist(l)
```
## How and why s/lapply?
- The sapply/lapply functions allow you to "apply" non-vectorized functions to data in lists or other structures
- They are vectorized, so automatically work on every element in your input and return an equal-length result
- Can be combined with your own custom functions
```{r,eval=F,echo=T}
# mean_vec = sapply(m,mean) is equivalent to this loop
mean_vec = c()
for(i in c(1:length(m))){ #annoying
this_mean = mean(m[[i]]) #ugly
mean_vec = c(mean_vec,this_mean) #inefficient
}
```
## Working with tabular data
- As you saw in the last few labs, you can use Python to work with tabular data stored in files
- Owing to vectorization and native data types for tabular data, R makes many common analysis tasks much more convenient
- Paradigm we will use for this course:
- Python/Command-line to manipulate data into a tabular form
- R to perform all subsequent analyses on tabular data
## The Matrix
![](images/matrix.png)
## The actual matrix
```{r}
matrix_example <- matrix(1:24, ncol=6, nrow=4)
matrix_example
```
```{r}
class(matrix_example)
typeof(matrix_example)
```
## Matrix as a neatly folded vector
- Notice the order the matrix is created, as if the vector is chopped up and lined up starting from the left
- The order of the elements is maintained if you trace from top to bottom, starting in the first column
```{r,eval=T,echo=T}
matrix(c("A","B","C","D","E","F"),nrow=3)
```
## Assignment and operations
```{r}
matrix_example[] = 0
# we can assign every position in a matrix to the same value!
matrix_example
```
## Assignment and operations
```{r}
matrix_example[,1] = 99
# or we can assign every position in a column
matrix_example
```
## Assignment and operations
```{r}
matrix_example[1,] = 3
# or we can assign every position in a row
matrix_example
```
## Assignment and operations
```{r}
matrix_example[] = matrix_example + 10
# or we can modify everything in a similar way
matrix_example
```
## Assignment and operations
```{r}
matrix_example[4,4] = 4
# or we can assign/reference individual positions
# using both row and column index
matrix_example
```
## Recycling in vector operations
- Reminder: vectorized operations treat operations involving vectors of unequal length vs two equal length vectors differently
```{r,echo=T,eval=T}
1 + c(1,1,1)
c(1,1,1) + c(1,1,1)
3 * c(1,2,3)
c(3,0,1) * c(1,2,3)
```
## Recycling in matrix operations
- Operations involving a matrix recycle along the first then second dimension
```{r,echo=T,eval=T}
tiny = matrix(c(0,0,0,0,0,0),nrow=3)
tiny
tiny + 2 #applied to every element in the matrix
```
## Recycling in matrix operations
```{r,echo=T,eval=T}
tiny + c(1,2)
# vector is used in the operation in order along the first dimension of the matrix (columns)
# vector is then recycled until the operation has been applid to the entire matrix
#Equivalent to
#tiny[1,1] + 1
#tiny[2,1] + 2
#tiny[3,1] + 1
#tiny[4,1] + 2
#etc
```
## Recycling in matrix operations
- Operations involving a matrix and a vector are tricky but somewhat sensible
```{r,echo=T,eval=T}
tiny
tiny + c(1,2,3)
```
## Broke: loops, Woke: apply
:::::::::::::: {.columns}
::: {.column}
- Common to process two-dimensional data structures such as matrices to produce one-dimensional results
- e.g. Summary statistic on every row or column
- Another place loops may be tempting, but we must resist the temptation
:::
::: {.column}
![](images/apply.jpeg)
:::
::::::::::::::
## Apply a function to either dimension
![](images/apply_1.png)
## Example: convert to percentages
- If each row is a group, what % of the total does each column represent in that row?
- e.g. 100 * 2 / (323 + 89 + 42)
```{r,eval=T,echo=T}
mini = matrix(c(98,2,0,323,89,42,2,1,6,1,1,1),
nrow=4,ncol=3,byrow = T)
mini
```
## Step 1: denominator (row sums)
- Usage: `apply(X, MARGIN, FUN, ...)`
- Not intuitive, but "margin" is just the dimension
- How I remember: `[dimension1,dimension2]`
```{r,eval=T,echo=T}
denominators = apply(mini,1,sum)
# equivalent shortcut: denominators = rowSums(mini)
# for columns, the other dimension, it would be: apply(mini,2,sum)
denominators
```
## Step 2: numerator
```{r,eval=T,echo=T}
mini
percents = 100 * mini/denominators
# the rest can be done on the whole matrix at once thanks to recycling
# What do you expect this to do?
# mini[c(1,2),]
# And this?
# mini[order(denominators),]
```
## Subsetting rows or columns
```{r,eval=T,echo=T}
mini[c(1,2),] #regular subsetting using a vector
denominators
order(denominators)
# sepecify row index in the desired order
mini[order(denominators),]
```
## The data frame
:::::::::::::: {.columns}
::: {.column}
- A data frame is a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column
- A data frame can be envisioned as the equivalent of a spreadsheet
:::
::: {.column}
![](images/data_frame_R.png)
:::
::::::::::::::
## characteristics of a data frame
- Column names should be non-empty and R will assign (ugly) names when necessary
- Row names and column names _must_ be unique
- The data stored in a data frame can be of numeric, factor or character type
- Each column can have different data type
- Each column must contain same number of elements
## Loading tabular data
```{r load_df,echo=T,message=F}
myfile="data/GSE125966_GOYA_mini.csv"
gene_expression = read.csv(myfile,row.names = 1)
class(gene_expression)
#calling rownames/colnames will return a vector containing the current row/column names
# Calling these function in an assignment context
# will re-assign names to the rows/columns, replacing existing names
colnames(gene_expression) = c("sample1","sample2")
rownames(gene_expression)[c(1:5)] #first 5 names
```
## Rownames and Colnames
- A matrix can also have row names and column names, which can be set the same way as data frames
- Matrix row and column names do not have to be unique (potentially very problematic!)
```{r,echo=T,eval=T}
mat <- matrix(1:4, ncol=2)
rownames(mat)=c("ROW","ROW")
colnames(mat)=c("COL1","COL2")
mat["ROW",] #only first match is returned
```
## Exploring the Data Frame
- There are a number of functions that work well with data frames
- The `nrow()` and `ncol()` functions count the number of rows and columns in the object
```{r}
nrow(gene_expression)
ncol(gene_expression)
```
## Peeking with head and tail
- The `head()` and `tail()` functions are useful for extracting the first or last few rows in a data frame. These are especially useful for larger data frames.
```{r}
head(gene_expression, 2)
tail(gene_expression, 2)
some_expression = head(gene_expression)
```
## Another data frame to explore
```{r,echo=T,eval=T}
geno_file = "data/Morin_genotypes.txt"
genotype_df = read.csv(geno_file,sep="\t")
# read without specifying a column for row names
# default rows will be consecutive numbers represented as strings
rownames(genotype_df)[c(1:10)]
head(genotype_df,3)
```
## Extracting rows/columns
- Rows and columns are indexed numerically
- The two dimensions are always row,column
- The most basic operation you can do is extracting/referencing a column
- Option 1: use the `$` operator and column name
- Option 2/3: use the column name/index:
`object_name[,column_index]`
## Extracting rows/columns
```{r}
some_expression$sample1
#extract column using $ operator and column name
some_expression[,1]
#extract the first column
some_expression[,"sample1"]
#extract the column named sample1
```
## Rows vs Columns
- An extracted column from a data frame becomes a vector
- Sadly, the same is NOT true of extracted rows
- A single extracted row remains a data frame and must be converted
```{r}
some_expression[1,] #extract the first row
class(some_expression[1,])
row_vec = unlist(some_expression[1,])
class(row_vec)
```
## Summarizing the contents
Another useful function is called `summary()` which will display statistics for numeric columns and tabulate values in text columns.
```{r}
summary(gene_expression)
```
## Regex in R
- R was designed for doing mathematical and statistical analyses
- Working with text/character/string type data is much easier in Python and clunkier in R
- Three common use cases for string operations:
- string replacement using pattern or literal matching
- string splitting using pattern or literal matching
- searching for the existence of a pattern
## Flavours of grep in R
|Original|Desired|Function|Comment|
|-|-|-|-|
|GeneID:1000|1000|sub/gsub|Replace or drop first/all matching portions of string|
|GeneID:1000|TRUE/FALSE|grepl|check for existence of a match|
|ACATGC|ACATGC|grep|Obtain all matching values|
## Gsub example
- What if we wanted to drop the redundant "GeneID:" part of our row names?
```{r,echo=T,eval=T}
head(some_expression,3)
original_rownames = rownames(some_expression)
new_rownames = gsub("GeneID:","",original_rownames)
rownames(some_expression) = new_rownames
head(some_expression,3)
```
## Grep example
- What if we wanted to find all rows that had a genotype that was homozygous A (i.e. AA)?
- The `table` function tabulates the unique values in the resulting vector
```{r,echo=T,eval=T}
table(grepl("AA",genotype_df$genotype))
# How many are homozygous of any of the four nucleotides?
table(grepl("AA|GG|CC|TT",genotype_df$genotype))
```
## Subsetting data frame with logicals
- Revisiting the previous example, we'll calculate the % of homozygous genotypes on autosomes
- To accomplish this we need to ignore chromosomes MT, X and Y, negating grepl using `!`
```{r}
tail(table(genotype_df$chromosome))
autosome_rows = ! grepl("X|Y|MT",genotype_df$chromosome)
head(autosome_rows)
```
## Final steps: % homozygous
```{r}
table(autosome_rows)
autosome_genotypes = genotype_df[autosome_rows,]
homozygous_total = sum(grepl("AA|GG|CC|TT",autosome_genotypes$genotype))
all = length(autosome_genotypes$genotype)
percent_homo = 100 * homozygous_total/all
percent_homo
```
## More Resources
- http://swcarpentry.github.io/r-novice-gapminder/04-data-structures-part1
- http://swcarpentry.github.io/r-novice-gapminder/06-data-subsetting
- http://swcarpentry.github.io/r-novice-gapminder/09-vectorization