-
Notifications
You must be signed in to change notification settings - Fork 2
/
Source Code.Rmd
428 lines (245 loc) · 21.6 KB
/
Source Code.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
---
title: "Differential Gene Expression Analysis"
author: "Yogindra Raghav, Nityam Rathi, Matt Eckelmeyer, Kyle Coleman"
date: "November 19, 2018"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**Goals**
1. Identify which genes are differentially expressed after dexamethasone treatment.
NOTE: The threshold for classifying a gene as "differentially expressed" will be determined based on the values we obtain after tidying the data. The subset tabulation that includes differentially expressed genes will be the first result of our statistical analysis. The term "signal intensity" refers to how strongly a gene is being expressed at a given time.
2. Construct biplot and volcano plot that will allow us to more easily visualize the genes that are classified as differentially expressed at each of the two time points.
3. Construct statistical prediction model (linear regression) using a random subset of the data. We aim to give the model quantitative information to see how accurate predictions from the model are. Residuals will be plotted and the summary statistics of our regression modeling will be the second result of our statistical analysis.
**Analysis**
Loading packages:
```{r,message=FALSE, warning = FALSE }
library(mdsr) # for data tidying and analysis
library(readr) # for reading in files
```
Loading datasets into variables:
```{r, results = 'hide', message=FALSE}
hour_0_replicate_1 = read_delim("/Users/MECKELMEYER/Desktop/GSM154262.txt",delim = "\t", skip = 9) # load in tab delimited file and skip first 9 lines since they contain statistical information about Agilent Microarray machine that is irrelevant for our analysis.
hour_0_replicate_2 = read_delim("/Users/MECKELMEYER/Desktop/GSM154263.txt",delim = "\t", skip = 9)
hour_0_replicate_3 = read_delim("/Users/MECKELMEYER/Desktop/GSM154264.txt",delim = "\t", skip = 9)
hour_6_replicate_1 = read_delim("/Users/MECKELMEYER/Desktop/GSM154277.txt",delim = "\t", skip = 9)
hour_6_replicate_2 = read_delim("/Users/MECKELMEYER/Desktop/GSM154278.txt",delim = "\t", skip = 9)
hour_6_replicate_3 = read_delim("/Users/MECKELMEYER/Desktop/GSM154279.txt",delim = "\t", skip = 9)
```
Selecting for relevant columns:
```{r}
# GeneName will be used for joining tables of biological replicates.
# gNetSignal gives signal difference between signal intensity and background intensity, therefore giving true normalized measure of intensity.
# gIsWellAboveBG is for genes whose intensity signal is 2.6 standard deviations from average signal intensity.
hour_0_replicate_1 = hour_0_replicate_1 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
hour_0_replicate_2 = hour_0_replicate_2 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
hour_0_replicate_3 = hour_0_replicate_3 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
hour_6_replicate_1 = hour_6_replicate_1 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
hour_6_replicate_2 = hour_6_replicate_2 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
hour_6_replicate_3 = hour_6_replicate_3 %>% select(ProbeName,GeneName, gNetSignal, gIsWellAboveBG)
```
Getting rid of "placeholder" gene names:
```{r}
# Certain signal intensities observed by the instrument could not be associated with any known gene. For these genes, the gene name column was given a placeholder name of the molecular probe ID (ProbeName) that was used to target the gene.
# First we must figure out what "placeholder" values could be possible for genes by looking at the ProbeName column. ProbeName entries are standardized and will have the same possibilities between all data files. For this reason, we will go ahead and use the hour_0_replicate_1 table to find the prefixes of all possible ProbeNames that could be put into the GeneName column.
ProbeName = hour_0_replicate_1$ProbeName
substr(ProbeName, start = 1, stop =2) %>% unique()
# These are possible substrings. Upon further analysis of the "Br" prefix, we realized the "Br" prefix in ProbeName refers to a control sample called "BrightCorner" which we need to get rid of from our data set (assuming it is contained at some point in the GeneName column). Another control GeneName by convention is "NegativeControl" which we will also filter out.
# Let's filter out control entries:
hour_0_replicate_1 = hour_0_replicate_1 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl")
hour_0_replicate_2 = hour_0_replicate_2 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl")
hour_0_replicate_3 = hour_0_replicate_3 %>% filter(GeneName!="BrightCorner"& GeneName!="NegativeControl")
hour_6_replicate_1 = hour_6_replicate_1 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl")
hour_6_replicate_2 = hour_6_replicate_2 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl")
hour_6_replicate_3 = hour_6_replicate_3 %>% filter(GeneName!="BrightCorner" & GeneName!="NegativeControl")
# Let's filter out possible probe entries that are being used as placeholder gene names:
hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "(-")
hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "(+")
hour_0_replicate_1 = hour_0_replicate_1 %>% filter(substr(GeneName, 1,2) != "A_")
hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "(-")
hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "(+")
hour_0_replicate_2 = hour_0_replicate_2 %>% filter(substr(GeneName, 1,2) != "A_")
hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "(-")
hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "(+")
hour_0_replicate_3 = hour_0_replicate_3 %>% filter(substr(GeneName, 1,2) != "A_")
hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "(-")
hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "(+")
hour_6_replicate_1 = hour_6_replicate_1 %>% filter(substr(GeneName, 1,2) != "A_")
hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "(-")
hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "(+")
hour_6_replicate_2 = hour_6_replicate_2 %>% filter(substr(GeneName, 1,2) != "A_")
hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "(-")
hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "(+")
hour_6_replicate_3 = hour_6_replicate_3 %>% filter(substr(GeneName, 1,2) != "A_")
```
Averaging Identical Gene Names:
```{r}
# There are rows that contain identical GeneName entries. It is important for us to average the gNetSignal found for these genes so that we do not consider the same gene as a different data point.
hour_0_replicate_1 = hour_0_replicate_1 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG)) # for every gene name this averages gNetSignal and gIsWellAboveBG
nrow(hour_0_replicate_1)
hour_0_replicate_2 = hour_0_replicate_2 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG))
nrow(hour_0_replicate_2)
hour_0_replicate_3 = hour_0_replicate_3 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG))
nrow(hour_0_replicate_3)
hour_6_replicate_1 = hour_6_replicate_1 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG))
nrow(hour_6_replicate_1)
hour_6_replicate_2 = hour_6_replicate_2 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG))
nrow(hour_6_replicate_2)
hour_6_replicate_3 = hour_6_replicate_3 %>% group_by(GeneName) %>% summarise(gNetSignal = mean(gNetSignal), gIsWellAboveBG = mean(gIsWellAboveBG))
nrow(hour_6_replicate_3)
# First 5 GeneName entries after summarise are alpha-numeric strings that are placeholders. As you can see, this data really is messy.
hour_0_replicate_1 = tail(hour_0_replicate_1, -5) # get rid of first five entries.
hour_0_replicate_2 = tail(hour_0_replicate_2, -5)
hour_0_replicate_3 = tail(hour_0_replicate_3, -5)
hour_6_replicate_1 = tail(hour_6_replicate_1, -5)
hour_6_replicate_2 = tail(hour_6_replicate_2, -5)
hour_6_replicate_3 = tail(hour_6_replicate_3, -5)
```
Joining biological replicates:
```{r}
# gNetSignal is normalized value that takes into account changes in machine background intensity for microarrays. Based on this, we can compare gNetSignal values between different runs on the same machine. These were done with biological replicates. For this reason, we are going to average gNetSignal values across the three replicates for each gene.
hour_0_isoform_a = left_join(hour_0_replicate_1, hour_0_replicate_2, by = "GeneName") # we can left_join since we have confirmed with anti_join (not included) and nrow() that the tables have the same genes.
hour_0_isoform_a = hour_0_isoform_a %>% left_join(., hour_0_replicate_3, by = "GeneName") # join it once more
head(hour_0_isoform_a)
hour_0_isoform_a = hour_0_isoform_a %>% group_by(GeneName) %>% summarise(gNetSignal = (gNetSignal.x+gNetSignal.y+gNetSignal)/3, gIsWellAboveBG = (gIsWellAboveBG.x+ gIsWellAboveBG.y + gIsWellAboveBG)/3) # Average the gNetSignal and gIsWellAboveBg values between the 3 biological replicates
head(hour_0_isoform_a)
hour_6_isoform_a = left_join(hour_6_replicate_1, hour_6_replicate_2, by = "GeneName")
hour_6_isoform_a = hour_6_isoform_a %>% left_join(., hour_6_replicate_3, by = "GeneName")
hour_6_isoform_a = hour_6_isoform_a %>% group_by(GeneName) %>% summarise(gNetSignal = (gNetSignal.x+gNetSignal.y+gNetSignal)/3, gIsWellAboveBG = (gIsWellAboveBG.x+ gIsWellAboveBG.y + gIsWellAboveBG)/3)
head(hour_6_isoform_a)
```
Merging both time points into one table and renaming columns:
```{r}
hour_0_hour_6_isoform_a_combined = hour_0_isoform_a %>% left_join(hour_6_isoform_a, by = "GeneName") # join the two tables using "GeneName"
head(hour_0_hour_6_isoform_a_combined)
colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gNetSignal.x"] <- "Hour_0_gNetSignal" # change column names
colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gNetSignal.y"] <- "Hour_6_gNetSignal"
colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gIsWellAboveBG.x"] <- "Hour_0_gIsWellAboveBG"
colnames(hour_0_hour_6_isoform_a_combined)[colnames(hour_0_hour_6_isoform_a_combined) == "gIsWellAboveBG.y"] <- "Hour_6_gIsWellAboveBG"
head(hour_0_hour_6_isoform_a_combined) # show new column names
```
Log Transform gNetSignal values:
```{r}
max(hour_0_hour_6_isoform_a_combined$Hour_0_gNetSignal)
min(hour_0_hour_6_isoform_a_combined$Hour_0_gNetSignal)
max(hour_0_hour_6_isoform_a_combined$Hour_6_gNetSignal)
min(hour_0_hour_6_isoform_a_combined$Hour_6_gNetSignal)
# since spread of values is really great for gNetSignal variable, we are going to log transform these columns.
hour_0_hour_6_isoform_a_combined= hour_0_hour_6_isoform_a_combined %>% mutate(log2_Hour_0_gNetSignal = log2(Hour_0_gNetSignal), log2_Hour_6_gNetSignal = log2(Hour_6_gNetSignal)) # log transform these values
head(hour_0_hour_6_isoform_a_combined)
```
Calculate Signal Difference:
```{r}
# Making a column that contains the difference between the log2 values. We are going to subtract the values from the log normalized values at each time point since subtracting logs is equivalent to the fold change. For example, if log2(later time point) - log2(earlier time point) = 1, then this corresponds to a 2-fold increase in gene expression between the early and late stages.
hour_0_hour_6_isoform_a_combined= hour_0_hour_6_isoform_a_combined %>% mutate(difference = log2_Hour_6_gNetSignal - log2_Hour_0_gNetSignal)
head(hour_0_hour_6_isoform_a_combined)
```
Plotting Density:
```{r}
# Plotting density to show spread of log transformed gNetSignals as well as differences.
plot(density(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal))
plot(density(hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal))
plot(density(hour_0_hour_6_isoform_a_combined$difference)) # As seen in this density plot containing data for both hour 0 and hour 6, the data appears more normalized, indicating that our normalization was successful to at least a reasonable extent.
```
BiPlot for hour 0 and hour 6 data expression levels:
```{r}
#Plot of hour 0 vs. hour 6 gNetSignals. All points that show a 1.5 fold increase between samples are highlighted in red. We deem these points differentially expressed.
plot(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal, hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col=ifelse(hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal >= 1.5, "red", ifelse(hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal >= 1.5, "red", "black")))
abline(a=0, b=1, col= "purple")
```
This plot shows hour 0 vs hour 6 expression levels for all genes in our data set. We have highlighted "differentially expressed" genes in red. The criteria for being differentially expressed is a fold-difference of 1.5 or more between hour 0 and hour 6 expression levels. Based on this criteria, we ended up with two distinct clusters of differentially expressed genes, those that have significantly higher expression in hour 6 and those that have significantly higher expression in hour 0.
```{r}
#Here we subsetted the original data set to obtain a subset of only differentially expressed genes (genes that show a 1.5 fold increase between samples).
hour_0_hour_6_isoform_a_combined_diff_expressed = hour_0_hour_6_isoform_a_combined[ which (hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal >= 1.5 | hour_0_hour_6_isoform_a_combined$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined$log2_Hour_0_gNetSignal >= 1.5), ]
hour_0_hour_6_isoform_a_combined_diff_expressed
```
```{r}
#This creates a subset of differentially expressed genes which show at least a 1.5 fold increase in expression at hour 6 compared to hour 0 (positively differentially expressed).
hour_0_hour_6_isoform_a_combined_pos_diff_expressed = hour_0_hour_6_isoform_a_combined_diff_expressed[ which (hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_6_gNetSignal/hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_0_gNetSignal >= 1.5), ]
```
```{r}
#This creates a subset of differentially expressed genes which show at least a 1.5 fold increase in expression at hour 0 compared to hour 6 (negatively differentially expressed).
hour_0_hour_6_isoform_a_combined_neg_diff_expressed = hour_0_hour_6_isoform_a_combined_diff_expressed[ which (hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_0_gNetSignal/hour_0_hour_6_isoform_a_combined_diff_expressed$log2_Hour_6_gNetSignal >= 1.5), ]
```
Statistical Prediction Model:
```{r}
#Creating the training and testing data sets for all differentially expressed genes.
set.seed(99)
train3 <- hour_0_hour_6_isoform_a_combined_diff_expressed %>% sample_frac(size = 0.5)
test3 <- hour_0_hour_6_isoform_a_combined_diff_expressed %>% setdiff(train3)
```
```{r}
#Creating the training and testing data sets for positively differentially expressed genes.
set.seed(99)
train1 <- hour_0_hour_6_isoform_a_combined_pos_diff_expressed %>% sample_frac(size = 0.5)
test1 <- hour_0_hour_6_isoform_a_combined_pos_diff_expressed %>% setdiff(train1)
```
```{r}
#Creating the training and testing data sets for negatively differentially expressed genes.
set.seed(99)
train2 <- hour_0_hour_6_isoform_a_combined_neg_diff_expressed %>% sample_frac(size = 0.5)
test2 <- hour_0_hour_6_isoform_a_combined_neg_diff_expressed %>% setdiff(train2)
```
```{r}
#Combinding both training sets and testing sets for positive and negative differentially expressed genes.
train_total = rbind(train1, train2)
test_total = rbind(test1, test2)
```
```{r}
#Plotting the combined training data for all differentially expressed genes using hour 0 and hour 6 expression levels. We also plotted two regression lines based on the training data, one for the positive differentially expressed genes and one for the negative differentially expressed genes.
plot(train_total$log2_Hour_0_gNetSignal, train_total$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col= "red")
abline(lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal))
abline(lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal))
```
This plot shows the combined training data for all differentially expressed genes using hour 0 and hour 6 expression levels. The two plotted regression lines are based on the correlation between hour 6 and hour 0 expression levels from the training data. One of the regression lines is for the positive differentially expressed genes and the other is for the negative differentially expressed genes.
```{r}
#Plotting the combined testing data for all differentially expressed genes using hour 0 and hour 6 expression levels. We used the same regression lines from our training data analysis to show how closely these training set regression lines fit our testing data sets.
plot(test_total$log2_Hour_0_gNetSignal, test_total$log2_Hour_6_gNetSignal, xlab="Hour 0 expression", ylab="Hour 6 expression", col= "red")
abline(lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal))
abline(lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal))
```
This plot shows the combined testing data for all differentially expressed genes using hour 0 and hour 6 expression levels. The two plotted regression lines are the same lines established from out training data set. Even though these regression lines are not from our testing data set, they fit the testing data very well and appear to have low residuals. This is an indication that our regression lines from our training data are a good fit for positively and negatively differentially expressed genes.
```{r}
#Getting a summary of the linear model for the positively differentially expressed training data set to obtain the regression line coefficents.
lin_mod_1 = lm(train1$log2_Hour_6_gNetSignal~train1$log2_Hour_0_gNetSignal)
lin_mod_1_summary = summary(lin_mod_1)
lin_mod_1_summary$coefficients
```
```{r}
#Calculating the residuals for the positively differentially expressed test data set using the coefficients from the training data linear model.
test1$residual = test1$log2_Hour_6_gNetSignal - (1.5909547*test1$log2_Hour_0_gNetSignal + 0.7052012)
test1$residual
```
```{r}
#Getting a summary of the linear model for the negatively differentially expressed training data set to obtain the regression line coefficents.
lin_mod_2 = lm(train2$log2_Hour_6_gNetSignal~train2$log2_Hour_0_gNetSignal)
lin_mod_2_summary = summary(lin_mod_2)
lin_mod_2_summary$coefficients
```
```{r}
#Calculating the residuals for the negatively differentially expressed test data set using the coefficients from the training data linear model.
test2$residual = test2$log2_Hour_6_gNetSignal - (0.3069249*test2$log2_Hour_0_gNetSignal + 3.7224380)
test2$residual
```
```{r}
#Calculating SSE for the residuals from the combined differentially expressed genes to determine how well our test data fits our training linear model.
test_total = rbind(test1, test2)
SSE = sum((test_total$residual)^2)
SSE
```
**Appendix**
*Group Member's Contribution*
All group members were greatly involved in devising and planning this project, which involved: reading the scientific literature related to the data sets utilized in this study, identifying key points of analysis, and formulating a plan. Nityam Rathi and Yogi Raghav initially mined through the raw datasets (GSE6711) and determined which data sets to use to fulfill our goal. Nityam and Yogi then read pertinent literature on microarray data to identify which variables in the massive data sets could be used to calculate differential gene expression and for subsequent statistical sets. After cutting the data sets to include only our variables of interest (gNetSignal and gIsWellAboveBG), Nityam and Yogi performed data wrangling and appropriate normalization techniques to create a final data set upon which the density plots are shown. This abridged dataset was then used for the differential gene expression analysis and statistal learning/regression models by Kyle and Matt. Nityam and Yogi provided necessary instruction and guidance for subsequent data visualization and statistical learning tests. Matt and Kyle subsetted the data into positively, negatively, and overall differentialluy expressed genes. Matt and Kyle created the expression biplot for hour 0 and hour 6 expression levels to visualize differentially expressed genes. Matt and Kyle also split the data into training and testing data sets for development of a statistical learning model. From this, Matt and Kyle developed a plot and two regression lines for the differentially expressed genes from the training data. They then plotted the testing data with the training data regression lines to see how well these lines fit the testing data. Matt and Kyle then calculated the residuals and SSE for the testing observations relative to the respective training regression lines and found a relatively low SSE, indicating that the model was a good fit.
*Files Used for Group Project*
*GSM154262.txt
*GSM154263.txt
*GSM154264.txt
*GSM154277.txt
*GSM154278.txt
*GSM154279.txt
*List of individuals' names and datasets used for part 1*
*Nityam Rathi- 'bad_drivers' data set
*Yogi Raghav- 'drug_use' data set
*Kyle Coleman- 'airline-saftey' data set
*Matt Eckelmeyer- 'drug_use' data set