forked from KimmoVehkalahti/IODS-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter6.Rmd
320 lines (265 loc) · 11.4 KB
/
chapter6.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
## Chapter 6: Analysis of longitudinal data
### Data wrangling
The data wrangling for Exercise 6 is performed by the
``data/meet_and_repeat.R`` script in my github repository, as
specified in the instructions. It can be found
[here](https://github.com/tatuylonen/IODS-project/blob/master/data/meet_and_repeat.R).
### Analysis
Let's start by loading some libraries we might use.
```{r}
library(ggplot2)
library(GGally)
library(corrplot)
library(dplyr)
library(tidyr)
library(lme4)
# Set default figure size
knitr::opts_chunk$set(fig.width=12, fig.height=8)
```
Let's load the data sets created by the data wrangling script
(we'll do the factor conversion later).
```{r}
RATS <- read.csv("data/meet_RATS.csv", row.name=1)
RATSL <- read.csv("data/meet_RATSL.csv", row.name=1)
BPRS <- read.csv("data/meet_BPRS.csv", row.name=1)
BPRSL <- read.csv("data/meet_BPRSL.csv", row.name=1)
```
#### (1) Implement the analyses of Chapter 8 of MABS using the RATS data
Ok, so it looks like basically we need to perform the analyses we did
on Datacamp, but with the datasets swapped.
```{r}
# First, convert ID and Group fields to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Let's look at the data
names(RATSL)
str(RATSL)
summary(RATSL)
glimpse(RATSL)
# Display the data graphically
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
```
We can see that rats in Group 1 had consistently lower weight than
those in Groups 2 and 3. However, there were only eight rats in Group
1, four in Group 2, and four in Group 3. These numbers are so low
that making any statistical analyses based on them are fraught with
uncertainty. That said, we are tasked with analyzing this data.
We can also see that in each group the rats are gaining weight as time passes.
Let's see how the picture changes if we standardize the data for each
data point (keeping in mind that the number of observations is too low
to do this with any accuracy).
```{r}
# Standardize Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight)) / sd(Weight)) %>%
ungroup()
str(RATSL)
# Plot again with standardized Weight
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
```
We can now see clearly that some rats are growing faster than the
others. Where the line goes down, it means that the rat is gaining
wait slower than the mean of the group, or might even be losing
weight.
Let's now plot the group means and their error bars. We should note
that it is impossible to calculate variance with any accuracy from
just four observations, and correspondingly the magnitude of the error
bars is only a very rough estimate that shouldn't be given much
statistical significance.
```{r}
# Add mean and standard error computations. Note that the n in standard error
# computation should be the number of samples used to compute that mean,
# not the total number of samples in the whole dataset for all Group/Time
# combinations (the Datacamp exercise is in error).
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight) / sqrt(length(Weight))) %>%
ungroup()
# Note: In view of the warning about regrouping above, verify that groups
# are correct inside summarize
dim(RATSL)
X <- RATSL %>% group_by(Group, Time) %>% summarize(cnt=length(Weight)) %>% ungroup()
X$cnt
# Yes, this matches what we see in the earlier plots - 8, 4, 4 rats in each
# group, for each timepoint. So looks like summarize() worked correctly
# despite the weird warning.
# Let's glimpse at the data with timepoint means and standard errors.
# We leave aside the question whether standard error is reliable with
# just 4 or 8 observations in each grouping... (it ISN'T very
# reliable with so few observations)
glimpse(RATSS)
# Plot the mean profiles for each Group, with error bars
ggplot(RATSS,
aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8,0.8)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
```
Let's then check for utliers using a boxplot. (We should note, however, that
the low number of observations makes any outlier detection very unreliable.)
```{r}
# Quick check for outliers using a boxplot
RATSB <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
glimpse(RATSB)
ggplot(RATSB, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), weeks 1-8")
```
We can see that there is a possible outlier in Group 2. However,
given that there are only four rats in this group (which is already
too little data), any variance estimate (and thus standard error and
the quartiles) is highly suspect, and thus we cannot really justify
removing any individuals from the data. They might not be outliers
at all if we had a bit more data, and being already data-deprived,
we would be even more so with only three rats.
Let's then see if there is a statistically significant difference
between the groups. We'll use the dataset we created with
the mean for each group, but for the reasons above we did not do outlier
elimination. There is a compication here, as the t test only works between
exactly two groups. For the purposes of this analysis, we'll lumk groups
2 and 3 together and compare them againt Group 1. Let's add an auxiliary
field for this.
```{r}
# Create a helper variable indicating whether the observation is in Group 1
RATSB$isGroup1 <- RATSB$Group == "1"
# Perform a t-test between being in group 1 and not being in group 1
t.test(mean ~ isGroup1, data=RATSB, var.equal=TRUE)
# Add a baseline from Time 1
RATSB$baseline <- RATS$WD1
# Fit a linear model with the mean as the response
fit <- lm(mean ~ baseline + isGroup1, data=RATSB)
fit
# Compute an analysis of variance table for the fitted model with anova()
anova(fit)
```
We can see that the data suggests the baseline (WD1) to be a very
significant predictor of the group mean weight and being in Group 1
being significant at 95% level. However, even ignoring the
unsuitability of the hypothesis testing method testing method for our
case (the observations are not independent), we had so few data points
that the means are inaccurate, and this significance prediction was
based on the inaccurately computed means. This adds to uncertainty
beyond what was considered by hypothesis testing, and we probably
should not consider being in Group 1 as significant given the baseline
weight.
#### (2) Implement the analyses of Chapter 9 of MABS uising the BPRS data
Let's then look at analyzing the BPRS (brief psychiatric rating scale)
data. We first convert certain fields to factors, look at the data,
and plot the data.
In analyzing this data, we must first note that **the subject field
does not uniquely identify the person**. Instead, the data is
structured so that the ``subject`` field only identifies a subject
within each treatment regime (logically one cannot administer the two
treatments to be compared to the same person simultaneously in this
kind of experiment). Also, the Datacamp exercises say the dataset
contains data for 40 male subjects, but there are only 20 distinct
values in the ``subject`` field. For this purpose, I added a
``uniqueSubject`` field in the data wrangling phase to uniquely
identify each person, as such unique identification will be needed for
mixed effects modeling as a random factor. Our analysis will use this
field instead of ``subject`` to identify the person, because
``subject`` does not uniquely identify the individual!
```{r}
# First, convert the treatment and subject fields to factors
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# This field was added in data wrangling to uniquely identify each subject,
# as discussed above and in the comments in the data wrangling script.
BPRSL$uniqueSubject <- factor(BPRSL$uniqueSubject)
# Let's look at the data
names(BPRSL)
str(BPRSL)
summary(BPRSL)
glimpse(BPRSL)
ggplot(BPRSL, aes(x=week, y=bprs, linetype=uniqueSubject)) +
geom_line() +
scale_linetype_manual(values=rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS", limits=c(min(BPRSL$bprs), max(BPRSL$bprs))) +
theme(legend.position="top")
```
Visually, it looks like there is a downward trend with treatment 1.
The situation looks less clear for treatment 2.
Let's then try a linear regression.
```{r}
# Create a linear regression model
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRSL)
# Print a summary of the model
summary(BPRS_reg)
```
In this model, it looks like the improvement with time is
statistically significant, but this model does not show statistical
significance for the treatment regime. However, we should note that
the observations are not independent as we have repeated observations
from the same individuals, so the assumptions of the significance test
were violated and it cannot really be relied upon here.
Let's then try to construct a mixed model for the same variables,
using ``uniqueSubject`` as a random effect.
```{r}
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | uniqueSubject), data=BPRSL,
REML=FALSE)
# Print the summary of the model
summary(BPRS_ref)
```
Let's then construct the model so that we treat both treatment and
``uniqueSubject`` as random effects.
```{r}
# Create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | uniqueSubject),
data=BPRSL, REML=FALSE)
# Print a summary of the model
summary(BPRS_ref1)
# Perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
```
It would seem that **the second model fits the data better than the
first model at the 95% significance level** (p=0.026).
Let's plot again the original BPRS values.
```{r}
ggplot(BPRSL, aes(x=week, y=bprs, group=uniqueSubject)) +
geom_line(aes(linetype=treatment)) +
scale_x_continuous(name="Week") +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="BPRS") +
theme(legend.position="top")
```
Let's then plot the fit using the second model (``BPRS_ref1``) for comparison.
```{r}
# Compute the fitted values and add them as a column in RATSL
Fitted <- fitted(BPRS_ref1, BPRSL)
# Add a new column for the fitted data
BPRSL <- mutate(BPRSL, Fitted=Fitted)
# Draw a plot of BPRSL with the observed bprs values for each subject
ggplot(BPRSL, aes(x=week, y=Fitted, group=uniqueSubject)) +
geom_line(aes(linetype=treatment)) +
scale_x_continuous(name="Week") +
facet_grid(. ~ treatment, labeller=label_both) +
scale_y_continuous(name="Fitted BPRS") +
theme(legend.position="top")
```
I have to say that using a non-unique subject idenfier in the BPRS
dataset was a rather devious twist in the exercise.