anova #194

Merged Dec 10, 2024 (5 commits)

1 change: 1 addition & 0 deletions config.yaml
@@ -84,6 +84,7 @@ episodes:
- r-intro-markdown-version-2.Rmd
- github-and-you.Rmd
- plot-with-tidyplot.Rmd
- anova.Rmd

# Information for Learners
learners:
176 changes: 176 additions & 0 deletions episodes/anova.Rmd
@@ -0,0 +1,176 @@
---
title: 'anova'
teaching: 10
exercises: 2
---

:::::::::::::::::::::::::::::::::::::: questions

- How do you perform an ANOVA?
- What even is ANOVA?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Explain how to run an analysis of variance on models
- Explain the requisites for running an ANOVA
- Explain what an ANOVA is
::::::::::::::::::::::::::::::::::::::::::::::::

## Introduction

Studying the length of penguin flippers, we notice that
the average length differs between
three species of penguins:

```{r}
library(tidyverse)
library(palmerpenguins)
penguins %>%
  group_by(species) %>%
  summarise(mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE))
```

If we only had two groups, we would use a t-test to determine
if there is a significant difference between them. But
here we have three. And when we have three or more groups, we use
the ANOVA method, or rather the `aov()` function:

```{r}
aov(flipper_length_mm ~ species, data = penguins) %>%
  summary()
```
We are testing if there is a difference in `flipper_length_mm`
when we explain it as a function of `species`. Or, in other words,
we analyse how much of the variation in flipper length comes
from variation between the groups, and how much comes
from variation within the groups. If the variation between the groups
is large enough compared to the variation within the groups, we conclude that there is
a significant difference between the groups.
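
The split into between-group and within-group variation can be reproduced by hand. The following is a minimal sketch, assuming the `palmerpenguins` data used above; the variable names (`ss_between`, `ss_within` and so on) are chosen for illustration only and are not part of `aov()`:

```{r}
library(palmerpenguins)

# Keep only rows where flipper length is recorded
d <- penguins[!is.na(penguins$flipper_length_mm), ]

grand_mean  <- mean(d$flipper_length_mm)
group_means <- tapply(d$flipper_length_mm, d$species, mean)
group_sizes <- tapply(d$flipper_length_mm, d$species, length)

# Variation between groups: how far each group mean is from the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Variation within groups: how far each penguin is from its own group mean
ss_within <- sum((d$flipper_length_mm - group_means[as.character(d$species)])^2)

df_between <- length(group_means) - 1
df_within  <- nrow(d) - length(group_means)

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

The F value computed this way should match the one reported by `summary()` on the `aov()` model.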

In this case, the p-value is very small, and we reject the
null hypothesis that the mean flipper length is the same in
all three species, and accept the
alternative hypothesis that at least one species differs from the others.


## Are we allowed to run an ANOVA?

There are some conditions that need to be fulfilled.

1. The observations must be independent.

In this example we can safely assume that the flipper length
of one penguin is not influenced by the flipper length of another
penguin.

2. The residuals have to be normally distributed

Typically we also test if the data is normally distributed. Let us
look at both:

Is the data normally distributed?

```{r}
penguins %>%
  ggplot(aes(x = flipper_length_mm)) +
  geom_histogram() +
  facet_wrap(~species)
```
That looks reasonable.

And the residuals?

```{r}
aov(flipper_length_mm ~ species, data = penguins)$residuals %>%
  hist(.)
```
That looks fine. If we want a more specific test, those exist,
but they will not be covered here.
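
A visual complement to the histogram is a normal Q-Q plot of the residuals. This is only a sketch using base R graphics; the names `model` and `res` are introduced here for illustration:

```{r}
library(palmerpenguins)

model <- aov(flipper_length_mm ~ species, data = penguins)
res <- residuals(model)

# Points close to the reference line suggest approximately normal residuals
qqnorm(res)
qqline(res)
```

Systematic curvature away from the line, especially in the tails, would be a reason to look more closely at the normality assumption.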

3. Homoscedasticity

A weird name; it simply means that the variances in the different
groups are more or less the same. We can calculate the variances
and compare:

```{r}
penguins %>%
  group_by(species) %>%
  summarise(variance = var(flipper_length_mm, na.rm = TRUE))
```

Are the variances too different? As a rule of thumb, we have a
problem if the largest variance is more than 4-5 times larger
than the smallest variance. This is OK for this example.

If the sizes of the three groups are very different,
even smaller differences in variance can be problematic.
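
The rule of thumb can be checked directly by computing the ratio between the largest and the smallest group variance. A minimal sketch, with `variances` and `variance_ratio` as illustrative names:

```{r}
library(palmerpenguins)

# Variance of flipper length within each species, ignoring missing values
variances <- tapply(penguins$flipper_length_mm, penguins$species, var, na.rm = TRUE)

# Ratio of the largest to the smallest group variance
variance_ratio <- max(variances) / min(variances)
variance_ratio

# Group sizes, to judge how unbalanced the groups are
table(penguins$species)
```

A ratio well below 4 is in line with the conclusion above that the variances are similar enough for this example.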

:::: solution

## More specific methods exist

There are probably more than three tests for homoscedasticity; here
are three:

Fligner-Killeen test:

```{r}
fligner.test(flipper_length_mm ~ species, data = penguins)
```

Bartlett's test:

```{r}
bartlett.test(flipper_length_mm ~ species, data = penguins)
```
Levene's test:
```{r}
library(car)
leveneTest(flipper_length_mm ~ species, data = penguins)
```
For all three tests: if the p-value is below 0.05, there is a significant
difference in the variances, and the assumption of homoscedasticity
is violated, so we should not use the ANOVA method as it is. In this case we are on the safe side.

::::

### But where is the difference?

Yes, there is a difference between the average flipper length
of the three species. But that might arise from one of the species
having extremely long flippers, and there not being much
difference between the two other species.

So we do a post hoc analysis to confirm where the differences
are.

The most common is the Tukey HSD test, HSD standing for
"Honest Significant Differences":

```{r}
aov_model <- aov(flipper_length_mm ~ species, data = penguins)
TukeyHSD(aov_model)
```
We get the estimate of each pairwise difference, the
lower and upper 95% confidence limits for those differences,
and a p-value adjusted for multiple comparisons.
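
The result can also be turned into a data frame, which makes it easy to flag the pairs whose confidence interval does not contain zero. A sketch under the assumption that the model is the one fitted above; `tukey_table` and `excludes_zero` are names introduced here:

```{r}
library(palmerpenguins)

aov_model <- aov(flipper_length_mm ~ species, data = penguins)
tukey <- TukeyHSD(aov_model)

# One row per pair of species, with the difference, interval limits and adjusted p-value
tukey_table <- as.data.frame(tukey$species)

# A pair differs significantly when its interval does not contain zero
tukey_table$excludes_zero <- tukey_table$lwr > 0 | tukey_table$upr < 0
tukey_table

# The intervals can also be drawn
plot(tukey)
```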


Alternatives:

- Normally distributed data and homogeneous variances:
  - Tukey's HSD (generally the best choice).
  - Bonferroni (if a very conservative approach is wanted).
- Non-homogeneous variances:
  - Games-Howell (most robust).
  - Wilcoxon (non-parametric).
- Focused on a control group:
  - Dunnett's test.
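
Two of these alternatives are available in base R. The sketch below runs pairwise t-tests and pairwise Wilcoxon tests, both with Bonferroni-adjusted p-values; the object names `bonf` and `wilc` are illustrative. Games-Howell and Dunnett's test require additional packages and are not shown here.

```{r}
library(palmerpenguins)

# Pairwise t-tests with Bonferroni-adjusted p-values (pooled standard deviation)
bonf <- pairwise.t.test(penguins$flipper_length_mm, penguins$species,
                        p.adjust.method = "bonferroni")
bonf

# Non-parametric counterpart: pairwise Wilcoxon rank-sum tests
wilc <- pairwise.wilcox.test(penguins$flipper_length_mm, penguins$species,
                             p.adjust.method = "bonferroni")
wilc
```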

::::::::::::::::::::::::::::::::::::: keypoints

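- Use `aov()` to test for differences in means between three or more groups
- Check independence, normality of residuals and homoscedasticity before trusting an ANOVA
- Use a post hoc test such as `TukeyHSD()` to find out which groups differ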

::::::::::::::::::::::::::::::::::::::::::::::::

22 changes: 1 addition & 21 deletions learners/data.md
@@ -1569,14 +1569,6 @@ _Dimensions:_ Rows: 178 Columns: 14

[Source](learners/data.md#wine_9)^9^

@misc{misc_wine_109,
author = {Aeberhard,Stefan and Forina,M.},
title = {{Wine}},
year = {1991},
howpublished = {UCI Machine Learning Repository},
note = {{DOI}: https://doi.org/10.24432/C5PC7J}
}

[Download](https://raw.githubusercontent.com/KUBDatalab/R-toolbox/main/episodes/data/wine.data)

:::: spoiler
@@ -1602,7 +1594,7 @@ _Dimensions:_ Rows: 178 Columns: 14

Absorbance is measured as the sum of absorbance-units at 420, 520 and 620 nm (blue, green and red light respectively, measuring the yellow, red, and blue colors of the wine.)

Hue is measured as absorbance at 420 nm divided by absorbance at 520 nm.
Hue is measured as absorbance at 420 nm divided by absorbance at 520 nm.

OD280/OD315 is measured as absorbance at 280 nm divided by absorbance at 315 nm.

@@ -1612,18 +1604,6 @@ OD280/OD315 is measured as absorbance at 280 nm divided by absorbance at 315 nm.

## References













<a id="rosner_1">1</a>: Rosner, Bernard A. Fundamentals of Biostatistics, 7/e, International Edition, 2011 ISBN: 9780538735896. https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9780538733496&token

There is also good material here: https://www.doc88.com/p-5925003681540.html