more_exercises.Rmd

---
title: "Extra Data, Use-cases and Exercises"
author: "Mark Dunning"
date: '`r format(Sys.time(), "Last modified: %d %b %Y")`'
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
    css: stylesheets/styles.css
editor_options: 
  chunk_output_type: inline
---
<img src="images/logo-sm.png" style="position:absolute;top:40px;right:10px;" width="200" />


# The COSMIC Dataset

We will illustrate some of the main functions from tidyverse with example data from [COSMIC](https://cancer.sanger.ac.uk/cosmic). An example file is freely available, but you will need to register for full-size file. 

`download.file` is an R *function* for downloading a file into your *working directory* (the directory where R is currently looking to open files from, and save files to).

```{r eval=FALSE}
## create a more_data folder

dir.create("more_data",showWarnings = FALSE)

download.file("https://github.com/sheffield-bioinformatics-core/r-online/raw/master/more_data/CosmicMutantExport.tsv",destfile = "more_data/CosmicMutantExport.tsv")
```


## Exercise1

<div class="exercise">

- Which function from `readr` would you use the read the file `CosmicMutantExport.tsv` into R?
- Import the file `CosmicMutantExport.tsv` into R
- How many rows and columns are in the data frame you create

</div>

```{r echo=FALSE}
library(readr)
cosmic <- readr::read_tsv("more_data//CosmicMutantExport.tsv")
cosmic
```

## Column names containing spaces

When using `select` we can use the column name without quotation marks to print that column. Column names that contain a space present a bit of an issue for R:-

```{r eval=FALSE}
library(dplyr)
select(cosmic, Gene name)
```

This can be avoiding by putting the desired column name inside the backtick symbols.

```{r}
library(dplyr)
select(cosmic, `Gene name`)
```


```{r}
select(cosmic, `Gene name`, `Accession Number`)
```

```{r}
select(cosmic, `Gene name`:`Sample name`)
```

## Exercise2

Practice using the `select` function to identify particular columns in the dataset. Remember some of the "helper functions" that are available when using select.

<div class="exercise">

- Select the first five, and the last five columns from the table
- Select the columns that start with "Mutation"
- Select all columns that contain information on the tumour Histology. There should be four columns.

</div>


## Exercise3

The `filter` can be used to restrict a data frame to particular rows of interest

<div class="exercise">

Use `filter` to restrict the rows data to..

- Mutations that occur in the breast or prostate
- Mutations that occur in patients under 50

</div>


## Separating columns

You might have noticed that the first column of the data frame comprises both the name of a gene and it's identifier in the [Ensembl database](https://www.ensembl.org/index.html). Representing the data in such a way makes analyses such as counting the number of mutations in a particular gene more complicated than they should be. 

A function to perform this kind of data cleaning can be found in the `tidyr` package.

```{r}
## check if tidyr package is installed, and install if it isn't
if(!require("tidyr")) install.packages("tidyr")
```

The `separate` function takes a data frame as it's first argument and will split a named column into it's component parts. The names of the new columns can be specified. The character (in our case `_`) that is found in the column can also be specified, although it is usually able to detect this automatically.

```{r}
library(tidyr)
separate(cosmic, `Gene name`, into=c("SYMBOL","ENSEMBL"))
```
As usual, we will need to assign the result to a variable if we want to keep this new version of the data.

```{r}
cosmic <- separate(cosmic, 
                   `Gene name`, 
                   into=c("SYMBOL","ENSEMBL"))
```

Now we can make a barplot of how many mutations are found for each gene

```{r}
library(ggplot2)
ggplot(cosmic, aes(x = SYMBOL)) + geom_bar()
```

## Exercise 4

<div class="exercise">
- Is the age distribution of patients that have mutations in `NCOR2` or `SELP` different? Use a violin plot to find out?
- Expand the `count` code above to find the number of mutations in each sample type for the two genes
- Visualise these counts using a `geom_tile` arrangement (see below for example plot)

</div>


```{r echo=FALSE}
count(cosmic, SYMBOL, `Primary site`) %>% 
  ggplot(aes(x = SYMBOL, y=`Primary site`, fill= n)) + geom_tile()
```


## Exercise4

<div class="exercise">

- Separate the position of the mutation into chromosome start and end columns
- Arrange the rows according to chromosome and start position


# Clinical data from the TCGA project

This exercise concerns the clinical descriptions of tumours from The Cancer Genome Archive. It was previously downloaded from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944) and has undergone some minor alterations. 

## Part1

The data can be downloaded using the following code:-


```{r}
## create a more_data folder

dir.create("more_data",showWarnings = FALSE)

## will check if the file exists before downloading

if(!file.exists("more_data/tcga_clinical.tsv")) download.file("https://github.com/sheffield-bioinformatics-core/r-online/raw/master/more_data/tcga_clinical.tsv", destfile = "more_data/tcga_clinical.tsv")
```


<div class="exercise">
**Exercise**: What function from `readr` would you use to read the file `tcga_clinical.tsv` into R? Read the file in. What are the number of rows and columns?

</div>

You should find that the data frame contains a great deal of columns; far too many to be useful. We would like to keep the columns containing the age of the patient, and the tumour stage in our analysis. Rather than opening-up the file, or `View`ing the file in RStudio, we can use a couple of helper functions to identify the relevant column names.

<div class="exercise">
**Exercise**: Use the `select` function in conjunction with `contains` and `starts_with` to identify columns that have `age` or `stage` in their name. The code should look like the following (you will need to fill-in the dots).

</div>

```{r eval=FALSE}
# Complete the code by replacing the ... 

select(..., contains("...."))
select(..., starts_with("...."))

```


<div class="exercise">
**Exercise:** Use the `select` function to create a new data frame that contains the following columns. **These are not the actual columns names**
  - Tumour site
  - Race
  - Gender
  - Age at diagnosis
  - Dead / Alive Status
You can add extra columns if you wish

**See below for example output**
</div>

```{r message=FALSE, echo=FALSE,warning=FALSE}
library(tidyverse)
clin <- readr::read_tsv("../raw_data/tcga_clinical.tsv")
data <- select(clin, 
                tumor_tissue_site,
                race,
                gender,
                age_at_initial_pathologic_diagnosis,
                vital_status)
data
```


<div class="exercise">
**Exercise:** Use the `dplyr` function called `count` to tabulate how which sites are included in the data. Re-arrange the output from `count` using `arrange` to determine the most common type of cancer in the dataset.

**See below for example output**
</div>


```{r echo=FALSE}
count(data, tumor_tissue_site)

```

<div class="exercise">
**Exercise**: Not all samples have an entry for tumour type. Use the `filter` function to create a table with valid entries for `tumor_tissue_site`. Create a barplot to show display the number of occurences of each tumour type

HINT: An easy way to make the labels on the x-axis more legible is to use the `coord_flip` function

```{r eval=FALSE}
ggplot(data, aes(x=...)) + geom_bar() + coord_flip()
```

**See below for example output**
</div>

```{r echo=FALSE}
  filter(data,!is.na(tumor_tissue_site)) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  ggplot(aes(x = tumor_tissue_site)) + geom_bar() + coord_flip()
```


## Part2

We would like to visualise the age of diagnosis, and eventually compare between different disease types, The code we might think to use initially could look like:-

```{r}
## assuming your filtered clinical data is called data

ggplot(data, aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
This doesn't look like the desired output though. If we re-visit the data frame and print the "age" column we notice that the entries in the column are stored as "chr". i.e. characters or text


```{r}
select(data, age_at_initial_pathologic_diagnosis)
```

This has occurred because some entries are "`[Not Available]`" rather than a number or `NA`. As soon as R finds any text within the column, it treats everything in the column as text.

These entries can be filtered in the same manner as previously (when filtering the tissue type column), but this does not solve the problem entirely. 

```{r}
data %>% 
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>% 
ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```

We need to add an additional step which will force R to treat the data in the `age_at_initial_pathologic_diagnosis` column as numerical data. Such a conversion can be done using the `as.numeric` function and the `mutate` function can be used to modify the `age_at_initial_pathologic_diagnosis` column to contain the numeric values

<div class="exercise">
**Exercise**: Use `mutate` and `as.numeric` to convert the values in `age_at_initial_pathologic_diagnosis` into numbers. You will still need to remove the 
`[Not Available]` values beforehand. Now try and create the density plot.
</div>

```{r echo=FALSE}
data %>% 
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>% 
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>% 
ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density()
```
<div class="exercise">
**Exercise**: Use the `facet_wrap` function to compare the distribution of ages between different tumour types
</div>

```{r echo=FALSE}
data %>% 
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>% 
    filter(tumor_tissue_site != "[Not Available]") %>% 
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>% 
 ggplot(aes(x = age_at_initial_pathologic_diagnosis)) + geom_density() + facet_wrap(~tumor_tissue_site)

```

<div class="exercise">
**Exercise**: Do any tumour types have a different age of diagnosis between males and females? Use a boxplot to find out
</div>

```{r echo=FALSE}
data %>% 
  filter(age_at_initial_pathologic_diagnosis != "[Not Available]") %>% 
    filter(tumor_tissue_site != "[Not Available]") %>% 
  mutate(age_at_initial_pathologic_diagnosis = as.numeric(age_at_initial_pathologic_diagnosis)) %>% 
 ggplot(aes(x= gender, y = age_at_initial_pathologic_diagnosis)) + geom_boxplot() + facet_wrap(~tumor_tissue_site)
```

Lets now look at gender split for each cancer type. As a first step, we can group the data by gender and tissue type and obtain counts.

```{r}
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n())
```

These data are ready for plotting, but for comparisons we need to take into account the total number of each tissue type. We can create frequencies rather than absolute numbers by dividing by the total number of cases.

```{r}
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N))
```

Note that the order of the grouping is important here. If we reversed it to `gender` then `tumor_tissue_site` the frequencies would be calculated using the total of males of females.

```{r}
data %>% 
  group_by(gender,tumor_tissue_site) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N))
```


<div class="exercise">
**Exercise**: Create a plot to show the gender split in cases of each tumor type.
</div>

```{r echo=FALSE}
data %>% 
  group_by(tumor_tissue_site,gender) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N)) %>% 
  ggplot(aes(x = gender, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```

<div class="exercise">
**Exercise**: Create a plot to show the proportion of patients dead or alive for each tumour type
</div>


```{r echo=FALSE}
data %>% 
  group_by(tumor_tissue_site,vital_status) %>% 
  filter(tumor_tissue_site != "[Not Available]") %>% 
  filter(vital_status != "[Not Available]") %>%
  summarise(N = n()) %>% 
  mutate(freq = N / sum(N)) %>% 
  ggplot(aes(x = vital_status, y = freq)) + geom_col() + facet_wrap(~tumor_tissue_site)
```

# Palmer Penguins

This fun example concerns a dataset available from:-

https://allisonhorst.github.io/palmerpenguins/

![](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png)

** Artwork by @allison_horst **

```{r}
## check if package already exists, and if not install from CRAN
if(!require("palmerpenguins")) install.packages("palmerpenguins")
```

Once the package is installed, we can load it and `View` the `penguins` dataset, which is included with the package.

```{r}
library(palmerpenguins)
View(penguins)
example("penguins")
```

```{r}
count(penguins, species)
```

<div class="exercise">
- Is the body mass of male penguins more than female penguins? Use a suitable plot to visualise the data.
- Is the trend consistent across different species?
- Use the `summarise` function to calculate the average body mass in males / females for different species
- Can you remove observations for which no `sex` was recorded from the dataset?
</div>

```{r}
ggplot(penguins, aes(x = sex, y = body_mass_g)) + geom_boxplot() + facet_wrap(~species)

group_by(penguins, sex,species) %>%
  summarise(Weight = mean(body_mass_g))

```

<div class="exercise">
Is the body mass related to the flipper length? Use an appropriate plot to find out
</div>

```{r}
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, col = species)) + geom_point() + geom_smooth(method="lm")
```

# Machine Learning Examples

```{r}
if(!require("MLDataR")) install.packages("MLDataR")
```


```{r}
library(MLDataR)
```

```{r}
View(diabetes_data)
```

```{r}
ggplot(diabetes_data,aes(x = DiabeticClass, y = Age, fill=Gender)) + geom_boxplot()
```