forked from dzchilds/eda-for-bio
-
Notifications
You must be signed in to change notification settings - Fork 0
/
2_02_getting_data.Rmd
105 lines (62 loc) · 12.4 KB
/
2_02_getting_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
# Working directories and data files
## Introduction
R is able to access data from a huge range of different data storage formats and repositories. With the right tools, we can use R to pull in data from various data bases, proprietary storage formats (e.g. Excel), online web sites, or plain old text files. We aren't going to evaluate the many packages and functions used to pull data into R---a whole book could be written about this topic alone. Instead, we're going to examine the simplest method for data import: reading in data from a text file. We'll also briefly look at how to access data stored in packages.
## Data files: the CSV format
Just about every piece of software that stores data in some kind of table-like structure can export those data to a CSV file. The CSV acronym stands for "Comma Separated Values". CSV files are just ordinary text files. The only thing that makes them a CSV file is the fact that they store data in a particular format. This format is very simple: each row of a CSV file corresponds to a row in the data, and each value in a row (corresponding to a different column) is separated by a comma. Here is what the artificial data from the last chapter looks like in CSV format:
```{r, echo=FALSE}
trt <- rep(c("Control","Fertilser"), each = 3)
bms <- c(284, 328, 291, 956, 954, 685)
div <- c(8, 12, 11, 8, 4, 5)
experim.data <- data.frame(trt, bms, div)
write.csv(experim.data, row.names = FALSE)
```
The first line contains the variable names, with each name separated by a comma. It's usually a good idea to include the variable names in a CSV file, though this is optional. After the variable names, each new line is a row of data. Values which are not numbers have double quotation marks around them; numeric values lack these quotes. Notice that this is the same convention that applies to the elements of atomic vectors. Quoting non-numeric values is actually optional, but reading CSV files into R works best when non-numeric values are in quotes because this reduces ambiguity.
### Exporting CSV files from Excel
Those who work with small or moderate data sets (i.e. 100s-1000s of lines) often use Excel to manage and store their data. There are good reasons for why this isn't necessarily a sensible thing to do---for example, Excel has a nasty habit of "helpfully" formatting data values. Nonetheless, Excel is a ubiquitous and convenient tool for data management, so it's important to know how to pull data into R from Excel. It is possible to read data directly from Excel into R, but this way of doing things can be error prone for an inexperienced user and requires us to use an external package (the **readxl** package is currently the best option). Instead, the simplest way to transfer data from Excel to R is to first export the relevant worksheet to a CSV file, and then import this new file using R's standard file import tools.
We'll discuss the import tools in a moment. The initial export step is just a matter of selecting the Excel worksheet that contains the relevant data, navigating to `Save As...`, choosing the `Comma Separated Values (.csv)`, and following the familiar file save routine. That's it. After following this step our data are free of Excel and ready to be read into R.
```{block, type="warning"}
#### Always check your Excel worksheet
Importing data from Excel can turn into a frustrating process if we're not careful. Most problems have their origin in the Excel worksheet used to store the data, rather than R. Problems usually arise because we haven't been paying close attention to a worksheet. For example, imagine we're working with a very simple data set, which contains three columns of data and a few hundred rows. If at some point we accidentally (or even intentionally) add a value to a cell in the forth column, Excel will assume the fourth column is "real" data. When we then export the worksheet to CSV, instead of the expected three columns of data, we end up with four columns, where most of the fourth column is just missing information. This kind of mistake is surprisingly common and is a frequent source of confusion. The take-home message is that when Excel is used to hold raw data should be used to do just that---the worksheet containing our raw data should hold only that, and nothing else.
```
## The working directory
Before we start worrying about data import, we first need to learn a bit about how R searches for the files that reside on our computer's hard drive. The key concept is that of the "working directory". A "directory" is just another word for "folder". The working directory is simply a default location (i.e. a folder) R uses when searching for files. The working directory must always be set, and there are standard rules that govern how this is chosen when a new R session starts. For example, if we start R by double clicking on an script file (i.e. a file with a ".R" extension), R will typically set the working directory to be the location of the R script. We say typically, because this behaviour can be overridden.
There's no need to learn the rules for how the default working directory is chosen, because we can always use R/RStudio to find out which folder is currently set as the working directory. Here are a couple of options:
1. When RStudio first starts up, the __Files__ tab in the bottom right window shows us the contents (i.e. the files and folders) of the working directory. Be careful though, if we use the file viewer to navigate to a new location this does not change the working directory.
2. The `getwd` function will print the location to the working directory to the Console. It does this by displaying a __file path__. If you're comfortable with file paths then the output of `getwd` will make perfect sense. If not, it doesn't matter too much. Use the RStudio __Files__ tab instead.
Why does any of this matter? We need to know where R will look for files if we plan to read in data. Fortunately, it's easy to change the working directory to a new location if we need to do this:
1. Using RStudio, we can set the working directory via the `Session > Set Working Directory... > Choose Directory...` menu. Once this menu item is selected we're presented with the standard file/folder dialogue box to choose the working directory.
2. Alternatively, we can use a function called `setwd` at the Console, though once again, we have to be comfortable with file paths to use this. Using RStudio is easier, so we won't demonstrate how to use `setwd`.
## Importing data with `read.csv`
Now that we know roughly how a CSV file is formatted, and where R will look for such files, we need to understand how to read them into R. The standard R function for reading in a CSV file is called `read.csv`. There are a few other options (e.g. `read_csv` from the **readr** package), but we'll use `read.csv` because it's part of the base R distribution, which means we can use it without relying on an external package.
The `read.csv` function does one thing: given the location of a CSV file, it will read the data into R and return it to us as a data frame. There are a couple of different strategies for using `read.csv`. One is considered good practice and is fairly robust. The second is widely used, but creates more problems than it solves. We'll discuss both, and then explain why the first strategy is generally better than the second.
#### Strategy 1---set the working directory first
Remember, the working directory is the default location used by R to search for files. This means that if we set the working directory to be wherever our data file lives, we can use the `read.csv` function without having to tell R where to look for it. Let's assume our data is in a CSV file called "my-great-data.csv". We should be able to see "my-great-data.csv" in the __Files__ tab in RStudio if the working directory is set to its location. If we can't see it there, the working directory still needs to be set (e.g. via `Session > Set Working Directory... > Choose Directory...`).
Once we've set the working directory to this location, reading the "my-great-data.csv" file into R is simple:
```{r, eval=FALSE}
my_data <- read.csv(file = "my-great-data.csv", stringsAsFactors = FALSE)
```
R knows where to find the file because we first set the working directory to be the location of the file. If we forget to do this R will complain and throw an error. We have to assign the output a name so that we can actually use the new data frame (`my_data` in this example), otherwise all that will happen is the resulting data frame is read in and printed to the Console.
#### Strategy 2---use the full path to the CSV file
If we are comfortable with "file paths" then `read.csv` can be used without bothering to set the working directory. For example, if we have the CSV file called "my-great-data.csv" on the in a folder called "r_data", then on a Unix machine we might read it into a data frame using something like:
```{r, eval=FALSE}
my_data <- read.csv(file = "~/r_data/my-great-data.csv", stringsAsFactors = FALSE)
```
When used like this, we have to give `read.csv` the full path to the file. This assumes of course that we understand how to construct a file path---the details vary depending on the operating system.
#### Why use the first strategy?
Both methods get to the same end point: after running the code at the Console we should end up with an object called `my_data` in the global environment, which is a data frame containing the data in the "my-great-data.csv" file. So why should we prefer
1. Many novice R users with no experience of programming struggle with file paths, leading to a lot of frustration and wasted time trying to specify them. The first method only requires us to set the working directory with RStudio and know the name of the file we want to read in. There's no need to deal with file paths.
2. The second strategy creates problems when we move our data around or work on different machines, as the file paths will need to be changed in each new situation. The first strategy is robust to such changes. For example, if we move all our data to a new location, we just have to set the working directory to the new location and our R code will still work.
## Importing data with RStudio (Avoid this!)
It is also possible to import data from a CSV file into R using RStudio. The steps are as follows:
1. Click on the __Environment__ tab in the top right pane of RStudio
2. Select `Import Dataset > From Text File...`
3. Select the CSV file to read into R and click Open
4. Enter a name (no spaces allowed) or stick with the default and click Import
We're only pointing out this method because new users are often tempted to use it---we **do not** recommend it. Why? It creates the same kinds of problems as the second strategy discussed above. All RStudio does generate the correct usage of a function called `read_csv` (from the **readr** package) and evaluate this at the Console. The code isn't part of a script so we have to do this every time we want to work with the data file. It's easy to make a mistake using this approach, e.g. by accidentally misnaming the data frame or reading in the wrong data. It may be tempting to copy the generated R code into a script. However, we still have the portability problem outlined above to deal with. Take our word for it. The RStudio-focussed way of reading data into R just creates more problems than it solves. Don't use it.
## Package data
Remember what Hadley Wickam said about packages? "... <Packages> include reusable R functions, the documentation that describes how to use them, __and sample data__." Many packages come with one or more sample data sets. These are very handy, as they're used in examples and package vignettes. We can use the `data` function to get R to list the data sets hiding away in packages:
```{r, eval=FALSE}
data(package = .packages(all.available = TRUE))
```
The mysterious `.packages(all.available = TRUE)` part of this generates a character vector with the names of all the installed packages in it. If we only use `data()` then R only lists the data sets found in a package called `datasets`, and in packages we have loaded and attached in the current R session using the `library` function.
The `datasets` package is part of the base R distribution. It exists only to store example data sets. The package is automatically loaded when we start R, i.e. there's no need to use `library` to access it, meaning any data stored in this package can be accessed every time we start R. We'll use a couple of data sets in the `datasets` package later to demonstrate how to work with the **dplyr** and **ggplot2** packages.