forked from dzchilds/eda-for-bio
-
Notifications
You must be signed in to change notification settings - Fork 0
/
1_09_using_packages.Rmd
106 lines (67 loc) · 14.9 KB
/
1_09_using_packages.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
# Packages
## The R package system {#package-system}
The R package system is probably the most important single factor driving increased adoption of R among quantitatively-minded scientists. Packages make it very easy to extend the basic capabilities of R. In [his book](http://r-pkgs.had.co.nz) about R packages Hadley Wickam says,
> Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.
An R package is just a collection of folders and files in a standard, well-defined format. They bundle together computer code, data, and documentation in a way that is easy to use and share with other users. The computer code might all be R code, but it can also include code written in other languages. Packages provide an R-friendly interface to use this "foreign" code without the need to understand how it works.
The base R distribution it comes with quiet a few pre-installed packages. These are "mature" packages that implement widely used statistical and plotting functionality. These base R packages represent a very small subset of the available R packages. The majority of these are hosted on a network of web servers around the world collectively know as [CRAN](http://cran.r-project.org). This network---known as a repository---is the same one we used to download the base R distribution in the [Get up and running with R and RStudio] chapter. CRAN stands for the Comprehensive R Archive Network, pronounced either "see-ran" or "kran". CRAN is a fairly spartan web site, so it's easy to navigate.
When we [navigate to CRAN](http://cran.r-project.org) we see about a dozen links of the right hand side of the home page. Under the _Software_ section there is a link called [Packages](http://cran.r-project.org/web/packages/). Near the top of this page there is a link called [Table of available packages, sorted by name](http://cran.r-project.org/web/packages/available_packages_by_name.html) that points to a very long list of all the packages on CRAN. The column on the left shows each package name, followed by a brief description of what the package does on the right. There are a huge number of packages here (over 12000 at the time of writing).
## Task views
A big list of packages presented like is overwhelming. Unless we already know the name of the package we want to investigate, it's very hard to find anything useful by scanning the "all packages" table. A more user-friendly view of many R packages can be found on the [Task Views](http://cran.r-project.org/web/views/) page (the link is on the left hand side, under the section labelled _CRAN_). A Task View is basically a curated guide to the packages and functions that are useful for certain disciplines. The Task Views page shows a list of these discipline-specific topics, along with a brief description.
The [Environmentrics](http://cran.r-project.org/web/views/Environmetrics.html) Task View maintained by Gavin Simpson contains information about using R to analyse ecological and environmental data. It is not surprising this Task View exists. Ecologists and environmental scientists are among the most enthusiastic R users. This view is a good place to start looking for a new package to support a particular analysis in a future project. The [Experimental Design](http://cran.r-project.org/web/views/ExperimentalDesign.html), [Graphics](http://cran.r-project.org/web/views/Graphics.html), [Multivariate](http://cran.r-project.org/web/views/Multivariate.html), [Phylogenetics](http://cran.r-project.org/web/views/Phylogenetics.html), [Spatial](http://cran.r-project.org/web/views/Spatial.html), [Survival](http://cran.r-project.org/web/views/Survival.html) and [Time Series](http://cran.r-project.org/web/views/TimeSeries.html) Task Views all contain many useful packages for biologists and environmental scientists.
## Using packages {#use-packages}
Two things need to happen in order for us to use a package. First, we need to ensure that a copy of the folders and files that make up the package are copied to an appropriate folder on our computer. This process of putting the package files into the correct location is called __installing__ the package. Second, we need to __load and attach__ the package for use in a particular R session. As always, the word "session" refers to the time between when we start up R and close it down again. It's worth unpacking these two ideas a bit, because packages are a frequent source of confusion for new users:
* If we don't have a copy of a package's folders and files in the right format and the right place on our computer we can't use it. This is probably fairly obvious. The process of making this copy is called __installing__ the package. It is possible to manually install packages by going to the CRAN website, downloading the package, and then using various tools to install it. We won't be using this approach though because it's both inefficient and error prone. Instead, we'll use built-in R functions to grab the package from CRAN and install it for us, all in one step.
* We don't need to re-install a packages we plan to use every time we start a new R session. It is worth saying that again, __there is no need to install a package every time we start up R / RStudio__. Once we have a copy of the package on our hard drive it will remain there for us to use. The only exception to this rule is that a major update to R (not RStudio!) will sometimes require a complete re-install of the packages. This is because the R installer will not copy installed packages to the major new version of R. These major updates are fairly infrequent though, occurring perhaps every 1-2 years.
* Installing a package does nothing more than place a copy of the relevant files on our hard drive. If we actually want to use the functions or the data that comes with a package we need to make them available in our current R session. Unlike package installation this __load and attach__ process as it's known has to be repeated every time we restart R. If we forget to load up the package we can't use it.
### Viewing installed packages
We sometimes need to check whether a package is currently installed. RStudio provides a simple, intuitive way to see which packages are installed on our computer. The __Packages__ tab in the top right pane of RStudio shows the name of every installed package, a brief description (the same one seen on CRAN) and a version number. We can also manage our packages from this tab, as we are about to find out.
There are also a few R functions that can be used to check whether a package is currently installed. For example, the `find.package` function can do this:
```{r}
find.package("MASS")
```
This either prints a "file path" showing us where the package is located, or returns an error if the package can't be found. Alternatively, the function called `installed.packages` returns something called a data frame (these are discussed later in the book) containing a lot more information about the installed packages.
### Installing packages
R packages can be installed from a number of different sources. For example, they can be installed from a local file on a computer, from the CRAN repository, or from a different kind of online repository called Github. Although various alternatives to CRAN are becoming more popular, we're only going to worry about installing packages that live on CRAN in this book. This is no bad thing---the packages that live outside CRAN tend to be a little more experimental.
In order to install a package from an online repository like CRAN we have to first download the package files, possibly uncompress them (like we would a ZIP file), and move them to the correct location. All of this can be done at the Console using a single function: `install.packages`. For example, if we want to install a package called __fortunes__, we use:
```{r, message = TRUE, eval = FALSE}
install.packages("fortunes")
```
The quotes are necessary by the way. If everything is working---we have an active internet connection, the package name is valid, and so on---R will briefly pause while it communicates with the CRAN servers, we should see some red text reporting back what's happening, and then we're returned to the prompt. The red text is just letting us know what R is up to. As long as this text does not include the word "error", there is usually no need to worry about it.
There is nothing to stop us using `install.packages` to install more than one package at a time. We are going to use __dplyr__ and __ggplot2__ later in the book. Since neither of these is part of the base R distribution, we need to download and install them from CRAN. Here's one way to do this:
```{r, eval = FALSE}
pckg.names <- c("dplyr", "ggplot2")
install.packages(pckg.names)
```
There are a couple of things to keep in mind. First, package names are case sensitive. For example, __fortunes__ is not the same as __Fortunes__. Quite often package installations fail because we used the wrong case somewhere in the package name. The other aspect of packages we need to know about is related to __dependencies__: some packages rely on other packages in order to work properly. By default `install.packages` will install these dependencies, so we don't usually have to worry too much about them. Just don't be surprised if the `install.packages` function installs more than one package when only one was requested.
```{block, type="action"}
#### Install dplyr and ggplot2
We're going to be using **dplyr** and **ggplot2** packages later in the book. If they aren't already installed on your computer (check with `find.package`), now is a good time to install them so they're ready to use later.
```
RStudio provides a way of interacting with `install.packages` via point-and-click. The __Packages__ tab has an "Install"" button at the top right. Clicking on this brings up a small window with three main fields: "Install from", "Packages", and "Install to Library". We only need to work with the "Packages" field -- the other two can be left at their defaults. When we start typing in the first few letters of a package name (e.g. __dplyr__) RStudio provides a list of available packages that match this. After we select the one we want and click the "Install" button, RStudio invokes `install.packages` with the appropriate arguments at the Console for us.
```{block, type="warning"}
#### Never use `install.packages` in scripts
Because installing a package is a "do once" operation, it is almost never a good idea to place `install.packages` in a typical R script. A script may be run 100s of times as we develop an analysis. Installing a package is quite time consuming, so we don't really want to do it every time we run our analysis. As long as the package has been installed at some point in the past it is ready to be used and the script will work fine without re-installing it.
```
### Loading and attaching packages
Once we've installed a package or two we'll probably want to actually use them. Two things have to happen to access a package's facilities: the package has to be loaded into memory, and then it has to attached to something called a search path so that R can find it. It is beyond the scope of this book to get in to "how" and "why" of these events. Fortunately, there's no need to worry about these details, as both loading and attaching can be done in a single step with a function called `library`. The `library` function works exactly as we might expect it to. If we want to start using the `fortunes` package---which was just installed above---all we need is:
```{r}
library("fortunes")
```
Nothing much happens if everything is working as it should. R just returns us to the prompt without printing anything to the Console. The difference is that now we can use the functions that __fortunes__ provides. As it turns out, there is only one, called fortune:
```{r, eval=FALSE}
fortune()
```
```{r, eval=TRUE, echo=FALSE}
fortune("Cryer")
```
The __fortunes__ package is either very useful or utterly pointless, depending on ones perspective. It dispenses quotes from various R experts delivered to the venerable R mailing list (some of these are even funny).
Once again, if we really don't like working in the Console RStudio can help us out. There is a small button next to each package listed in the __Packages__ tab. Packages that have been loaded and attached have a blue check box next to them, whereas this is absent from those that have not. Clicking on an empty check box will load up the package. Try this. Notice that all it does is invoke `library` with the appropriate arguments for us (RStudio explicitly sets the `lib.loc` argument, whereas above we just relied on the default value).
```{block, type="warning"}
### Don't use RStudio for loading packages!
We looked at how it works, because at some point most people realise they can use RStudio to load and attach packages. We don't recommend using this route though. It's much better to put `library` statements into a script. Why? Because if we rely on RStudio to load packages, we have to do this every time we want to run a script, and if we forget one we need, the script won't work. This is another example of where relying on RStudio ultimately makes things more, not less, challenging.
One last tip: we can use library anywhere, but typically the `library` expressions live at the very beginning of a script so that everything is ready to use later on.
```
### An analogy
The package system often confuses new users. The reason for this stems from the fact that they aren't clear about what the `install.packages` and `library` functions are doing. One way to think about these is by analogy with smartphone "Apps". Think of an R package as being analogous to a smartphone App--- a package effectively extends what R can do, just as an App extends what a phone can do.
When we want to try out a new App we have to first download it from an App store and install it on our phone. Once it has been downloaded, an App lives permanently on the phone (unless we delete it!) and can be used whenever it's needed. Downloading and installing the App is something we only have to do once. Packages are no different. When we want to use an R package we first have to make sure it is installed on the computer. This is effectively what `install.packages` does: it grabs the package from CRAN (the "App store") and installs it on our computer. Installing a package is a "do once" operation. Once we've installed it, we don't need to install a package again each time we restart R. The package is sat on the hard drive ready to be used.
In order to actually use an App which has been installed on a phone we open it up by tapping on its icon. This obviously has to happen every time we want to use the App. The package equivalent of opening a smartphone App is the "load and attach" operation. This is what `library` does. It makes a package available for use in a particular session. We have to use `library` to load the package every time we start a new R session if we plan to access the functions in that package: loading and attaching a package via `library` is a "do every time" operation.