---
title: Introduction
layout: default
---
# Introduction
I have been programming in R for over 10 years, spending a lot of time trying to figure out how the language works. Not everyone has the luxury of spending years to understand a programming language, so this book is my attempt to help you to become an effective R programmer as quickly as possible. By reading this book, you will avoid the many mistakes that I made and the many dead ends that I got stuck in, and quickly navigate your way to useful tools and techniques. Although R has its frustrating quirks, I truly believe that at its heart lies an elegant and beautiful language, well tailored for data analysis and statistics. In this book, I'll do my best to reveal that language to you, helping you to understand powerful idioms that allow you to attack many types of problems.
If you are new to R, you might wonder what makes learning such a quirky language worthwhile. To me, some of the best features of R are:
* It is free, open source and easily available on every major platform:
if you do your analysis in R, anyone else can replicate it for free.
* A massive set of packages for statistical modelling, machine learning,
visualisation, data import and data manipulation. Whatever model or
visualisation you're trying to do, the chances are that someone has tried
to do it before and you can learn from their efforts.
* Researchers in statistics and machine learning will often publish an
R package to accompany their research articles. That means you can use
cutting-edge research as soon as it's available.
* Deep features of the language support data analysis. These include missing
values, data frames, and subsetting.
* A fantastic community. It is easy to get help from experts on the
[R-help mailing list](https://stat.ethz.ch/mailman/listinfo/r-help),
[Stack Overflow](http://stackoverflow.com/questions/tagged/r),
or subject-specific mailing lists like
[R-SIG-mixed-models](https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models)
or [ggplot2](https://groups.google.com/forum/#!forum/ggplot2). You
can also connect with other R learners via
[Twitter](https://twitter.com/search?q=%23rstats),
[LinkedIn](http://www.linkedin.com/groups/R-Project-Statistical-Computing-77616),
and through many local
[user groups](http://blog.revolutionanalytics.com/local-r-groups.html).
* Powerful tools for communicating your results. R packages make it easy to
produce HTML or PDF [reports](http://yihui.name/knitr/), or create
[interactive websites](http://www.rstudio.com/shiny/).
* A strong functional programming foundation. The ideas of functional
programming are well suited to solving many data analysis challenges, and
the tools in R give you a powerful and flexible toolkit which allows you
to write concise yet descriptive code.
* An [IDE](http://www.rstudio.com/ide/) tailored to the needs of interactive
data analysis and statistical programming.
* Powerful metaprogramming facilities. R is not just a programming language;
it is also an environment for interactive data analysis. Metaprogramming
makes it possible to write succinct functions that are a little magical
in their ability to reduce the amount of typing you need. R also provides an
excellent environment for designing domain-specific languages.
* R is designed to connect to high-performance programming languages like C,
Fortran and C++.
Of course, R is not perfect. R's biggest challenge is that most R users are not programmers. This means that:
* Much R code you'll see in the wild is not very elegant, fast or easy to
understand. It's usually written in haste to solve a pressing problem, and
has not been rewritten for clarity, elegance or performance.
* Compared to other programming languages, the R community tends to be more
focussed on results, not processes. Knowledge of software engineering
best practices is patchy: not enough R programmers use source code control
or automated testing.
* Metaprogramming is a double-edged sword, and too many R functions use
tricks to reduce the amount of typing at the cost of making code that
is hard to understand and that fails in unexpected ways.
* Inconsistency is rife across contributed packages and even within base R.
The R APIs have evolved over 20 years, and you are confronted with
that fact every time you use R. Having to memorise many special cases
makes learning R tough.
* R is not a particularly fast programming language, and poorly written R code
can be terribly slow. R is also a profligate user of memory, and tools for
parallel processing are not mature.
Personally, I think these challenges are great opportunities for experienced programmers to have a profound positive impact on the R community. R users do care about writing high quality code, particularly for reproducible research, but they don't yet have the skills to do it. I hope this book will help more R users to become R programmers, and encourage programmers from other languages to contribute to R.
## Who should read this book
This book is aimed at two complementary audiences:
* Intermediate R programmers who want to dive deeper into R and learn more
strategies for solving diverse problems.
* Programmers from other languages who are learning R, and want to understand
why R works the way it does.
To get the most out of this book, you will need to have written a decent amount of code in either R or other programming languages. You should be familiar with how functions work in R, although you might not know all the details, and you should be somewhat familiar with the apply family of functions (like `apply()` and `lapply()`), although you may currently struggle to use them effectively.
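As a rough check of that familiarity, the sketch below shows the kind of `lapply()` and `apply()` calls you should be able to read without much effort. The objects `scores` and `m` are made up for illustration, and `sapply()` is included only as a close relative of `lapply()`.

```r
# lapply() calls a function on each element of a list and always returns a list.
scores <- list(a = c(1, 2, 3), b = c(10, 20))
means <- lapply(scores, mean)
str(means)

# sapply() does the same work, then simplifies the result to a vector if it can.
simple_means <- sapply(scores, mean)
simple_means

# apply() works over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix.
m <- matrix(1:6, nrow = 2)
row_totals <- apply(m, 1, sum)   # 9 12
col_totals <- apply(m, 2, sum)   # 3 7 11
```

If the difference between the list returned by `lapply()` and the vector returned by `sapply()` is unclear, that is fine: it is exactly the sort of detail this book aims to explain.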
## What you will get out of this book
This book describes the skills that I think you need to be an advanced R programmer, able to produce code that can be reused in a wide variety of circumstances.
After reading this book, you will:
* Be familiar with the fundamentals of R, so that you can understand complex
data types and simplify the operations performed on them. You will have a
deep understanding of how functions work, and be able to recognise and use
the four object systems in R.
* Understand what functional programming means, and why it is a useful tool
for data analysis. You'll be able to use many existing tools, and
have the knowledge to create your own new functional tools when needed.
* Appreciate the double-edged sword of metaprogramming. You'll be able to
create functions that use non-standard evaluation in a principled way,
saving typing and creating elegant languages for expressing important
operations. You'll also understand the dangers of metaprogramming, and
why you should be careful about its use.
* Have a good intuition for what operations in R are slow, or use a lot of
memory. You'll know how to use profiling to pinpoint performance
bottlenecks, and you'll know enough C++ to convert slow R functions to
fast C++ equivalents.
* Be comfortable reading and understanding the majority of R code.
You'll recognise common idioms (even if you wouldn't use them yourself)
and be able to critique others' code.
## Meta-techniques
There are two meta-techniques that are tremendously helpful for improving your skills as an R programmer: reading the source, and adopting a scientific mindset.
Reading source code is a tremendously useful technique because it exposes you to new ways of doing things. Over time you'll develop a sense of taste as an R programmer, and even if you find something your taste violently objects to, it's still helpful: emulate the things you like and avoid the things you don't like. I think the clarity of my code increased considerably once I started grading code in the classroom, and was exposed to a lot of code I couldn't make heads nor tails of! I think it's a great idea to start by reading the source code for the functions and packages that you use most frequently. Reading the source becomes even more important when you start using more esoteric parts of R; often the documentation will be lacking, and you'll need to figure out how a function works by reading the source and experimenting.
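A minimal sketch of how you might start reading source from the R console follows. It uses `sd()` and `t.test()` from the stats package purely as convenient examples; any function you use often would do.

```r
# Typing a function's name without parentheses prints its source.
sd

# formals() and body() pull the function apart so you can inspect the pieces.
arg_names <- names(formals(sd))
arg_names
body(sd)

# S3 methods are often not exported from their package. methods() lists them,
# and getAnywhere() retrieves the source wherever it lives.
methods("t.test")
getAnywhere("t.test.default")
```

Functions implemented in C show only a call to `.Internal()` or `.Primitive()` when printed, so for those you would need to look in the R source distribution instead.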
A scientific mindset is extremely helpful when learning R. If you don't understand how something works, develop a hypothesis, design some experiments, run them and record the results. This exercise is extremely useful if you can't figure it out and need to get help from others: you can easily show what you tried, and when you learn the right answer, you'll be mentally prepared to update your world view. I often find that when I clearly describe a problem to someone else (the art of creating a [reproducible example](http://stackoverflow.com/questions/5963269)), I figure out the solution myself.
## Recommended reading
R is still a relatively young language, and the resources to help you understand it are still maturing. In my personal journey to understand R, I've found it particularly helpful to refer to resources from other programming languages. R has aspects of both functional and object-oriented (OO) programming languages, and learning how these aspects are expressed in R will help you translate your existing knowledge from other programming languages and identify areas where you can improve.
To understand why R's object systems work the way they do, I found [Structure and Interpretation of Computer Programs](http://mitpress.mit.edu/sicp/full-text/book/book.html) (SICP), by Harold Abelson and Gerald Jay Sussman, particularly helpful. It is a concise but deep book, and after reading it I felt for the first time that I could actually design my own object-oriented system. It was my first introduction to the generic function style of OO common in R, and it helped me understand its strengths and weaknesses. SICP also talks a lot about functional programming, and how to create functions that are simple in isolation, but powerful in combination.
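To make "generic function style" concrete, here is a small sketch using S3, one of R's object systems. The generic `area()` and the classes `circle` and `rectangle` are hypothetical names invented for this illustration; they do not come from SICP or from any package.

```r
# A generic function: it does no work itself, it only dispatches on the
# class of its first argument.
area <- function(shape) UseMethod("area")

# Methods are ordinary functions named generic.class.
area.circle <- function(shape) pi * shape$r^2
area.rectangle <- function(shape) shape$w * shape$h

# Fallback used when an object's class has no method of its own.
area.default <- function(shape) stop("Don't know how to compute an area")

# Objects are plain lists tagged with a class attribute.
circle <- structure(list(r = 1), class = "circle")
rectangle <- structure(list(w = 2, h = 3), class = "rectangle")

area(circle)      # pi
area(rectangle)   # 6
```

Methods here belong to the generic function rather than to the object, which is the main difference from the message-passing style of OO found in languages like Java.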
To understand the tradeoffs that R has made compared to other programming languages, I found [Concepts, Techniques and Models of Computer Programming](http://amzn.com/0262220695?tag=devtools-20) by Peter Van Roy and Seif Haridi extremely helpful. It helped me understand that R's copy-on-modify semantics make it substantially easier to reason about code, and that, while the current implementation in R is not very efficient, this is a solvable problem.
If you want to learn to be a better programmer, there's no place better to turn than [The Pragmatic Programmer](http://amzn.com/020161622X?tag=devtools-20), by Andrew Hunt and David Thomas. The book is programming-language agnostic, so its advice applies just as well to R as to any other language.
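A short sketch of what copy-on-modify means in practice is shown below. The variable names and the helper `double_first()` are made up for the example.

```r
x <- c(1, 2, 3)
y <- x          # for now, y and x refer to the same values

y[1] <- 100     # modifying y triggers a copy, so x is left alone
x               # 1 2 3
y               # 100 2 3

# The same holds across function calls: changing an argument inside a
# function never changes the caller's object.
double_first <- function(v) {
  v[1] <- v[1] * 2
  v
}
z <- double_first(x)
x               # still 1 2 3
z               # 2 2 3
```

Because a function cannot silently change its inputs, you can reason about a call by looking only at its arguments and its return value.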
## Getting help
Currently, there are two main venues to get help when you are stuck and can't figure out what's causing the problem: [Stack Overflow](http://stackoverflow.com) and the R-help mailing list. You can get fantastic help in both venues, but each has its own culture and expectations. It's usually a good idea to spend a little time lurking and learning about community expectations before your first post.
Some good general advice:
* Make sure you have the latest version of R, and the package (or packages)
you are having problems with. It may be that your problem is the result of
a bug that has been fixed recently.
* Spend some time creating a
[reproducible example](http://stackoverflow.com/questions/5963269). This
is often a useful process in its own right, because in the course of making
the problem reproducible you figure out what's causing the problem.
* Look for related problems before posting. If someone has already asked
the question and been answered, it's much faster for everyone if you use
the existing answer.
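As a sketch of what a reproducible example might contain, the snippet below gathers the three things a helper usually needs: the data, the smallest code that shows the problem, and version information. The data frame `problem_data` is a hypothetical stand-in for your own data.

```r
# A stand-in for the data you are having trouble with (hypothetical).
problem_data <- data.frame(id = 1:3, score = c(2.5, NA, 4))

# dput() prints R code that recreates the object, so others can paste it
# straight into their own session.
dput(problem_data)

# The smallest piece of code that still shows the surprising behaviour:
# here, a mean that comes back as NA.
result <- mean(problem_data$score)
result

# Version details let helpers rule out outdated software.
R.version.string
sessionInfo()
```

In this made-up case, writing the example down already points at the cause: the missing value in `score`.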
## Acknowledgements {#intro-ack}
I would like to thank the tireless contributors to R-help and, more recently, [Stack Overflow](http://stackoverflow.com/questions/tagged/r). There are too many to name individually, but I'd particularly like to thank Luke Tierney, John Chambers, Dirk Eddelbuettel, JJ Allaire and Brian Ripley for giving deeply of their time and correcting my countless misunderstandings.
This book was [written in the open](https://github.com/hadley/adv-r/), and chapters were advertised on [Twitter](https://twitter.com/hadleywickham) when complete. It is truly a community effort: many people read the drafts, fixed typos, suggested improvements and contributed content. Without those contributors, the book wouldn't be nearly as good as it is, and I'm deeply grateful for their help.
(Before final version, remember to use `git shortlog` to list all contributors)
## Colophon
This book was written in [R Markdown](http://www.rstudio.com/ide/docs/authoring/using_markdown) inside [RStudio](http://www.rstudio.com/ide/). [knitr](http://yihui.name/knitr/) and [pandoc](http://johnmacfarlane.net/pandoc/) converted the raw R Markdown to HTML and PDF. The [website](http://adv-r.had.co.nz) was made with [jekyll](http://jekyllrb.com/), styled with [bootstrap](http://getbootstrap.com/), and published to Amazon's [S3](http://aws.amazon.com/s3/) with [s3_website](https://github.com/laurilehmijoki/s3_website). The complete source is available from [GitHub](https://github.com/hadley/adv-r).
The code font is [Inconsolata](http://levien.com/type/myfonts/inconsolata.html).