forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPerformance.rmd
409 lines (292 loc) · 25.6 KB
/
Performance.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
---
title: Performance
layout: default
---
# Performance
```{r, echo = FALSE}
library(microbenchmark)
options(digits = 3)
options("microbenchmark.unit" = "ns")
```
<!--
Once more complete, circulate to Justin, Radford, Alex & Jan for comments.
-->
## Introduction
R is not a fast computer language. This is not an accident: R has been thoughtfully designed to make it easier for you to solve data analysis and statistics challenges, not to make your computer's life easier. While R is slow compared to other programming languages, it certainly be can fast enough for most purposes. The goal of this part of the book is to give you a deeper understanding into R's performance characteristics. You'll learn where R is fast, where it's slow, and how you can improve performance when it becomes a problem.
The five chapters cover:
* In this chapter, I'll discuss some of the reasons why R is slow, and
help you build a better intuition for what operations are likely to be
problematic. I'll also talk a little about the future of R, and what
bottlenecks can be fixed automatically. Throughout the chapter I'll
use microbenchmarking to quantitatively explore R's performance
characterisics.
* In [Profiling](#profiling), you'll learn concrete tools for making
your code faster by first figuring out what you need to optimise and
then learning some general tools to optimise it.
* In [Memory](#memory), you'll learn about how R uses memory, and some
common performance problems.
* For really high-performance code, you can move outside of R and use
another programming language. [Rcpp](#rcpp) will teach you the absolute
minimum you need to know about C++ in order to write fast code with the
Rcpp package.
* Finally, if you want to deeply understand the performance of built in
base functions, you'll need to learn a little bit about R's C api. In
[R's C interface](#c-api), you'll learn more about R's C internals
and see how some common built-in functions work.
Let's get started by learning more about why R is slow.
## Why is R slow?
To understand R's performance it helps to think about R in two ways: as a language, and as an implementation. The R-language is abstract. It defines what R code means and how it should work. The is only one language, but there might be multiple implementations. An implementation is concrete: you give it R code and it computes the result. To make this distinction clear I'll use R-language to refer to the language, and GNU-R to refer to the implementation you download from [r-project.org](http://r-project.org).
The distinction between language and implementation is a bit murky for R because there is only one implementation, and the language is mostly defined in terms of that implementation. Other languages, like [C++](http://isocpp.org/std/the-standard) and [javascript](http://www.ecma-international.org/publications/standards/Ecma-262.htm) make the distinction more clear. They have formal specifications that describe in minute detail how every aspect of the language should work, and multiple implementation. Even though the distinction between the R-language and GNU-R isn't so clear cut, it's still useful, because it lets us separate fundamental and incidental reasons for slowness.
The R-language was not designed to generate fast machine-code, it was designed to facilitate data analysis. S, the precursor of R, was basically used to stitch together various specialised Fortran and C programs into a coherent analysis. This vision has deeply informed R. The R-language favours expressiveness over speed, because you can also connect to faster languages when you need to. R is not designed to be fast, it's designed to be expressive. This imposes some fundamental challenges to making R fast, as discussed in [language performance](#language-performance).
The design of the language constrains the maximum possible performance, but GNU-R is far from the best possible speed. It's hard to know exactly how much faster a better implementation could be, but a 10x improvement in speed seems achievable. I'll discuss some of the features of the current that are slow and why performance improvements to GNU-R happen slowly in [implementation performance](#implementation-performance). Finally, in [alternative implementations](#faster-r) I'll discuss some of the promising new implementations of R, and illustrate the most important idea they use to make R code faster.
It's also worthwhile to bear in mind that most of the base packages in GNU-R is not written in R. The most important bottlenecks have already been written in C or Fortran. This means that even if R did get 10x or 100x faster, the impact on your code is likely to be much smaller.
Finally, possibly the most important reason why most R code is slow is because most R code is poorly written and slow. Hardly any R users are professional software developers, and few are programmers by training. Most people using R use it to better understand their data, and it's generally more important to get a solution quickly rather than to develop a system that will work for a wide variety of inputs. This means that most R code can be made much faster, as described in [profiling](#profiling).
If you'd like to learn more about the performance characteristics of the R-language and how they affect real code, I'd particularly recommend the article [Evaluating the Design of the R Language](https://www.cs.purdue.edu/homes/jv/pubs/ecoop12.pdf) by Floreal Morandat, Brandon Hill, Leo Osvald and Jan Vitek. It discusses a power methodology for understanding the performance characteristics of R using a modified R interpreter and a wide set of code found in the wild.
Before we continue on to explore some of the slow parts of the R-language and GNU-R, we need to learn a little about benchmarking, so that we can make our intuition about performance concrete.
## Microbenchmarking
A microbenchmark is a performance measurement of a very small piece of code, something that might take microseconds (µs) or nanoseconds (ns) to run. I'm going to use microbenchmarks to demonstrate the performance of very low-level pieces of R, to help develop your intution for how R works. This intuition, by-and-large, is not practically useful because in most real code saving microseconds does not have an appreciable effect on run time. You should not change the way you code because of these microbenchmarks, but instead wait until the next chapter to learn how to speed up your code.
The best tool for microbenchmarking in R is (surprise!) the [microbenchmark][microbenchmark] package, which offers very precise timings, making it possible to compare operations that only take a tiny amount of time. For example, the following code compares the speed of two ways of computing a square root. `microbenchmark()` takes a multiple expressions as input, and returns information about the distribution of timings. By default, `microbenchmark()` runs each expression 100 times, which you can control with the `times` parameter. It summarises the results with a minimum (`min`), lower quartile (`lq`), median, upper quartile (`uq`) and maximum (`max`). You normally want to focus on the median. The upper and lower quartiles (`lq` and `uq`) are also useful to get a feel for the variability of the timings.
```{r}
library(microbenchmark)
x <- runif(100)
microbenchmark(
sqrt(x),
x ^ 0.5
)
```
In this example, you can see that using the special purpose `sqrt()` function is faster than general exponentiation operator. Note the units: each computation takes about 800 ns, 800 billionths of a second.
`microbenchmark()` times each expression 100 times, it doesn't time 100 expression and return only the average time. This is important because it can randomise the order of the timings controlling for systematic variability. While the deafult print method only displays a five number summary, all individual timings are stored in the output object. You can visualise the variability by using the `boxplot()` method, or if if ggplot2 is loaded, with `autoplot()`.
To help calibrate the impact of a microbenchmark on run time, it's helpful to think about how many times an function needs to run before it takes a second. If a microbenchmark takes:
* 1 ns, then one billion calls takes a second
* 1 µs, then one million calls takes a second
* 1 ms, then one thousand calls takes a second
### Exercises
* Instead of using `microbenchmark()`, you could use the built-in
`system.time()`. But `system.time()` is much less precise, so you'll
need to repeat each operation many times with a loop, and then divide
to find the average time of each operation, as in the code below.
How do the estimates from `system.time()` compare to those from
`microbenchmark()`? Why are they different?
```{r, eval = FALSE}
n <- 1:1e6
system.time(for (i in n) sqrt(x)) / length(n)
system.time(for (i in n) x ^ 0.5) / length(n)
```
* Here are two other ways to compute the square root of a vector. Which
do you think will be fastest? Which will be slowest? Use microbenchmarking
to confirm your answers.
```{r, eval = FALSE}
x ^ (1 / 2)
exp(log(x) / 2)
```
* Use microbenchmarking to rank the basic arithmetic operators (`+`, `-`,
`*`, `/`, and `^`) in terms of speed. Visualise the results.
## Language performance
In this section I'll explore three trade-offs that limit the performance of the R-language: extreme dynamism, few built-in constants, and lazy evaluation of function arguments. I'll illustrate each reason with a microbenchmark, showing how it slows GNU-R down. You can't benchmark a language, since it's an abtract construct, so the benchmarks are only suggestive of the cost of these decisions to the language, but are nevertheless useful.
Designing a useful language is a delicate balancing act. There are many options and many tradeoffs, and you need to balance between speed, flexibility and ease of implementation. I hope the three examples will give you a sense of this balance.
### Extreme dynamism
R is an extremely dynamic programming language, and almost anything can be modified after it's created. You can:
* change the body, arguments and environment of functions
* change the S4 methods for a generic
* add new fields to an S3 object, or even change its class
* modify objects outside of the local environment with `<<-`
The big advantage of dyanmism is that you don't need to do upfront planning, and you don't need an initial compilation step. You can change your mind at any point, without having to start again from scratch.
The disadvantage of dynamism is that it makes code slow because it's harder to predict exactly what will happen for a given function call. The easier it is to predict what's going to happen, the more likely an interpreter or compiler can jump directly to the fastest implementation. (If you'd like more details, Charles Nutter expands on this idea at [On Languages, VMs, Optimization, and the Way of the World](http://blog.headius.com/2013/05/on-languages-vms-optimization-and-way.html).) If an interpreter can't predict what's going to happen, it has to look through many options to find the best one, which is an expensive operation.
The following microbenchmark illustrates the cost of method dispatch for S3, S4, and RC. I create a generic and a method for each OO system, then call the generic and see how long it takes to find and call the method. I also time how long it takes to call the bare method for comparison.
```{r, results = 'hide'}
f <- function(x) NULL
s3 <- function(x) UseMethod("s3")
s3.integer <- f
A <- setClass("A", representation(a = "list"))
setGeneric("s4", function(x) standardGeneric("s4"))
setMethod(s4, "A", f)
B <- setRefClass("B", methods = list(rc = f))
a <- A()
b <- B$new()
```
```{r}
microbenchmark(
fun = f(),
S3 = s3(1L),
S4 = s4(a),
RC = b$rc()
)
```
On my computer, the function call takes about 280 ns. S3 method dispatch takes an additional 3,000 ns; S4 dispatch, 13,000 ns; and RC dispatch, 11,000 ns. Method dispatch is so expensive because R must look for the function every time the method is called; it might have changed between this time and the last time. R could do better by caching methods between calls, but caching is hard to do correctly and a notorious source of bugs. In many situations, compiled languages only need to method lookup once, during compilation, because they know that method can't be change at run-time. This makes the cost of method dispatch extremely small.
### Few built-in constants
R has very few number of built-in constants. Almost every operation is a lexically scoped function call. For example, in the following simple function contains four function calls: `{`, `(`, `+`, `^`.
```{r}
f <- function(x, y) {
(x + y) ^ 2
}
```
These functions are not hard coded constants, and can be overriden by you. This means that to find the definition of each function, R has to look through every environment on the search path, which could easily be 10 or 20 environments. It would be possible to change this behaviour. It would probably affect very little code (it's a bad idea to override `{` or `(`!), but it would require substantial work by R core.
The following microbenchmark hints at the performance costs. We create four versions of `f()`, each with one more environment (containing 26 bindings) between the environment of `f()` and the base environment where `+`, `^`, `(`, and `{` are defined.
```{r}
random_env <- function(parent = globalenv()) {
letter_list <- setNames(as.list(runif(26)), LETTERS)
list2env(letter_list, envir = new.env(parent = parent))
}
set_env <- function(f, e) {
environment(f) <- e
f
}
f2 <- set_env(f, random_env())
f3 <- set_env(f, random_env(environment(f2)))
f4 <- set_env(f, random_env(environment(f3)))
f_b <- set_env(f, baseenv())
microbenchmark(
f(1, 2),
f2(1, 2),
f3(1, 2),
f4(1, 2),
times = 1000
)
```
Each additional environment between `f()` and the base environment makes the function slower by about 50ns. Most other languages have many more built in constants that you can't override. This means that they always know exactly what `+`, `-`, `{` and `(` and they don't need to waste time repeatedly looking up their definitions. The cost of that decision is it means make writing the interpreter a little harder (because there are more special cases), and the language isn't quite as flexible.
### Lazy evaluation overhead
In R, functions arguments are evaluated lazily (as discussed in [lazy evaluation](#lazy-evaluation) and [capturing expressions](#capturing-expressions)). To implement lazy evaluation, when a function is called, R creates a promise object that contains the expression needed to compute the result, and the environment in which to perform the computation. Creating these objects has some overhead, so every additional argument to an R function slows it down a little.
The following microbenchmark compares the run time of a very simple function. Each version of the function has one extra argument, which allows us to see that each an additional argument costs about 20 ns.
```{r promise-cost}
f0 <- function() NULL
f1 <- function(a = 1) NULL
f2 <- function(a = 1, b = 1) NULL
f3 <- function(a = 1, b = 2, c = 3) NULL
f4 <- function(a = 1, b = 2, c = 4, d = 4) NULL
f5 <- function(a = 1, b = 2, c = 4, d = 4, e = 5) NULL
m <- microbenchmark(f0(), f1(), f2(), f3(), f4(), f5(), times = 1000)
```
In most other programming languages there is no overhead for adding extra arguments. Many compiled languages will even warn you if arguments are never used, and automatically remove them from the function.
### Exercises
* How does the performance of S3 method dispatch change with the length
of the class vector? How does performance of S4 method dispatch change
with number of superclasses? How about RC?
* `scan()` has the most arguments (21) of any base function. About how
much time does it take to make 21 promises each time scan is called?
Given a simple input (e.g. `scan(text = "1 2 3", quiet = T)`) what
proportion of the total run time is due to creating those promises?
## Implementation performance
The design of R-language limits its maximum theoretical performance. But GNU-R is currently nowhere close to that maximum, and there are many things that can (and will) be done to speed it up. This section discusses some of the parts of GNU-R that are currently slow not because of the language, but because no one has had the time to make them fast.
R is over 20 years old, and contains nearly 800,000 lines of code (about 45% C, 19% R, and 17% fortran). Changes to base R can only be made by members of the R Core Team (or R-core for short). R-core contains [20 members](http://www.r-project.org/contributors.html), but only six are actively involved in day-to-day development. No one on R-core is paid to work on R full time. Most are statistics professors, and can only spend a relatively small amount of their time working on R.
Because R-core lacks full-time software developers, it has accumulated considerable technical debt. One component of this debt is the lack of unit tests. Without unit tests, it's difficult to improve performance without accidentally changing behaviour. Because of the care that must be taken to avoid breaking existing code, R-core tends to be very conservation about accepting new code. It can be frustrating to see obvious performance improvements rejected by R-core, but the driving motivation for R-core is not to make R faster, it's to build a stable platform for statistical computing.
Next, I'll show two small illustration of slow parts of R, but could be made faster with some effort. They are not critical parts of base R, but they have frustrated me in the past. As with the microbenchmarks above, these do not generally impact the average performance of most funtion, but can be important for special cases.
### Extracting a single value from a data frame
The following microbenchmark shows seven ways to access a single value (the number in the bottom-right corner) from the built-in `mtcars` dataset. The variation in performance is startling: the slowest method takes 30x longer than the method. There's no reason for there to be such a huge difference in performance, but no one has spent the time to make the slowest methods faster.
```{r}
microbenchmark(
mtcars["Volvo 142E", "carb"],
mtcars[32, 11],
mtcars[[c(11, 32)]],
mtcars[["carb"]][32],
mtcars[[11]][32],
mtcars$carb[32],
.subset2(mtcars, 11)[32],
unit = "us"
)
```
### `ifelse()`, `pmin()`, and `pmax()`.
Some base functions are known to be slow. For example, look at the following three implementations of a function to `squish()` a vector to make the sure smallest value is at least `a` and the largest value is at most `b`. The first implementation, `squish_ife()` uses `ifelse()`. `ifelse()` is known to be slow because it is relatively general, and must evaluate all arguments fully. The second implementation, `squish_p()`, uses `pmin()` and `pmax()` which should be much faster because they're so specialised. But they're actually rather slow because they can take any number of arguments and have to do some relatively complicated checks to determine which method to use. The final implementation uses basic subassignment.
```{r}
squish_ife <- function(x, a, b) {
ifelse(x <= a, a, ifelse(x >= b, b, x))
}
squish_p <- function(x, a, b) {
pmax(pmin(x, b), a)
}
squish_in_place <- function(x, a, b) {
x[x <= a] <- a
x[x >= b] <- b
x
}
x <- runif(100, -1.5, 1.5)
microbenchmark(
squish_ife(x, -1, 1),
squish_p(x, -1, 1),
squish_in_place(x, -1, 1),
unit = "us"
)
```
There's quite a variation in speed: using `pmin()` and `pmax()` is about 3x faster than using `ifelse()`, and using subsetting directly is about twice as fast again. As we'll see in [Rcpp](#rcpp), we can often do even better by calling out to C++. The following example compares the best R implementation to a relatively simple (if verbose) implementation in C++. Even if you've never used C++, you should be able to follow the basic strategy: we loop over every element in the vector and performance a difference action depending on whether the values less than `a`, greater than `b`, or ok as is. The C++ implementation is around 3x faster.
```{r, engine = "Rcpp"}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector squish_cpp(NumericVector x, double a, double b) {
int n = x.length();
NumericVector out(n);
for (int i = 0; i < n; ++i) {
double xi = x[i];
if (xi < a) {
out[i] = a;
} else if (xi > b) {
out[i] = b;
} else {
out[i] = xi;
}
}
return out;
}
```
```{r}
microbenchmark(
squish_in_place(x, -1, 1),
squish_cpp(x, -1, 1),
unit = "us"
)
```
### Exercises
* The performance characteristics of `trunc_ife()`, `trunc_p()` and
`trunc_in_place()` vary considerably with the size of `x`. Explore the
differences. For what sizes of input is the difference biggest? Smallest?
* Compare the performance costs of extract an element from a list, a
column from a matrix, and a column from a data frame. What about a row?
## Alternative R implementations {#faster-r}
There are some exciting new implementations of R. They all try to stick as closely as possible to the existing language definition, but implement ideas from modern interpreter design to make the implementation as fast as possible. The three most mature projects are:
* [Pretty quick R](http://www.pqr-project.org/), by Radford Neal. Built on top of existing R code base (2.15.0). Fixes many obvious performance issues. Better memory management.
* [Renjin](http://www.renjin.org/) by BeDataDrive. JVM. Extensive [test suite](http://packages.renjin.org/). Good discussion at http://www.renjin.org/blog/2013-07-30-deep-dive-vector-pipeliner.html. Along the same lines, but currently less developed in [fastr](https://github.com/allr/fastr). Written by the same group as traceR paper.
* [Riposte](https://github.com/jtalbot/riposte), by Justin Talbot and Zachary DeVito. Experimental VM. (http://www.justintalbot.com/wp-content/uploads/2012/10/pact080talbot.pdf)
These are roughly ordered in from most practical to most ambitious. There is one other project that currently does not provide any performance improvements, but might provide a better foundation for future improvements:
* [CXXR](http://www.cs.kent.ac.uk/projects/cxxr/). Reimplementation of R into clean C++. Not currently faster, but is much more extensible and might form clean foundation for future work. Better documentation of intenrals. http://www.cs.kent.ac.uk/projects/cxxr/pubs/WIREs2012_preprint.pdf. Behaviour identical.
R is a huge language and it's not clear whether any of these approaches will even become mainstream. It's hard task to make sure an alternative backend can run all R code in the same way (or similar enough) to GNU R. The challenge with any rewrite is maintaining compatibility with base R. Can you imagine having to reimplement every function in base R to not only be faster, but to have exactly the same documented bugs? (e.g. `nchar(NA)`).
However, even if these implementations never make a dent in the use of GNU R, they have other benefits:
* Simpler implementations make it easy to validate new approaches before
porting to the GNU R.
* Gain understanding about which currently core features could be changed
with minimal changes to existing code and maximal impact on performance.
* Alternative implementations put pressure on the R Core Team to incorporate
performance improvements.
One of the most important approaches that pqr, renjin and riposte are exploring is the idea of deferred evaluation. As Justin Talbot, the author of riposte, points out "for long vectors, R's execution is completely memory bound. It spends almost all of its time reading and writing vector intermediates to memory". If you can eliminate intermediate vectors, you can not only decrease memory usage, you can also considerably improve performance.
The following example shows a very simple example of where deferred evaluation might help. We have three vectors, `x`, `y`, `z` each containing 1 million elements, and we want to find the average value of `x` + `y` where `z` is TRUE. (This represents a simplification of a pretty common sort of data analysis question.)
```{r}
x <- runif(1e6)
y <- runif(1e6)
z <- sample(c(T, F), 1e6, rep = TRUE)
mean((x + y)[z])
```
In R, this creates two big temporary vectors: `x + y`, 1 million elements long, and `(x + y)[z]`, about 500,000 elements long. This means you need to have extra memory free for the intermediate calculation, and allocating and freeing that memory is relatively expensive so it slows the computation down.
If we rewrote the function using a loop in a language like C++, we could recognise that we only need one intermediate value: the sum of all the values we've seen:
```{r, engine = "Rcpp"}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double cond_mean_cpp(NumericVector x, NumericVector y, LogicalVector z) {
double sum = 0;
int n = x.length();
for(int i = 0; i < n; i++) {
if (!z[i]) continue;
sum += x[i] + y[i];
}
return sum / n;
}
```
On my computer, this approach is about 8 times faster than the vectorised R equivalent (which is already pretty fast).
```{r}
cond_mean_r <- function(x, y, z) {
mean((x + y)[z])
}
microbenchmark(
cond_mean_cpp(x, y, z),
cond_mean_r(x, y, z)
)
```
Riposte, renjin and pqr all provide tools to do this sort of transformation automatically, so you can write concise R code and have it automatically translated into efficient machine code. Sophisticated translators can also figure out how to make the most of multiple cores. In the above example, if you have four cores, you could give 1/4 of the vectors to each core and then add together the results, which should give something pretty close to a 4-fold speed up. Similar tools can also work with naive for loops, discovering operations that can be vectorised.