forked from hadley/adv-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathComputing-on-the-language.rmd
707 lines (510 loc) · 30.6 KB
/
Computing-on-the-language.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
---
title: Metaprogramming
layout: default
---
# Metaprogramming
> "Flexibility in syntax, if it does not lead to ambiguity, would seem a
> reasonable thing to ask of an interactive programming language."
> --- Kent Pitman, <http://www.nhplace.com/kent/Papers/Special-Forms.html>
R has powerful tools for computing not only on values, but also on the actions that lead to those values. These tools are powerful and magical, and one of the most surprising features if you're coming from another programming language. Take the following simple snippet of code that draws a sine curve:
```{r plot-labels}
x <- seq(0, 2 * pi, length = 100)
sinx <- sin(x)
plot(x, sinx, type = "l")
```
Look at the labels on the axes! How did R know that the variable on the x axis was called `x` and the variable on the y axis was called `sinx`? In most programming languages, you can only access values of the arguments provided to functions, but in R you can also access the expression used to compute them. Combined with R's lazy evaluation mechanism this gives function authors considerable power to both access the underlying expression and do special things with it.
Techniques based on these tools are generally called "computing on the language", and in R provide a set of tools with power equivalent to functional and object oriented programming. This chapter will introduce you to the basic ideas of special evaluation, show you how they are used in base R, and how you can use them to create your own functions that save typing for interactive analysis. These tools are very useful for developing convenient user-facing functions because they can dramatically reduce the amount of typing required to specify an action.
Computing on the language is an extremely powerful tool, but it can also create code that is hard for others to understand and is substantially harder to program with. Before you use it, make sure that you have exhausted all other possibilities. You'll also learn about the downsides: because these tools work with expressions rather than values this increases ambiguity in the function call, and makes the function difficult to call from another function.
The following chapters ([expressions](#expressions) and [special environments](#special-environments)), expand on these ideas, discussing the underlying data structures and how you can understand and manipulate them to create new tools.
In this chapter you'll learn:
* how many functions such as `plot()` and `data.frame()` capture the names of the variable supplied to them, and the downsides of this technique.
* Manipulate a data frame by referring to the variables: `subset()`, `transform()`, `plyr::mutate()`, `plyr::arrange()`, `plyr::summarise()`, `with()`
* Work around non-standard evaluation (like lattice functions) with `substitute`.
* Capture an expression for later evaluation: ggplot2 and plyr
* Use formulas to describe computations: `lm()` and the lattice package
## Capturing expressions
The tool that makes non-standard evaluation possible in R is `substitute()`. It looks at a function argument, and instead of seeing the value, it looks to see how the value was computed:
```{r}
f <- function(x) {
substitute(x)
}
f(1:10)
f(x)
f(x + y ^ 2 / z + exp(a * sin(b)))
```
We won't worry yet about exactly what sort of object `substitute()` returns (that's the topic of the [Expressions](#expressions) chapter), but we'll call it an expression. (Note that it's not the same thing as returned by the `expression()` function: we'll call that an expression _object_.)
`substitute()` works because function arguments in R are only evaluated when they are needed, not automatically when the function is called. This means that function arguments are not just a simple value, but instead store both the expression to compute the value and the environment in which to compute it. Together these two things are called a __promise__. Most of the time in R, you don't need to know anything about promises because the first time you access a promise it is seamlessly evaluated, returning its value.
We need one more function if we want to understand how `plot()` and `data.frame()` work: `deparse()`. This function takes an expression and converts it to a character vector.
```{r}
g <- function(x) deparse(substitute(x))
g(1:10)
g(x)
g(x + y ^ 2 / z + exp(a * sin(b)))
```
There's one important caveat with `deparse()`: it can return a multiple strings if the input is long:
```{r, eval = FALSE}
g(a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + p + q + r + s + t + u + v + w + x + y + z)
```
If you need a single string, you can work around this by using the `width.cutoff` argument (which has a maximum value of 500), or by joining the lines back together again with `paste()`.
You might wonder why we couldn't use our original `f()` to compute `g()`. Let's try it:
```{r}
g <- function(x) deparse(f(x))
g(1:10)
g(x)
g(x + y ^ 2 / z + exp(a * sin(b)))
```
This is one of the downsides of functions that use `substitute()`: because they use the expression, not the value, of an argument, it becomes harder to call them from other functions. We'll talk more about this problem and some remedies later on.
There are a lot of functions in base R that use these ideas. Some use them to avoid quotes:
```{r, eval = FALSE}
library(ggplot2)
library("ggplot2")
```
Others use them to provide default labels. For example, `plot.default()` has code that basically does (the real code is more complicated because of the way base plotting methods work, but it's effectively the same):
```{r, eval = FALSE}
plot.default <- function(x, y = NULL, xlabel = NULL, ylabel = NULL, ...) {
...
xlab <- if (is.null(xlabel) && !missing(x)) deparse(substitute(x))
ylab <- if (is.null(xlabel) && !missing(y)) deparse(substitute(y))
...
}
```
If a label is not set and the variable is present, then the expression used to generate the value of `x` is used as a default value for the label on the x axis.
`data.frame()` does a similar thing. It automatically labels variables with the expression used to compute them:
```{r}
x <- 1:4
y <- letters[1:4]
names(data.frame(x, y))
```
This wouldn't be possible in most programming langauges because functions usually only see values (e.g. `1:4` and `c("a", "b", "c", "d")`), not the expressions that created them (`x` and `y`).
## Non-standard evaluation in subset {#non-standard-evaluation}
Just printing out the expression used to generate an argument value is useful, but we can do even more with the unevaluated function. For example, take `subset()`. It's a useful interactive shortcut for subsetting data frames: instead of repeating the data frame you're working with again and again, you can save some typing:
```{r, eval = FALSE}
subset(mtcars, cyl == 4)
# equivalent to:
# mtcars[mtcars$cyl == 4, ]
subset(mtcars, vs == am)
# equivalent to:
# mtcars[mtcars$vs == mtcars$am, ]
```
Subset is special because `vs == am` or `cyl == 4` aren't evaluated in the global environment: instead they're evaluated in the data frame. In other words, `subset()` implements different [scoping](#lexical-scoping) rules so instead of looking for those variables in the current environment, `subset()` looks in the specified data frame. This is called __non-standard evaluation__: you are deliberately breaking R's usual rules in order to do something special.
How does `subset()` work? We've already seen how to capture the expression that represents an argument, rather than its value, so we just need to figure out how to evaluate that expression in the right context: i.e. `cyl` should be interpreted as `mtcars$cyl`. To do this we need `eval()`, which takes an expression and evaluates it in the specified environment.
But before we can do that, we need to learn one more useful function: `quote()`. It's similar to `substitute()` but it always gives you back exactly the expression you entered. This makes it useful for interactive experimentation.
```{r}
quote(1:10)
quote(x)
quote(x + y ^ 2 / z + exp(a * sin(b)))
```
Now let's experiment with `eval()`. If you only provide one argument, it evaluates the expression in the current environment. This makes `eval(quote(x))` exactly equivalent to typing `x`, regardless of what `x` is:
```{r, error = TRUE}
eval(quote(x <- 1))
eval(quote(x))
eval(quote(cyl))
```
Note that `quote()` and `eval()` are basically opposites. In the example below, each `eval()` peels off one layer of quoting.
```{r}
quote(2 + 2)
eval(quote(2 + 2))
quote(quote(2 + 2))
eval(quote(quote(2 + 2)))
eval(eval(quote(quote(2 + 2))))
```
What will this code return?
```{r, eval = FALSE}
eval(quote(eval(quote(eval(quote(2 + 2))))))
```
The second argument to `eval()` controls which environment the code is evaluated in:
```{r}
x <- 10
eval(quote(x))
e <- new.env()
e$x <- 20
eval(quote(x), e)
```
Instead of an environment, the second argument can also be a list or a data frame. This works because an environment is basically a set of mappings between names and values, in the same way as a list or data frame.
```{r}
eval(quote(x), list(x = 30))
eval(quote(x), data.frame(x = 40))
```
This is basically what we want for `subset()`:
```{r}
eval(quote(cyl == 4), mtcars)
eval(quote(vs == am), mtcars)
```
We can combine `eval()` and `substitute()` together to write `subset()`: we can capture the call representing the condition, evaluate it in the context of the data frame, and then use the result for subsetting:
```{r}
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x)
x[r, ]
}
subset2(mtcars, cyl == 6)
```
When you first start using `eval()` it's easy to make mistakes. Here's a common one: forgetting to quote the input:
```{r, error = TRUE}
eval(cyl, mtcars)
# Carefully look at the difference to this error
eval(quote(cyl), mtcars)
```
### Exercises
* The real subset function (`subset.data.frame()`) does two other things to the result. What are they?
* The other component of the real subset function is variable selection. It allows you to work with variable names like they are positions, so you can do things like `subset(mtcars, , -cyl)` to drop the cylinder variable, or `subset(mtcars, , disp:drat)` to select all the variables between `disp` and `drat`. How does select work? I've made it easier to understand by extracting it out into its own function:
```{r}
select <- function(df, vars) {
vars <- substitute(vars)
var_pos <- setNames(as.list(seq_along(df)), names(df))
pos <- eval(vars, var_pos)
df[, pos, drop = FALSE]
}
select(mtcars, -cyl)
```
* What does `evalq()` do? Use it to reduce the amount of typing for the examples above that use both `eval()` and `quote()`
* Write an alternative to `<-` that never overrides an existing binding (refer back to [bindings](#binding) for more details). This would be useful if you are running a test script multiple times and only want to generate the test data once.
## Scoping issues
While it certainly looks like our `subset2()` function works, whenever we're working with expressions instead of values, we need to test a little more carefully. For example, you might expect that the following uses of `subset2()` should all return the same value because each variable refers to the same value:
```{r, error = TRUE}
y <- 4
x <- 4
condition <- 4
condition_call <- 4
subset2(mtcars, cyl == 4)
subset2(mtcars, cyl == y)
subset2(mtcars, cyl == x)
subset2(mtcars, cyl == condition)
subset2(mtcars, cyl == condition_call)
```
What's going wrong? You can get a hint from the variable names I've chosen: they are all variables defined inside `subset2()`. It seems like if `eval()` can't find the variable inside the data frame (it's second argument), it's looking in the function environment. That's obviously not what we want, so we need some way to tell `eval()` to look somewhere else if it can't find the variables in the data frame.
The key is the third argument: `enclos`. This allows us to specify the parent (or enclosing) environment for objects that don't have one like lists and data frames (`enclos` is ignored if we pass in a real environment). The `enclos`ing environment is where any objects that aren't found in the data frame will be looked for. By default it uses the environment of the current function, which is not what we want.
We want to look for `x` in the environment in which `subset` was called. In R terminology this is called the __parent frame__ and is accessed with `parent.frame()`. This is an example of [dynamic scope](http://en.wikipedia.org/wiki/Scope_%28programming%29#Dynamic_scoping). With this modification our function works:
```{r}
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
r <- eval(condition_call, x, parent.frame())
x[r, ]
}
x <- 4
subset2(mtcars, cyl == x)
```
Using `enclos` is just a short cut for converting a list or data frame to an environment with the desired parent yourself. We can use the `list2env()` to turn a list into an environment and explicitly set the parent ourselves:
```{r}
subset2 <- function(x, condition) {
condition_call <- substitute(condition)
env <- list2env(x, parent = parent.frame())
r <- eval(condition_call, env)
x[r, ]
}
x <- 4
subset2(mtcars, cyl == x)
```
When evaluating code in a non-standard way, it's also a good idea to test your code works when run outside of the global environment:
```{r}
f <- function() {
x <- 6
subset(mtcars, cyl == x)
}
f()
```
And indeed it now works.
### Exercises
* `plyr::arrange()` works similarly to `subset()`, but instead of selecting rows, it reorders them. How does it work? What does `substitute(order(...))` do?
* What does `transform()` do? (Hint: read the documentation). How does it work? (Hint: read the source code for `transform.data.frame`) What does `substitute(list(...))` do? (Hint: create a function that does only that and experiment with it).
* `plyr::mutate()` is similar to `transform()` but it applies the transformations sequentially so that transformation can refer to columns that were just created:
```{r, eval = FALSE}
df <- data.frame(x = 1:5)
transform(df, x2 = x * x, x3 = x2 * x)
plyr::mutate(df, x2 = x * x, x3 = x2 * x)
```
How does mutate work? What's the key difference between mutate and transform?
* What does `with()` do? How does it work? (Read the source code for `with.default()`)
* What does `within()` do? How does it work? (Read the source code for `within.data.frame()`). What makes the code so much more complicated than `with()`?
## Calling from another function
Typically, computing on the language is most useful for functions called directly by the user, not by other functions. While `subset` saves typing, it has one big disadvantage: it's now difficult to use non-interactively, e.g. from another function. For example, you might try using `subset()` from within a function that is given the name of a variable and it's desired value:
```{r}
colname <- "cyl"
val <- 6
subset(mtcars, colname == val)
# Zero rows because "cyl" != 6
```
Or imagine we want to create a function that randomly reorders a subset of the data. A nice way to write that function would be to write a function for random reordering and a function for subsetting (that we already have!) and combine the two together. Let's try that:
```{r}
scramble <- function(x) x[sample(nrow(x)), ]
subscramble <- function(x, condition) {
scramble(subset(x, condition))
}
```
But when we run that we get:
```{r, error = TRUE}
subscramble(mtcars, cyl == 4)
# Error in eval(expr, envir, enclos) : object 'cyl' not found
traceback()
# 5: eval(expr, envir, enclos)
# 4: eval(condition_call, x)
# 3: subset(x, condition)
# 2: scramble(subset(x, condition))
# 1: subscramble(mtcars, cyl == 4)
```
What's gone wrong? To figure it out, lets `debug()` subset and work through the code line-by-line:
```{r, eval = FALSE}
> debugonce(subset)
> subscramble(mtcars, cyl == 4)
debugging in: subset(x, condition)
debug: {
condition_call <- substitute(condition)
r <- eval(condition_call, x)
x[r, ]
}
Browse[2]> n
debug: condition_call <- substitute(condition)
Browse[2]> n
debug: r <- eval(condition_call, x)
Browse[2]> condition_call
condition
Browse[2]> eval(condition_call, x)
Error in eval(expr, envir, enclos) : object 'cyl' not found
Browse[2]> condition
Error: object 'cyl' not found
In addition: Warning messages:
1: restarting interrupted promise evaluation
2: restarting interrupted promise evaluation
```
Can you see what the problem is? `condition_call` contains the expression `condition` so when we try to evaluate that it evaluates `condition` which has the value `cyl == 4`. This can't be computed in the parent environment because it doesn't contain an object called `cyl`. If `cyl` is set in the global environment, far more confusing things can happen:
```{r}
cyl <- 4
subscramble(mtcars, cyl == 4)
cyl <- sample(10, 100, rep = T)
subscramble(mtcars, cyl == 4)
```
This is an example of the general tension between functions that are designed for interactive use, and functions that are safe to program with. A function that uses `substitute()` might save typing, but it's difficult to call from another function. As a developer you should also provide an alternative version that works when passed a quoted expression. For example, we could rewrite:
```{r}
subset2_q <- function(x, condition) {
r <- eval(condition, x, parent.frame())
x[r, ]
}
subset2 <- function(x, condition) {
subset2_q(x, substitute(condition))
}
subscramble <- function(x, condition) {
condition <- substitute(condition)
scramble(subset2_q(x, condition))
}
```
I usually suffix these functions with `q` to indicate that they take a quoted call. Most users won't need them so the name can be a little longer.
You might wonder why the function couldn't do this automatically:
```{r, eval = FALSE}
subset <- function(x, condition) {
if (!is.call(condition)) {
condition <- substitute(condition)
}
r <- eval(condition, x)
x[r, ]
}
subset(mtcars, quote(cyl == 4))
subset(mtcars, cyl == 4)
```
But hopefully a little thought, or maybe some experimentation, will show why this doesn't work.
## Substitute
Following the examples above, whenever you write your own functions that use non-standard evaluation, you should always provide alternatives that others can use. But what happens if you want to call a function that uses non-standard evaluation and doesn't have a form that takes expressions? For example, imagine you want to create a lattice graphic given the names of two variables:
```{r, error = TRUE}
library(lattice)
xyplot(mpg ~ disp, data = mtcars)
x <- quote(mpg)
y <- quote(disp)
xyplot(x ~ y, data = mtcars)
```
Again, we can turn to substitute and use it for another purpose: modifying expressions. So far we've just used `substitute()` to capture the unevaluated expression associated with arguments, but it can actually do much much more, and is very useful for manipulating expressions in general.
Unfortunately `substitute()` has a "feature" that makes experimenting with it interactively a bit of a pain: it never does substitutions when run from the global environment, and just behaves like `quote()`:
```{r, eval = FALSE}
a <- 1
b <- 2
substitute(a + b + x)
#> a + b + x
```
But if we run it inside a function, `substitute()` substitutes what it can and leaves everything else the same:
```{r}
f <- function() {
a <- 1
b <- 2
substitute(a + b + x)
}
f()
```
To make it easier to experiment with `substitute()`, `pryr` provides the `subs()` function. It works exactly the same way as `substitute()` except it has a shorter name and if the second argument is the global environment it turns it into a list. Together, this makes it much easier to experiment with substitution:
```{r}
library(pryr)
subs(a + b + x)
```
The second argument (to both `sub()` and `substitute()`) can override the use of the current environment, and provide an alternative list of name-value pairs to use. The following example uses that technique to show some variations on substituting a string, variable name or function call:
```{r}
subs(a + b, list(a = "y"))
subs(a + b, list(a = quote(y)))
subs(a + b, list(a = quote(y())))
```
Remember that every action in R is a function call, so we can also replace `+` with another function:
```{r}
subs(a + b, list("+" = quote(f)))
subs(a + b, list("+" = quote(`*`)))
```
Note that it's quite possible to make nonsense commands with `substitute`:
```{r}
subs(y <- y + 1, list(y = 1))
```
And you can use substitute to insert any arbitrary object into an expression. This is technically ok, but often results in surprising and undesirable behaviour. In the example below, the expression we create doesn't print correctly, but it returns the correct result when we evaluate it:
```{r}
df <- data.frame(x = 1)
(x <- subs(class(df)))
eval(x)
```
Formally, substitution takes place by examining each name in the expression. If the name refers to:
* an ordinary variable, it's replaced by the value of the variable.
* a promise, it's replaced by the expression associated with the promise.
* `...`, it's replaced by the contents of `...` (only if the substitution occurs in a function)
Otherwise the name is left as is.
We can use this to create the right call to `xyplot`:
```{r}
x <- quote(mpg)
y <- quote(disp)
subs(xyplot(x ~ y, data = mtcars))
```
It's even simpler inside a function, because we don't need to explicitly quote the x and y variables. Following the rules above, `substitute()` replaces named arguments with their expressions, not their values:
```{r}
xyplot2 <- function(x, y, data = data) {
substitute(xyplot(x ~ y, data = data))
}
xyplot2(mpg, disp, data = mtcars)
```
If we include `...` in the call to substitute, we can add additional arguments to the call:
```{r}
xyplot3 <- function(x, y, ...) {
substitute(xyplot(x ~ y, ...))
}
xyplot3(mpg, disp, data = mtcars, col = "red", aspect = "xy")
```
### Non-standard evaluation in substitute
One application of this idea is to make a version of `substitute` that evaluates its first argument (i.e. a version that uses standard evaluation). Note the following example:
```{r}
x <- quote(a + b)
substitute(x, list(a = 1, b = 2))
```
Instead we can use `pryr::substitute2`:
```{r}
x <- quote(a + b)
substitute2(x, list(a = 1, b = 2))
```
The implementation of `substitute2` is short, but deep:
```{r}
substitute2 <- function(x, env) {
call <- substitute(substitute(y, env), list(y = x))
eval(call)
}
```
Let's work through the example above: `substitute2(x, list(a = 1, b = 2))`. It's a little tricky because of `substitute()`'s non-standard evaluation rules, we can't use the usual technique of working through the parentheses inside-out.
1. First `substitute(substitute(y, env), list(y = x))` is evaluated. The first argument is specially evaluated in the environment containing only one item, the value of `x` with the name `y`. Because we've put `x` inside a list, it will be evaluated and the rules of substitute will replace `y` with it's value. This yields the expression `substitute(a + b, env)`
2. Next we evaluate that expression inside the current function. `substitute()` specially evaluates its first argument, and looks for name value pairs in `env`, which evaluates to `list(a = 1, b = 2)`. Those are both values (not promises) so the result will be `a + b`
### Capturing unevaluated ... {#capturing-dots}
Another frequently useful technique is to capture all of the unevaluated expressions in `...`. Base R functions do this in many ways, but there's one technique that works well in a wide variety of situations:
```{r}
dots <- function(...) {
eval(substitute(alist(...)))
}
```
This uses the `alist()` function which simply captures all its arguments. This function is the same as `pryr::dots()`, and pryr also provides `pryr::named_dots()`, which ensures all arguments are named, using the deparsed expressions as default names.
## The downsides of non-standard evaluation
There are usually two principles you can follow when modelling the evaluation of R code:
* If the underlying values are the same, the results will be the same. i.e. the three results will all be the same:
```{r, eval = FALSE}
x <- 10; y <- 10
f(10); f(x); f(y)
```
* You can model evaluation by working from the innermost parentheses to the outermost.
Non-standard evaluation can break both principles. This makes the mental model needed to correctly predict the output much more complicated, so it's only worthwhile to do so if there is significant gain.
For example, `library()` and `require()` allow you to call them either with or without quotes, because internally they use `deparse(substitute(x))` plus a couple of tricks. That means that these two lines do exactly the same thing:
```{r, eval = FALSE}
library(ggplot2)
library("ggplot2")
```
However, things start to get complicated if the variable has a value. What do you think the following lines of code will do?
```{r, eval = FALSE}
ggplot2 <- "plyr"
library(ggplot2)
```
It loads ggplot2, not plyr. If you want to load plyr (the value of the ggplot2 variable), you need to use an additional argument:
```{r, eval = FALSE}
library(ggplot2, character.only = TRUE)
```
Using an argument to change the behaviour of another argument is not a great idea because it means you must completely and carefully read all of the function arguments to understand what one function argument means. You can't understand the effect of each argument in isolation, and hence it's harder to reason about the results of a function call.
There are a number of other R functions that use `substitute()` and `deparse()` in this way: `ls()`, `rm()`, `data()`, `demo()`, `example()`, `vignette()`. These all use non-standard evaluation and then have special ways of enforcing the usual rules. To me, eliminating two quotes is not worth the cognitive cost of non-standard evaluation, and I don't recommend you use `substitute()` for this purpose.
One situtation where non-standard evaluation is more useful is `data.frame()`, which uses the input expressions to automatically name the output variables if not otherwise provided:
```{r}
x <- 10
y <- "a"
df <- data.frame(x, y)
names(df)
```
I think it is worthwhile in `data.frame()` because it eliminates a lot of redundancy in the common scenario when you're creating a data frame from existing variables, and importantly, it's easy to override this behaviour by supplying names for each variable.
The code for `data.frame()` is rather complicated, but we can create our own simple version for lists to see how a function that does this might work. The key is `pryr::named_dots()`, a function which returns the unevaluated ... arguments, with default names. Then it's just a matter of arranging the evaluated results in a list:
```{r}
list2 <- function(...) {
dots <- named_dots(...)
lapply(dots, eval, parent.frame())
}
x <- 1; y <- 2
list2(x, y)
list2(x, z = y)
```
## Applications
To show how I've used some of these ideas in practice, the following two sections show applications of non-standard evaluation to plyr and ggplot2.
### `plyr::.` and `ggplot2::aes`
Both plyr and ggplot2 have ways of capturing what you want to do, and then performing that action later. ggplot2 uses the `aes()` to define a set of mappings between variables in your data and visual properties on your graphic. plyr uses the `.` function to capture the names (or more complicated expressions) of variables used to split a data frame into pieces. Let's look at the code:
```{r}
. <- function (..., .env = parent.frame()) {
structure(
as.list(match.call()[-1]),
env = .env,
class = "quoted"
)
}
aes <- function (x = NULL, y = NULL, ...) {
aes <- structure(
as.list(match.call()[-1]),
class = "uneval")
class(aes) <- "uneval"
ggplot2:::rename_aes(aes)
}
```
Both functions were written when I didn't know so much about non-standard evaluation, and if I was to write them today, I'd use the `dots()` helper function I showed previously. I'd also think more about the environment in which the results of `aes()` should be evaluated, and how that integrates with ggplot2's rules for aesthetic mapping inheritance. That's a bit murky at the moment and leads to confusion when creating complex graphics across multiple functions.
ggplot2 and plyr provide slightly different ways to use standard evaluation so that you can refer to variables by reference. ggplot2 provides `aes_string()` which allows you to specify variables by the string representation of their name, and plyr uses S3 methods so that you can either supply an object of class quoted (as created with `.()`), or a regular character vector.
### Plyr: summarise, mutate and arrange
The plyr package also uses non-standard evaluation to complete the set of tools provided by the base `subset()` and `transform()` functions with `mutate()`, `summarise()` and `arrange()`. Each of these functions has the same interface: the first argument is a data frame and the subsequent arguments are evaluated in the context of that data frame (i.e. they look there first for variables, and then in the current environment) and they return a data frame.
The following code shows the essence of how these four functions work:
```{r}
subset2 <- function(.data, subset) {
sub <- eval(substitute(subset), .data, parent.frame())
sub <- sub & !is.na(sub)
.data[sub, , drop = FALSE]
}
arrange2 <- function (.data, ...) {
ord <- eval(substitute(order(...)), .data, parent.frame())
.data[ord, , drop = FALSE]
}
mutate2 <- function(.data, ...) {
cols <- named_dots(...)
data_env <- eval_df(.data, parent.frame(), cols)
out_cols <- union(names(.data), names(cols))
quickdf(mget(out_cols, data_env))
}
summarise2 <- function (.data, ...) {
cols <- named_dots(...)
data_env <- eval_df(.data, parent.frame(), cols)
quickdf(mget(names(cols), env))
}
eval_df <- function(data, env, expr) {
data_env <- list2env(data, parent = env)
for(nm in names(exprs)) {
data_env[[nm]] <- eval(data_env[[nm]], env)
}
data_env
}
```
You might be surprised to see the for loops in `eval_df`, but they are necessary because the computation of one variable might depend on the results of previous variables (this is the key difference between `mutate()` and `transform()`).
Combined with a by operator (e.g. `ddply()`) these four functions allow you to express the majority of data manipulation operations. Then when you have a new problem, solving it becomes a matter of thinking about which operations you need to apply and in what order. The realm of possible actions has been shrunk to a manageable number.
## Conclusion
Now that you understand how our version of subset works, go back and read the source code for `subset.data.frame`, the base R version which does a little more. Other functions that work similarly are `with.default`, `within.data.frame`, `transform.data.frame`, and in the plyr package `.`, `arrange`, and `summarise`. Look at the source code for these functions and see if you can figure out how they work.