fasteR.html

<!DOCTYPE html>
<html>
  <head>
    <title>fasteR</title>
    <meta charset="utf-8">
    <meta name="author" content="Thomas Lumley" />
    <meta name="date" content="2018-07-10" />
    <link href="fasteR_files/remark-css-0.0.1/default.css" rel="stylesheet" />
    <link href="fasteR_files/remark-css-0.0.1/default-fonts.css" rel="stylesheet" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# <em>fasteR</em>
### Thomas Lumley
### 10 July 2018

---


### Who am I?

- Biostatistician
- used R since 1996
   - simulations
   - air pollution data
   - genomic data
   - big surveys
- wrote the R memory profiler


---

### The Weka wants you to ask questions!

&lt;img src="weka-low.jpg" width=400&gt;

---

## Slow R is slow

R is slow because

- it's flexible: things can be redefined dynamically
- it passes arguments by value
- the bottlenecks in hardware have changed since R started. 
- R Core don't have the resources to do optimisations that break lots of  things.


---

## Speeding up R code

**Syllabus**

- Timing and profiling
- Memory allocation
- Matrices and lists vs data frames
- Slow functions to watch out for
- Vectorisation (and memory tradeoffs)
- Embarassingly Parallel-isation
- The right tools: databases, netCDF, sparse matrices

**Epilogue**

Randomness: SSVD, subsampling

---


## Old White Guys

&gt; *We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.* (Knuth)

&gt; *Life is short, the craft long; opportunity fleeting, experiment perilous, judgement difficult* (Hippocrates of Kos)

&gt; *R has changed quite a lot recently, and older preconceptions do need to be checked against current information.* (Brian D. Ripley)


---

## Strategy

&lt;img src="faster-graph.png"&gt;


---

## Strategy

![](xkcd-optimise.png)

https://xkcd.com/1205/
---

## Timing and profiling

*Measure* your code to find out what needs to be optimised -- if anything.

**Tools**: `Rprof`, `Rprofmem`, `tracemem`, `profvis`, `microbenchmark`

### Note: the compiler

In recent R versions, the Just-In-Time compiler defaults to 'on' and code gets faster as you run it. 

- `compiler:::enableJIT(0)` to turn it off
- or run a few times before timing. 

---

## Rprof() and profvis

A time-sampling profiler: takes notes on the state of R every so often (50ms default)

- Stack trace
- Optionally, current memory usage and number of vector duplications

`profvis` is a package for displaying `Rprof()` output better

---

## Rprofmem()

A memory-allocation profiler: writes a stack trace at every

- allocation of a 'large' object (&gt;128bytes *[sic]*)
- allocation of a page on the R heap

Most useful if you need to trace allocations of a particular **size** (eg columns of your data)

---

## tracemem()

Marks an object so that a message is printed when the object is duplicated.

Useful for a single big object where duplications hurt.

---

## microbenchmark()

High-resolution timing for small bits of code

- useful for learning about principles
- typically not relevant for optimising real code


---

## "Growing" objects

Dumb example

```
&gt; system.time({
+ 	x&lt;-integer(0)
+ 	for(i in 1:1e5) 
+ 		x&lt;-c(x,i^2)
+ })
   user  system elapsed 
 19.601   4.804  24.644 
```

---

## Preallocating

Dumb example

```
&gt; system.time({
+ 	x&lt;-integer(1e5)
+ 	for(i in 1:1e5) 
+ 		x[i]&lt;-i^2
+ })
   user  system elapsed 
  0.047   0.011   0.058 
```

### (Vectorising)

```
&gt; microbenchmark(x&lt;-(1:1e5)^2)
Unit: microseconds
             expr     min      lq     mean  median      uq      max neval
 x &lt;- (1:1e+05)^2 268.401 325.831 1037.601 416.456 769.639 36868.98   100
```

---

## Real example

- Computing accurate `\(p\)`-values for DNA-sequence association tests
- need  `\(\sum_i \lambda_i \chi^2_1\)` distributions
- algorithm using symbolic operations on a basis of gamma functions [Bausch, 	arXiv:1208.2691]

---

## Old (90% of time in "c")

```
  for(i in 1:length(x@power)){
         for(j in 1:length(y@power)){
             if (x@exp[i]==y@exp[j]) {cat("#");next}
             term&lt;-convone(x@exp[i],x@power[i],y@exp[j],y@power[j])
*             allcoef&lt;-c(allcoef,term@coef*x@coef[i]*y@coef[j])
*             allpower&lt;-c(allpower,term@power)
*             allexp&lt;-c(allexp,term@exp)
         }
     }
     
     new("gammaconv", coef=allcoef,power=allpower,exp=allexp)
```

---

## New

```
    for(i in 1:length(x@power)){
        for(j in 1:length(y@power)){
            term&lt;-convone(x@exp[i],x@power[i],y@exp[j],y@power[j])
*            size&lt;-nrow(term)
            if(here+size&gt;maxterms) stop("thomas can't count")
*            allcoef[here+(1:size)]&lt;-term@coef*x@coef[i]*y@coef[j]
*            allpower[here+(1:size)]&lt;-term@power
*            allexp[here+(1:size)]&lt;-term@exp
*            here&lt;-here+size
        }
    }
    
    new("gammaconv", coef=allcoef[1:here],power=allpower[1:here],
      exp=allexp[1:here])
```

Or preallocate a list, put short vectors in as elements, use `do.call(c, the_list)` at the end

---

## Memory copying

- "Pass by value illusion": R behaves as if functions get **copies** of their arguments
- Actually not copied unless modified
- *Sometimes* not copied, if R *knows* it doesn't need to.
    - .Primitive operations 
    - on local variables 
    - that have never been passed to a function

---

R doesn't know `x` is safe

```
&gt; system.time({
*+ 	touch&lt;-function(z) {force(z); NULL}
+ 	x&lt;-integer(1e5)
+ 	y&lt;-integer(1e5)
+ 	for(i in 1:1e5){ 
*+ 		touch(x)
+ 		x[i]&lt;-i^2
+ 		}
+ })
   user  system elapsed 
 36.195  10.854  47.061 
```

---

R does know `x` is safe

```
&gt; system.time({
*+ 	touch&lt;-function(z) {force(z); NULL}
+ 	x&lt;-integer(1e5)
+ 	y&lt;-integer(1e5)
+ 	for(i in 1:1e5){ 
*+ 		touch(y)
+ 		x[i]&lt;-i^2
+ 		}
+ })
   user  system elapsed 
  0.166   0.025   0.190 
```

---

## Dataframes are slower

```
&gt; dim(acsdf)
[1] 151885    298
&gt; microbenchmark(sum(acsdf[,100]))
Unit: microseconds
              expr   min     lq     mean median    uq    max neval
 sum(acsdf[, 100]) 7.654 8.6315 11.29369  9.117 9.784 75.909   100
&gt; microbenchmark(sum(acslist[[100]]))
Unit: nanoseconds
                expr min    lq   mean median    uq   max neval
 sum(acslist[[100]]) 330 340.5 740.67    447 561.5 10817   100
```

---

## Dataframes are slower


```
&gt; str(sequence)
 num [1:5000, 1:4028] 0 0 0 0 0 0 0 0 0 0 ...
&gt; system.time(colMeans(sequence))
   user  system elapsed 
  0.026   0.000   0.026 
&gt; system.time(colMeans(sequencedf))
   user  system elapsed 
  0.186   0.059   0.246 
&gt; microbenchmark(sequence[200:250,200:250])
Unit: microseconds
                       expr
 sequence[200:250, 200:250]
              min     lq     mean  median      uq    max neval
              8.938 9.4135 11.76206 10.0835 10.4745 34.582   100
&gt; microbenchmark(sequencedf[200:250,200:250])
Unit: microseconds
                         expr     
 sequencedf[200:250, 200:250] 
             min      lq     mean  median       uq      max neval
             437.748 484.339 1565.898 567.917 666.1715 95096.32   100
  
```
---

## Other alternatives

`data.table` and `tbl` are both faster. 

They both claim to be drop-in replacements for `data.frame`. 

Neither actually is, but they're *close*, and if you're editing the code anyway...


---

## Encodings

There's no such thing as plain text

- UTF-8
- Latin1
- *your local encoding*
- plain bytes

UTF-8 is flexible, general, and space-efficient, but has variable character lengths.

---


```r
library(microbenchmark)
load("encoding.rda")
substr(xx,1,10)
```

```
## [1] "façilefaçi"
```

```r
substr(yy,1,10)
```

```
## [1] "façilefaçi"
```

```r
substr(zz,1,10)
```

```
## [1] "fa\\xe7ilefa\\xe7i"
```

```r
c(Encoding(xx), Encoding(yy),Encoding(zz))
```

```
## [1] "latin1" "UTF-8"  "bytes"
```

---


```r
microbenchmark(grep("user!", xx))
```

```
## Unit: milliseconds
##               expr      min       lq     mean   median       uq      max
##  grep("user!", xx) 2.056194 2.099627 2.393907 2.148303 2.432823 4.482882
##  neval
##    100
```

```r
microbenchmark(grep("user!", yy))
```

```
## Unit: milliseconds
##               expr      min       lq     mean   median      uq      max
##  grep("user!", yy) 1.575402 1.704225 2.477021 1.745132 1.94023 53.87471
##  neval
##    100
```

```r
microbenchmark(grep("user!", zz))
```

```
## Unit: microseconds
##               expr     min       lq     mean median       uq      max
##  grep("user!", zz) 771.832 774.4955 898.7729 834.89 1019.222 1483.425
##  neval
##    100
```

---


```r
microbenchmark(substr(xx,1000,1100))
```

```
## Unit: microseconds
##                    expr   min     lq    mean median     uq    max neval
##  substr(xx, 1000, 1100) 3.556 4.5345 5.03828  4.606 4.7285 32.118   100
```

```r
microbenchmark(substr(yy,1000,1100))
```

```
## Unit: microseconds
##                    expr   min     lq    mean median    uq    max neval
##  substr(yy, 1000, 1100) 8.938 9.0155 9.26516   9.07 9.128 25.234   100
```

```r
microbenchmark(substr(zz,1000,1100))
```

```
## Unit: microseconds
##                    expr   min     lq    mean median     uq   max neval
##  substr(zz, 1000, 1100) 3.094 3.1695 3.56293 3.2425 3.4175 22.26   100
```

---

## Matrices: BLAS

The *Basic Linear Algebra Subsystem* has the elementary matrix and vector operations in standardised form

- Matrix operations in R are already in efficient C
- Data flow onto the CPU is main bottleneck
- Using an optimised version of BLAS can still help **a lot**
    - Apple vecLib,  Intel MKL, OpenBLAS
- Don't write your own linear algebra, let the professionals do it.    

---

### Reference BLAS

```
x&lt;-matrix(rnorm(1000*2000),ncol=1000)
&gt; system.time(m&lt;-crossprod(x))
   user  system elapsed 
  1.040   0.002   1.043 
&gt; system.time(solve(m))
   user  system elapsed 
  1.116   0.008   1.131 
```

### Apple vecLib BLAS

```
&gt; x&lt;-matrix(rnorm(1000*2000),ncol=1000)
&gt; system.time(m&lt;-crossprod(x))
   user  system elapsed 
  0.399   0.007   0.112 
&gt; system.time(solve(m))
   user  system elapsed 
  0.244   0.014   0.105 
```

---

## Slow functions

`ifelse()`, `pmax()`, `pmin()`

Why? Flexibility (eg, what is the output type?)

Instead of 

```
x &lt;- ifelse(y, a, b)
```

try

```
x&lt;-a
x[y]&lt;-b[y]
```

---

### Example

```
x&lt;-1:1000
z&lt;-1000:1

microbenchmark(ifelse(x+z&gt;1000,x,z))
microbenchmark({i&lt;-(x+z&gt;1000); y&lt;-x;y[i]&lt;-z[i]})

x&lt;-rnorm(10000)
microbenchmark(pmax(0,pmin(x,1)))
microbenchmark(
{i&lt;-x&gt;1
x[i]&lt;-1
i&lt;-x&lt;0
x[i]&lt;-0}
)
```

---

## Slow functions

`lm`, `glm` are user-friendly wrappers for `lm.fit`, `lm.wfit`, `glm.fit`

Can be worth constructing the design matrix yourself if

- it's big
- there are a lot of them and they're simple
- you know enough linear algebra to use the functions

For **very** large cases, try the `biglm` package.

---

## NA, NaN, Inf

```
&gt; x&lt;-matrix(rnorm(1e7),ncol=1e3)
&gt; system.time(cor(x))
   user  system elapsed 
  6.255   0.022   6.281 
&gt; system.time(crossprod(scale(x)))
   user  system elapsed 
  1.377   0.195   1.049 
```

With no allowance for missing data these would be the same computation.

R can't assume complete data. **You** may be able to.


---

## Vectorisation

Try to use vector and matrix operations in R's internals, even if it's theoretically inefficient

&lt;img src="vectorise.jpeg" height=300&gt;

---

### Convolution

`$$z_i=\sum_{j+k=i}x_iy_i$$`

If `\(x_i=\Pr(X=i)\)` and `\(y_j=\Pr(Y=j)\)`, then `\(z_i=\Pr(X+Y=i)\)`

```
conv1&lt;-function(x,y){
m &lt;- length(x)
n &lt;- length(y)
z &lt;- numeric(m+n-1)
for(j in 1:m){
	for (k in 1:n){
		z[j+k-1] &lt;- z[j+k-1] + x[j]*y[k]
		}
	}
	z
}	
```


---


## Partly vectorised	
	
```
conv2&lt;-function(x,y){
m &lt;- length(x)
n &lt;- length(y)
z &lt;- numeric(m+n-1)
for(j in 1:m){
		z[j+(1:n)-1] &lt;- z[j+(1:n)-1] + x[j]*y[(1:n)]
	}
	z
}
```


---

### Fully vectorised

```
conv3&lt;-function(x,y){
  m &lt;- length(x)
  n &lt;- length(y)

  xy&lt;-outer(x,y,"*")
  xyshift&lt;-matrix(rbind(xy,matrix(0,n,n))[1:((m+n-1)*n)],ncol=n)
  rowSums(xyshift)
}
```


---

### Timings

```
&gt; system.time(conv1(x,y))
   user  system elapsed 
 19.440   0.159  19.585 
&gt; system.time(conv2(x,y))
   user  system elapsed 
  0.533   0.116   0.650 
&gt; system.time(conv3(x,y))
   user  system elapsed 
  0.876   0.051   0.891 
&gt; compiler::enableJIT(3)
[1] 0
*&gt; system.time(conv1(x,y))
*   user  system elapsed 
*  1.722   0.065   1.795 
&gt; system.time(conv2(x,y))
   user  system elapsed 
  0.518   0.164   0.684 
&gt; system.time(conv3(x,y))
   user  system elapsed 
  0.848   0.051   0.863 
```

Your mileage may vary: try it

---

### Example: power calculation

Power for detecting a 2mmHg blood pressure change with standard deviation 7mmHg

```
system.time({
my.sims &lt;- replicate(10000, {
    mydiff &lt;- rnorm(100, 2, 7)
    t.test(mydiff)$p.value
})
})
```

**How do we speed it up?**

### talk to your neighbour

---


## Example: genetic permutation test 

10 genetic variants (SNPs) in a gene

Regress blood pressure on each one (adjusted for age, sex)

Is largest `\(Z\)`-statistic interestingly large? For genome-wide values of interesting?

---

 ### Simple code

```
one.snp = function(snp, perm, df){
    coef(summary(lm(sbp~as.numeric(snp)[perm]+sex,data=df)))[2,3]
}

one.gene&lt;-function(snps, perm, df){
    p&lt;-ncol(snps)
    zs&lt;-sapply(1:p, function(i) one.snp(snps[,i], perm, df))
    max(abs(zs))
}

one.perm&lt;-function(snps, df){
    n&lt;-nrow(snps)
    one.gene(snps,perm=sample(1:n), df)
}

many.perm &lt;- replicate(1000, one.perm(snps,  phenotype))

real.max.Z &lt;- one.gene(snps,1:nrow(snps),phenotype)
mean(many.perm &lt; real.max.Z)
```

---

## Where will it be slow?

### (talk to your neighbour)


---

## Parallel

Relatively unusual in stats for explicit parallel processing to be *faster* than just running lots of copies of R (can be simpler)

- if parallel section is short, explicit parallelisation may be less expensive (lower average load)
- **shared memory** may allow more R processes to fit in memory

---

```
x&lt;-matrix(rnorm(1e8),ncol=4)
y&lt;-rnorm(nrow(x))
&gt; system.time(mclapply(1:4, function(i) { cor(x[,i],y)},mc.cores=4))
   user  system elapsed 
  3.024   1.192   1.142 
&gt; system.time(lapply(1:4, function(i) { cor(x[,i],y)}))
   user  system elapsed 
  1.797   0.272   2.077 
```

Common tradeoff: more CPU-seconds used, less elapsed time.

---

## Variations

Redoing setup work

```
&gt; system.time(mclapply(1:4, 
+   function(i) {
+      set.seed(1) 
+      x&lt;-matrix(rnorm(1e8),ncol=4)
+      y&lt;-rnorm(nrow(x))
+      cor(x[,i],y)
+      },mc.cores=4))

   user  system elapsed 
 34.158   7.689  32.162 
``` 

---

Forcing memory duplication

```
x&lt;-matrix(rnorm(1e8),ncol=4)
y&lt;-rnorm(nrow(x))
&gt; system.time(mclapply(1:4, 
   function(i) {x[1]&lt;-x[1]
      cor(x[,i],y)
    },mc.cores=4))
   user  system elapsed 
  3.100   2.138   4.843 
```

---


## Data storage

- large data frames in database tables
- large arrays in netCDF or similar
- large sparse matrices in sparse formats


---

### Database storage

The American Community Survey five-year person file

- 16GB of CSVs
- over 8 million records
- won't fit in R on laptop

Use MonetDB for simple database analyses, behind `dbplyr`

(Data reading took 5-10 minutes)

---

## Example

```
library(MonetDBLite)
library(dplyr)
library(dbplyr)

ms &lt;- MonetDBLite::src_monetdblite("~/acs")
acstbl&lt;-tbl(ms, "acs")
acstbl %&gt;% summarise(count(agep))
acstbl %&gt;% group_by(st) %&gt;% summarise(avg(agep))

acstbl %&gt;% group_by(st) %&gt;% summarise(sum(pwgtp*agep*1.0)/sum(pwgtp))
```

MonetDB is optimised for this sort of computation on a single computer: advantageous when data size ~ 1/3 physical memory

---

## Sparse matrices

In some applications, there are large matrices with mostly zero entries. 

The zeroes don't need to be stored: this can help a lot with memory use and matrix multiplication speed.  

Support in the `Matrix` package

Example: (simulated) human DNA sequence data: 5000 people, 4028 variants

---

```
&gt; load("AKLfaster/sequence.rda")
&gt; str(sequence)
 num [1:5000, 1:4028] 0 0 0 0 0 0 0 0 0 0 ...
&gt; library(Matrix)
&gt; M&lt;-Matrix(sequence)
&gt; str(M)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:287350] 713 4006 12 20 37 41 67 80 102 103 ...
  ..@ p       : int [1:4029] 0 2 468 934 935 937 1032 1034 1035 1037 ...
  ..@ Dim     : int [1:2] 5000 4028
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:287350] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()
&gt; object.size(M)
3465736 bytes
&gt; object.size(sequence)
161120200 bytes
&gt; system.time(crossprod(M))
   user  system elapsed 
  0.207   0.028   0.238 
&gt; system.time(crossprod(sequence))
   user  system elapsed 
  5.493   0.052   1.614 
```

---

### "Matrix-free" operations

Many uses of matrices (iterative solvers, eigenthingies) don't need the individual elements, just the ability to compute `\(y\mapsto My\)`.

Fast for

- Sparse matrices
- Projections

Trivial: `\(n\)` vs `\(n^2\)` operations
```
y-mean(y)
```

Less trivial: `\(np^2+p^3\)` vs `\(n^2\)` operations


```
qr.resid(qr(X),y)
```


---

```
&gt; x&lt;-rnorm(5000)
&gt; system.time({
+ P&lt;-matrix(-1/5000,ncol=5000,nrow=5000)
+ diag(P)&lt;-diag(P)+1
+ P%*%x
+ })
   user  system elapsed 
  0.473   0.185   0.646 

&gt; microbenchmark(x-mean(x))
Unit: microseconds
        expr    min      lq     mean  median     uq    max neval
 x - mean(x) 18.663 19.2265 19.80241 19.4575 19.747 46.837   100
&gt; microbenchmark(x-sum(x)/5000)
Unit: microseconds
            expr    min      lq     mean  median      uq    max neval
 x - sum(x)/5000 10.792 20.1035 24.52196 25.0765 28.9615 41.306   100
```
---


## Conway's "Life"


Cellular automaton:

- grid of cells,'alive' or 'dead'
- cells with 0,1,5,6 neighbours 'die'
- empty cells with 3 neighbour 'born'

**Look at `life-basic.R` and think about optimisation.**

### talk to your neighbour


---

## Epilogue: random algorithms

Statisticians should know that sampling works

- Sampling from a database
   - including stratified sampling, case-control sampling
- random projections for nearest neighbours
- Stochastic SVD

---

## Stochastic SVD

- Principal components on an `\(n\times m\)` matrix takes `\(nm^2\)` time
- Statisticians usually want `\(k\ll m\)` dimensions
- Take random `\(k+p\)` dimensional projection `\(\Omega\)`
    - compute `\(A\Omega\)`
    - project `\(A\)` on to the subspace spanned by `\(A\Omega\)`
    - SVD the projection
- Gives almost the same leading `\(k\)` singular vectors, for small `\(p\)`.
- Takes only `\(nm(k+p)\)` time

**Only** accurate if `\(m\)`, `\(n\)` large, `\(k+p\)` at least moderate

---

### Example 

```  
load("sequence.rda")
# devtools::install_github("tslumley/bigQF")
library(bigQF)

system.time(s1&lt;-svd(sequence,nu=10,nv=10))
system.time(s2&lt;-ssvd(sequence, n=10,U=TRUE,V=TRUE))
spseq&lt;-sparse.matrixfree(Matrix(sequence))
system.time(s3&lt;-ssvd(spseq, n=10,U=TRUE,V=TRUE))

plot(s1$u[,1:2],pch=19)
points(s2$u[,1:2],col="red")

plot(s1$u[,1:2],pch=19)
points(s3$v[,1],-s3$v[,2],col="green")
```    

---

## Subsampling plus polishing

For generalized linear models on large datasets in a database

- fit the model to a random subset
- do a one-step update computed in the database

My talk on Thursday

---


# Simple Rule: There are no simple rules

### Measure twice; debug once

### Vectorise

### Experiment

### Keep up with modern tools
    </textarea>
<script src="libs/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightLines": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function() {
  var d = document, s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})();</script>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>