Add post
mikemahoney218 committed Oct 24, 2023
1 parent 483689a commit 39e2c02
Showing 5 changed files with 172 additions and 1 deletion.
@@ -1,7 +1,7 @@
{
"hash": "3da9568db9d374209e8c71e3f521de3b",
"result": {
"markdown": "---\ntitle: \"Pre-allocating vectors is for nerds\"\ndescription: \"Or rather: growing objects is inefficient. But it's maybe not as big a deal as I'd believed.\"\nauthor:\n - name: Mike Mahoney\n url: {}\ndate: \"2023-08-29\"\ncategories: [R, Tutorials]\nimage: banner.jpg\nformat: \n html:\n toc: true\nengine: knitr\n---\n\n\nThe second circle of R hell, in [Patrick Burns' seminal book The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), is titled \"Growing Objects\". This refers to a common antipattern for R users, usually among the first things taught when dealing with iteration: it is extremely inefficient to grow a vector using `c()`, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_c <- function(n) {\n out <- c()\n for (i in 1:n) {\n out <- c(out, i)\n }\n out\n}\n```\n:::\n\n\nInstead, Burns says, it is better to pre-allocate our vector `out`, and assign our function's output to a specific position in `out` using either `[` or `[[`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_prealloc_one_bracket <- function(n) {\n out <- vector(\"numeric\", n)\n for (i in 1:n) {\n out[i] <- i\n }\n out\n}\n\nvector_prealloc_two_bracket <- function(n) {\n out <- vector(\"numeric\", n)\n for (i in 1:n) {\n out[[i]] <- i\n }\n out\n}\n```\n:::\n\n\nOf course, it would be better yet to avoid our loop entirely, and simply create our final object using the colon operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolon_operator <- function(n) {\n 1:n\n}\n```\n:::\n\n\nBut that's beside the point right now.\n\nThis advice was originally written in 2011, but is even more important today. 
In Burns' book, subsetting is roughly 7 times faster when `n` is 10,000; on my computer today, subsetting is roughly 200 times faster:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 10000\nbench::mark(\n c = vector_c(n),\n one_bracket = vector_prealloc_one_bracket(n),\n two_brackets = vector_prealloc_two_bracket(n),\n colon = colon_operator(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 57.8ms 17.1 191.2MB\n2 one_bracket 270.9µs 3596. 99.1KB\n3 two_brackets 271.2µs 3570. 96.7KB\n4 colon 371.1ns 1841212. 0B\n```\n:::\n:::\n\n\nBut what if `n` is unknowable? Well, to quote Burns:\n\n> Often a reasonable upper bound on the size of the final object is known. If so,\nthen create the object with that size and then remove the extra values at the\nend. If the final size is a mystery, then you can still follow the same scheme,\nbut allow for periodic growth of the object.\n\nThis is still probably a decent approach: over-allocate and trim down, or allocate in chunks and only grow when those chunks are exhausted.\n\nOr... perhaps we might try growing a vector with `[` or `[[`, rather than with `c()`? 
To anyone raised on R traditions, this might seem like a code smell:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_unalloc_one_bracket <- function(n) {\n out <- c()\n for (i in 1:n) {\n out[i] <- i\n }\n out\n}\n\nvector_unalloc_two_bracket <- function(n) {\n out <- c()\n for (i in 1:n) {\n out[[i]] <- i\n }\n unlist(out)\n}\n```\n:::\n\n\nBut if we test it out:^[I dropped `prealloc_two_brackets` from the benchmarks because it was performing ~the same as the one-bracket alternative.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 61.23ms 14.8 191.23MB\n2 prealloc_one_bracket 269.86µs 3632. 78.17KB\n3 unalloc_one_bracket 1.27ms 679. 871.73KB\n4 unalloc_two_brackets 2.86ms 316. 1.72MB\n```\n:::\n:::\n\n\nGrowing a vector via `[` is still notably slower than assigning values to a pre-allocated vector; it looks like it's roughly ~5 times slower. But that still means it's ~50 times faster than growing a vector via `c()`, and allocates ~200 times less memory to do so. Growing a vector via `[[` isn't quite as efficient -- taking roughly twice the time and memory as `[` here -- but still blows `c()` out of the water.\n\nThat's not too shabby, for a code smell. 
How does a method like `vapply()` compare?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvapply_lambda <- function(n) {\n vapply(1:n, \\(i) i, numeric(1))\n}\n\nbench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n vapply = vapply_lambda(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 51.14ms 19.2 191.2MB\n2 prealloc_one_bracket 268.72µs 3643. 78.2KB\n3 unalloc_one_bracket 1.33ms 567. 853KB\n4 unalloc_two_brackets 2.72ms 341. 1.7MB\n5 vapply 3.44ms 270. 78.2KB\n```\n:::\n:::\n\n\n`vapply()` uses as little memory as our pre-allocation approaches, but is slower than either of our un-allocated methods.^[Usual disclaimer that this is probably not a type of slowness that matters for your code, that you should look into moving computation to C++/Rust if you care about a few milliseconds execution time, and that the real benefits of *apply functions come from readability and their potential for parallelization, not speed.]\n\nIt's worth emphasizing that the differences between these methods are _microscopic_ compared to the difference between them and `c()` for growing vectors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks <- bench::press(\n bench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n vapply = vapply_lambda(n),\n filter_gc = FALSE\n ),\n n = c(10, 100, 1000, 10000, 100000)\n)\n\nlibrary(ggplot2)\nggplot(benchmarks, aes(n, median, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Median execution time (s)\")\n```\n\n::: 
{.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nBut as far as execution speed goes, well, maybe growing objects in general isn't worthy of its own circle of hell anymore:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n ggplot(aes(n, median, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nThough of course, `vapply()` and the pre-allocated methods still win out in terms of memory allocation:^[The pre-allocated line is hidden by the `vapply()` line; they're practically identical, and possibly also literally identical.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n ggplot(aes(n, mem_alloc, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Memory allocation (bytes)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nSo: pre-allocate your vectors when you're able. But maybe it's fine to grow an object every once in a while, as a treat. It probably won't get you sent to hell.\n\nI have no idea when things changed to make growing vectors via `[` so much more efficient now than in 2011 -- and please let me know in the comments/[Mastodon](https://fosstodon.org/@MikeMahoney218)/[BlueSky](https://bsky.app/profile/mikemahoney218.com) if you know any more details here. \n",
"markdown": "---\ntitle: \"Pre-allocating vectors is for nerds\"\ndescription: \"Or rather: growing objects is inefficient. But it's maybe not as big a deal as I'd believed.\"\nauthor:\n - name: Mike Mahoney\n url: {}\ndate: \"2023-08-29\"\ncategories: [R, Tutorials]\nimage: banner.jpg\nformat: \n html:\n toc: true\nengine: knitr\n---\n\n\nThe second circle of R hell, in [Patrick Burns' seminal book The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), is titled \"Growing Objects\". This refers to a common antipattern for R users, usually among the first things taught when dealing with iteration: it is extremely inefficient to grow a vector using `c()`, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_c <- function(n) {\n out <- c()\n for (i in 1:n) {\n out <- c(out, i)\n }\n out\n}\n```\n:::\n\n\nInstead, Burns says, it is better to pre-allocate our vector `out`, and assign our function's output to a specific position in `out` using either `[` or `[[`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_prealloc_one_bracket <- function(n) {\n out <- vector(\"numeric\", n)\n for (i in 1:n) {\n out[i] <- i\n }\n out\n}\n\nvector_prealloc_two_bracket <- function(n) {\n out <- vector(\"numeric\", n)\n for (i in 1:n) {\n out[[i]] <- i\n }\n out\n}\n```\n:::\n\n\nOf course, it would be better yet to avoid our loop entirely, and simply create our final object using the colon operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolon_operator <- function(n) {\n 1:n\n}\n```\n:::\n\n\nBut that's beside the point right now.\n\nThis advice was originally written in 2011, but is even more important today. 
In Burns' book, subsetting is roughly 7 times faster when `n` is 10,000; on my computer today, subsetting is roughly 200 times faster:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 10000\nbench::mark(\n c = vector_c(n),\n one_bracket = vector_prealloc_one_bracket(n),\n two_brackets = vector_prealloc_two_bracket(n),\n colon = colon_operator(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 51ms 19.3 191.2MB\n2 one_bracket 277µs 3548. 99.1KB\n3 two_brackets 276µs 3538. 96.7KB\n4 colon 361ns 2124339. 0B\n```\n:::\n:::\n\n\nBut what if `n` is unknowable? Well, to quote Burns:\n\n> Often a reasonable upper bound on the size of the final object is known. If so,\nthen create the object with that size and then remove the extra values at the\nend. If the final size is a mystery, then you can still follow the same scheme,\nbut allow for periodic growth of the object.\n\nThis is still probably a decent approach: over-allocate and trim down, or allocate in chunks and only grow when those chunks are exhausted.\n\nOr... perhaps we might try growing a vector with `[` or `[[`, rather than with `c()`? 
To anyone raised on R traditions, this might seem like a code smell:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_unalloc_one_bracket <- function(n) {\n out <- c()\n for (i in 1:n) {\n out[i] <- i\n }\n out\n}\n\nvector_unalloc_two_bracket <- function(n) {\n out <- c()\n for (i in 1:n) {\n out[[i]] <- i\n }\n unlist(out)\n}\n```\n:::\n\n\nBut if we test it out:^[I dropped `prealloc_two_brackets` from the benchmarks because it was performing ~the same as the one-bracket alternative.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 54.02ms 16.6 191.23MB\n2 prealloc_one_bracket 285.52µs 3428. 78.17KB\n3 unalloc_one_bracket 1.24ms 710. 871.73KB\n4 unalloc_two_brackets 2.76ms 337. 1.72MB\n```\n:::\n:::\n\n\nGrowing a vector via `[` is still notably slower than assigning values to a pre-allocated vector; it looks like it's roughly ~5 times slower. But that still means it's ~50 times faster than growing a vector via `c()`, and allocates ~200 times less memory to do so. Growing a vector via `[[` isn't quite as efficient -- taking roughly twice the time and memory as `[` here -- but still blows `c()` out of the water.\n\nThat's not too shabby, for a code smell. 
How does a method like `vapply()` compare?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvapply_lambda <- function(n) {\n vapply(1:n, \\(i) i, numeric(1))\n}\n\nbench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n vapply = vapply_lambda(n),\n filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n expression median `itr/sec` mem_alloc\n <bch:expr> <bch:tm> <dbl> <bch:byt>\n1 c 50.87ms 19.5 191.2MB\n2 prealloc_one_bracket 279.79µs 3501. 78.2KB\n3 unalloc_one_bracket 1.18ms 649. 853KB\n4 unalloc_two_brackets 2.69ms 345. 1.7MB\n5 vapply 3.41ms 272. 78.2KB\n```\n:::\n:::\n\n\n`vapply()` uses as little memory as our pre-allocation approaches, but is slower than either of our un-allocated methods.^[Usual disclaimer that this is probably not a type of slowness that matters for your code, that you should look into moving computation to C++/Rust if you care about a few milliseconds execution time, and that the real benefits of *apply functions come from readability and their potential for parallelization, not speed.]\n\nIt's worth emphasizing that the differences between these methods are _microscopic_ compared to the difference between them and `c()` for growing vectors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks <- bench::press(\n bench::mark(\n c = vector_c(n),\n prealloc_one_bracket = vector_prealloc_one_bracket(n),\n unalloc_one_bracket = vector_unalloc_one_bracket(n),\n unalloc_two_brackets = vector_unalloc_two_bracket(n),\n vapply = vapply_lambda(n),\n filter_gc = FALSE\n ),\n n = c(10, 100, 1000, 10000, 100000)\n)\n\nlibrary(ggplot2)\nggplot(benchmarks, aes(n, median, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Median execution time (s)\")\n```\n\n::: 
{.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nBut as far as execution speed goes, well, maybe growing objects in general isn't worthy of its own circle of hell anymore:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n ggplot(aes(n, median, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nThough of course, `vapply()` and the pre-allocated methods still win out in terms of memory allocation:^[The pre-allocated line is hidden by the `vapply()` line; they're practically identical, and possibly also literally identical.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n ggplot(aes(n, mem_alloc, color = as.character(expression))) + \n geom_line() + \n theme_minimal() + \n labs(y = \"Memory allocation (bytes)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nSo: pre-allocate your vectors when you're able. But maybe it's fine to grow an object every once in a while, as a treat. It probably won't get you sent to hell.\n\nI have no idea when things changed to make growing vectors via `[` so much more efficient now than in 2011 -- and please let me know in the comments/[Mastodon](https://fosstodon.org/@MikeMahoney218)/[BlueSky](https://bsky.app/profile/mikemahoney218.com) if you know any more details here. \n",
"supporting": [
"index_files"
],