Add post

mikemahoney218 · Oct 24, 2023 · 39e2c02
1 parent 483689a

Showing 5 changed files with 172 additions and 1 deletion.
diff --git a/_freeze/posts/2023-08-29-allocations/index/execute-results/html.json b/_freeze/posts/2023-08-29-allocations/index/execute-results/html.json
@@ -1,7 +1,7 @@
 {
   "hash": "3da9568db9d374209e8c71e3f521de3b",
   "result": {
-    "markdown": "---\ntitle: \"Pre-allocating vectors is for nerds\"\ndescription: \"Or rather: growing objects is inefficient. But it's maybe not as big a deal as I'd believed.\"\nauthor:\n  - name: Mike Mahoney\n    url: {}\ndate: \"2023-08-29\"\ncategories: [R, Tutorials]\nimage: banner.jpg\nformat: \n  html:\n    toc: true\nengine: knitr\n---\n\n\nThe second circle of R hell, in [Patrick Burns' seminal book The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), is titled \"Growing Objects\". This refers to a common antipattern for R users, usually among the first things taught when dealing with iteration: it is extremely inefficient to grow a vector using `c()`, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_c <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out <- c(out, i)\n  }\n  out\n}\n```\n:::\n\n\nInstead, Burns says, it is better to pre-allocate our vector `out`, and assign our function's output to a specific position in `out` using either `[` or `[[`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_prealloc_one_bracket <- function(n) {\n  out <- vector(\"numeric\", n)\n  for (i in 1:n) {\n    out[i] <- i\n  }\n  out\n}\n\nvector_prealloc_two_bracket <- function(n) {\n  out <- vector(\"numeric\", n)\n  for (i in 1:n) {\n    out[[i]] <- i\n  }\n  out\n}\n```\n:::\n\n\nOf course, it would be better yet to avoid our loop entirely, and simply create our final object using the colon operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolon_operator <- function(n) {\n  1:n\n}\n```\n:::\n\n\nBut that's beside the point right now.\n\nThis advice was originally written in 2011, but is even more important today. 
In Burns' book, subsetting is roughly 7 times faster when `n` is 10,000; on my computer today, subsetting is roughly 200 times faster:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 10000\nbench::mark(\n  c = vector_c(n),\n  one_bracket = vector_prealloc_one_bracket(n),\n  two_brackets = vector_prealloc_two_bracket(n),\n  colon = colon_operator(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n  expression     median `itr/sec` mem_alloc\n  <bch:expr>   <bch:tm>     <dbl> <bch:byt>\n1 c              57.8ms      17.1   191.2MB\n2 one_bracket   270.9µs    3596.     99.1KB\n3 two_brackets  271.2µs    3570.     96.7KB\n4 colon         371.1ns 1841212.         0B\n```\n:::\n:::\n\n\nBut what if `n` is unknowable? Well, to quote Burns:\n\n> Often a reasonable upper bound on the size of the final object is known. If so,\nthen create the object with that size and then remove the extra values at the\nend. If the final size is a mystery, then you can still follow the same scheme,\nbut allow for periodic growth of the object.\n\nThis is still probably a decent approach: over-allocate and trim down, or allocate in chunks and only grow when those chunks are exhausted.\n\nOr... perhaps we might try growing a vector with `[` or `[[`, rather than with `c()`? 
To anyone raised on R traditions, this might seem like a code smell:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_unalloc_one_bracket <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out[i] <- i\n  }\n  out\n}\n\nvector_unalloc_two_bracket <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out[[i]] <- i\n  }\n  unlist(out)\n}\n```\n:::\n\n\nBut if we test it out:^[I dropped `prealloc_two_brackets` from the benchmarks because it was performing ~the same as the one-bracket alternative.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n  c = vector_c(n),\n  prealloc_one_bracket = vector_prealloc_one_bracket(n),\n  unalloc_one_bracket = vector_unalloc_one_bracket(n),\n  unalloc_two_brackets = vector_unalloc_two_bracket(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n  expression             median `itr/sec` mem_alloc\n  <bch:expr>           <bch:tm>     <dbl> <bch:byt>\n1 c                     61.23ms      14.8  191.23MB\n2 prealloc_one_bracket 269.86µs    3632.    78.17KB\n3 unalloc_one_bracket    1.27ms     679.   871.73KB\n4 unalloc_two_brackets   2.86ms     316.     1.72MB\n```\n:::\n:::\n\n\nGrowing a vector via `[` is still notably slower than assigning values to a pre-allocated vector; it looks like it's roughly ~5 times slower. But that still means it's ~50 times faster than growing a vector via `c()`, and allocates ~200 times less memory to do so. Growing a vector via `[[` isn't quite as efficient -- taking roughly twice the time and memory as `[` here -- but still blows `c()` out of the water.\n\nThat's not too shabby, for a code smell. 
How does a method like `vapply()` compare?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvapply_lambda <- function(n) {\n  vapply(1:n, \\(i) i, numeric(1))\n}\n\nbench::mark(\n  c = vector_c(n),\n  prealloc_one_bracket = vector_prealloc_one_bracket(n),\n  unalloc_one_bracket = vector_unalloc_one_bracket(n),\n  unalloc_two_brackets = vector_unalloc_two_bracket(n),\n  vapply = vapply_lambda(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n  expression             median `itr/sec` mem_alloc\n  <bch:expr>           <bch:tm>     <dbl> <bch:byt>\n1 c                     51.14ms      19.2   191.2MB\n2 prealloc_one_bracket 268.72µs    3643.     78.2KB\n3 unalloc_one_bracket    1.33ms     567.      853KB\n4 unalloc_two_brackets   2.72ms     341.      1.7MB\n5 vapply                 3.44ms     270.     78.2KB\n```\n:::\n:::\n\n\n`vapply()` uses as little memory as our pre-allocation approaches, but is slower than either of our un-allocated methods.^[Usual disclaimer that this is probably not a type of slowness that matters for your code, that you should look into moving computation to C++/Rust if you care about a few milliseconds execution time, and that the real benefits of *apply functions come from readability and their potential for parallelization, not speed.]\n\nIt's worth emphasizing that the differences between these methods are _microscopic_ compared to the difference between them and `c()` for growing vectors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks <- bench::press(\n  bench::mark(\n    c = vector_c(n),\n    prealloc_one_bracket = vector_prealloc_one_bracket(n),\n    unalloc_one_bracket = vector_unalloc_one_bracket(n),\n    unalloc_two_brackets = vector_unalloc_two_bracket(n),\n    vapply = vapply_lambda(n),\n    filter_gc = FALSE\n  ),\n  n = c(10, 100, 1000, 10000, 100000)\n)\n\nlibrary(ggplot2)\nggplot(benchmarks, aes(n, median, color = 
as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nBut as far as execution speed goes, well, maybe growing objects in general isn't worthy of its own circle of hell anymore:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n  ggplot(aes(n, median, color = as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nThough of course, `vapply()` and the pre-allocated methods still win out in terms of memory allocation:^[The pre-allocated line is hidden by the `vapply()` line; they're practically identical, and possibly also literally identical.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n  ggplot(aes(n, mem_alloc, color = as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Memory allocation (bytes)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nSo: pre-allocate your vectors when you're able. But maybe it's fine to grow an object every once in a while, as a treat. It probably won't get you sent to hell.\n\nI have no idea when things changed to make growing vectors via `[` so much more efficient now than in 2011 -- and please let me know in the comments/[Mastodon](https://fosstodon.org/@MikeMahoney218)/[BlueSky](https://bsky.app/profile/mikemahoney218.com) if you know any more details here. \n",
+    "markdown": "---\ntitle: \"Pre-allocating vectors is for nerds\"\ndescription: \"Or rather: growing objects is inefficient. But it's maybe not as big a deal as I'd believed.\"\nauthor:\n  - name: Mike Mahoney\n    url: {}\ndate: \"2023-08-29\"\ncategories: [R, Tutorials]\nimage: banner.jpg\nformat: \n  html:\n    toc: true\nengine: knitr\n---\n\n\nThe second circle of R hell, in [Patrick Burns' seminal book The R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), is titled \"Growing Objects\". This refers to a common antipattern for R users, usually among the first things taught when dealing with iteration: it is extremely inefficient to grow a vector using `c()`, like so:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_c <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out <- c(out, i)\n  }\n  out\n}\n```\n:::\n\n\nInstead, Burns says, it is better to pre-allocate our vector `out`, and assign our function's output to a specific position in `out` using either `[` or `[[`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_prealloc_one_bracket <- function(n) {\n  out <- vector(\"numeric\", n)\n  for (i in 1:n) {\n    out[i] <- i\n  }\n  out\n}\n\nvector_prealloc_two_bracket <- function(n) {\n  out <- vector(\"numeric\", n)\n  for (i in 1:n) {\n    out[[i]] <- i\n  }\n  out\n}\n```\n:::\n\n\nOf course, it would be better yet to avoid our loop entirely, and simply create our final object using the colon operator:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolon_operator <- function(n) {\n  1:n\n}\n```\n:::\n\n\nBut that's beside the point right now.\n\nThis advice was originally written in 2011, but is even more important today. 
In Burns' book, subsetting is roughly 7 times faster when `n` is 10,000; on my computer today, subsetting is roughly 200 times faster:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nn <- 10000\nbench::mark(\n  c = vector_c(n),\n  one_bracket = vector_prealloc_one_bracket(n),\n  two_brackets = vector_prealloc_two_bracket(n),\n  colon = colon_operator(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n  expression     median `itr/sec` mem_alloc\n  <bch:expr>   <bch:tm>     <dbl> <bch:byt>\n1 c                51ms      19.3   191.2MB\n2 one_bracket     277µs    3548.     99.1KB\n3 two_brackets    276µs    3538.     96.7KB\n4 colon           361ns 2124339.         0B\n```\n:::\n:::\n\n\nBut what if `n` is unknowable? Well, to quote Burns:\n\n> Often a reasonable upper bound on the size of the final object is known. If so,\nthen create the object with that size and then remove the extra values at the\nend. If the final size is a mystery, then you can still follow the same scheme,\nbut allow for periodic growth of the object.\n\nThis is still probably a decent approach: over-allocate and trim down, or allocate in chunks and only grow when those chunks are exhausted.\n\nOr... perhaps we might try growing a vector with `[` or `[[`, rather than with `c()`? 
To anyone raised on R traditions, this might seem like a code smell:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvector_unalloc_one_bracket <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out[i] <- i\n  }\n  out\n}\n\nvector_unalloc_two_bracket <- function(n) {\n  out <- c()\n  for (i in 1:n) {\n    out[[i]] <- i\n  }\n  unlist(out)\n}\n```\n:::\n\n\nBut if we test it out:^[I dropped `prealloc_two_brackets` from the benchmarks because it was performing ~the same as the one-bracket alternative.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbench::mark(\n  c = vector_c(n),\n  prealloc_one_bracket = vector_prealloc_one_bracket(n),\n  unalloc_one_bracket = vector_unalloc_one_bracket(n),\n  unalloc_two_brackets = vector_unalloc_two_bracket(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 4\n  expression             median `itr/sec` mem_alloc\n  <bch:expr>           <bch:tm>     <dbl> <bch:byt>\n1 c                     54.02ms      16.6  191.23MB\n2 prealloc_one_bracket 285.52µs    3428.    78.17KB\n3 unalloc_one_bracket    1.24ms     710.   871.73KB\n4 unalloc_two_brackets   2.76ms     337.     1.72MB\n```\n:::\n:::\n\n\nGrowing a vector via `[` is still notably slower than assigning values to a pre-allocated vector; it looks like it's roughly ~5 times slower. But that still means it's ~50 times faster than growing a vector via `c()`, and allocates ~200 times less memory to do so. Growing a vector via `[[` isn't quite as efficient -- taking roughly twice the time and memory as `[` here -- but still blows `c()` out of the water.\n\nThat's not too shabby, for a code smell. 
How does a method like `vapply()` compare?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvapply_lambda <- function(n) {\n  vapply(1:n, \\(i) i, numeric(1))\n}\n\nbench::mark(\n  c = vector_c(n),\n  prealloc_one_bracket = vector_prealloc_one_bracket(n),\n  unalloc_one_bracket = vector_unalloc_one_bracket(n),\n  unalloc_two_brackets = vector_unalloc_two_bracket(n),\n  vapply = vapply_lambda(n),\n  filter_gc = FALSE\n)[c(\"expression\", \"median\", \"itr/sec\", \"mem_alloc\")]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 4\n  expression             median `itr/sec` mem_alloc\n  <bch:expr>           <bch:tm>     <dbl> <bch:byt>\n1 c                     50.87ms      19.5   191.2MB\n2 prealloc_one_bracket 279.79µs    3501.     78.2KB\n3 unalloc_one_bracket    1.18ms     649.      853KB\n4 unalloc_two_brackets   2.69ms     345.      1.7MB\n5 vapply                 3.41ms     272.     78.2KB\n```\n:::\n:::\n\n\n`vapply()` uses as little memory as our pre-allocation approaches, but is slower than either of our un-allocated methods.^[Usual disclaimer that this is probably not a type of slowness that matters for your code, that you should look into moving computation to C++/Rust if you care about a few milliseconds execution time, and that the real benefits of *apply functions come from readability and their potential for parallelization, not speed.]\n\nIt's worth emphasizing that the differences between these methods are _microscopic_ compared to the difference between them and `c()` for growing vectors:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks <- bench::press(\n  bench::mark(\n    c = vector_c(n),\n    prealloc_one_bracket = vector_prealloc_one_bracket(n),\n    unalloc_one_bracket = vector_unalloc_one_bracket(n),\n    unalloc_two_brackets = vector_unalloc_two_bracket(n),\n    vapply = vapply_lambda(n),\n    filter_gc = FALSE\n  ),\n  n = c(10, 100, 1000, 10000, 100000)\n)\n\nlibrary(ggplot2)\nggplot(benchmarks, aes(n, median, color = 
as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-8-1.png){width=672}\n:::\n:::\n\n\nBut as far as execution speed goes, well, maybe growing objects in general isn't worthy of its own circle of hell anymore:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n  ggplot(aes(n, median, color = as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Median execution time (s)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n\nThough of course, `vapply()` and the pre-allocated methods still win out in terms of memory allocation:^[The pre-allocated line is hidden by the `vapply()` line; they're practically identical, and possibly also literally identical.]\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbenchmarks[as.character(benchmarks$expression) != \"c\", ] |> \n  ggplot(aes(n, mem_alloc, color = as.character(expression))) + \n  geom_line() + \n  theme_minimal() + \n  labs(y = \"Memory allocation (bytes)\")\n```\n\n::: {.cell-output-display}\n![](index_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\nSo: pre-allocate your vectors when you're able. But maybe it's fine to grow an object every once in a while, as a treat. It probably won't get you sent to hell.\n\nI have no idea when things changed to make growing vectors via `[` so much more efficient now than in 2011 -- and please let me know in the comments/[Mastodon](https://fosstodon.org/@MikeMahoney218)/[BlueSky](https://bsky.app/profile/mikemahoney218.com) if you know any more details here. \n",
     "supporting": [
       "index_files"
     ],

diff --git a/_freeze/posts/2023-08-29-allocations/index/figure-html/unnamed-chunk-8-1.png b/_freeze/posts/2023-08-29-allocations/index/figure-html/unnamed-chunk-8-1.png
diff --git a/_freeze/posts/2023-08-29-allocations/index/figure-html/unnamed-chunk-9-1.png b/_freeze/posts/2023-08-29-allocations/index/figure-html/unnamed-chunk-9-1.png