data.table rewrite of logic functions, profileApply / mutate_profile #174

Closed

brownag opened this issue Dec 4, 2020 · 7 comments

@brownag (Member) commented Dec 4, 2020

This is a sub-issue related to #157 -- which is too "big" to tackle in one bite IMO.

A data.table rewrite of the logic functions, and soon of mutate_profile / profileApply, offers significant benefits in memory usage and elapsed time for profile-level evaluations. For this sp4-based demo, the data.table rewrite is ~50x faster in fast mode (the most commonly used output) and ~10x faster in slow mode (identical output).

Here is a demo showing a drop-in replacement for checkHzDepthLogic.

library(data.table)
library(aqp, warn.conflicts = FALSE)
#> This is aqp 1.26

data(sp4)
depths(sp4) <- id ~ top + bottom

aqp_df_class(sp4) <- "data.table"
sp4 <- rebuildSPC(sp4)
#> using `hzID` as a unique horizon ID

New function

#' Fast checks of horizon depth logic in entire SoilProfileCollection
#' @param object A SoilProfileCollection
#' @param fast If details about specific test results are not needed, the operation can run approximately 5x faster. Default: TRUE.
#' @return A data.table containing profile IDs, a validity boolean, and individual test results if `fast = FALSE`
#' @export

checkAllLogic <- function(object, fast = TRUE) {
  h <- horizons(object)
  hzd <- horizonDepths(object)
  
  # quote the ID and depth column references for evaluation inside data.table
  hby <- substitute(idname(object))
  top <- substitute(hzd[1])
  bottom <- substitute(hzd[2])
  
  if (!fast) {
    # run the full set of depth tests per profile, then summarize validity
    res <- h[, .(tests = list(tests = data.frame(t(hzDepthTests(eval(top), eval(bottom)))))), by = c(eval(hby))][
             , .(tests = tests, valid = all(!tests[[1]])), by = c(eval(hby))]
    # unnest the per-profile test results into columns alongside the validity flag
    res <- cbind(res, rbindlist(res$tests))
    res$tests <- NULL
    return(res)
  } else {
    # validity flag only: a profile is valid when no depth test fails
    res <- h[, all(!hzDepthTests(eval(top), eval(bottom))), by = c(eval(hby))]
    colnames(res) <- c(idname(object), "valid")
    return(res)
  }
}

"Fast" results

checkAllLogic(sp4)
#>                 id valid
#>  1:         colusa  TRUE
#>  2:          glenn  TRUE
#>  3:          kings  TRUE
#>  4:       mariposa  TRUE
#>  5:      mendocino  TRUE
#>  6:           napa  TRUE
#>  7:     san benito  TRUE
#>  8:         shasta  TRUE
#>  9: shasta-trinity  TRUE
#> 10:         tehama  TRUE

And the benchmarks

# raw data.table; just validity boolean
f1 <-  function() horizons(sp4)[, all(!hzDepthTests(top, bottom)), by = id]$V1 

# current aqp implementation
f2 <-  function() checkHzDepthLogic(sp4)$valid 

# fastish check of all logic; 
#   fast = TRUE: no individual test results;
#   fast = FALSE: same as checkHzDepthLogic output
f3 <-  function(fast = TRUE) checkAllLogic(sp4, fast)$valid 

bench::mark(x <- f1(),             # raw data.table
            x <- f2(),             # checkHzDepthLogic
            x <- f3(),             # checkAllLogic(fast=TRUE)
            x <- f3(fast = FALSE)) # checkAllLogic(fast=FALSE)

#> # A tibble: 4 x 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x <- f1()              843.5us  896.7us    1065.    48.62KB     13.3
#> 2 x <- f2()              46.82ms   50.1ms      20.0    3.23MB     79.8
#> 3 x <- f3()               1.22ms   1.31ms     741.    49.64KB     15.0
#> 4 x <- f3(fast = FALSE)   5.13ms   5.44ms     179.   377.66KB     16.3
@dylanbeaudette (Member)

Cool. Do you think that an on-the-fly conversion to data.table would be possible if starting from a data.frame? That is the approach I've started with in the re-write of slice; it seems to have minimal startup / tear-down costs.

@brownag (Member, Author) commented Dec 4, 2020

There is not very much overhead with converting to data.table. It is more "costly" than data.frame, but less than tibble in my testing.

EDIT:

library(data.table)
library(tibble)
library(aqp, warn.conflicts = FALSE)
#> This is aqp 1.26

data(sp4)
depths(sp4) <- id ~ top + bottom
top <- sp4[1, ][[horizonDepths(sp4)[1]]]
bottom <- sp4[1, ][[horizonDepths(sp4)[2]]]

# data.frame -> x
microbenchmark::microbenchmark(data.frame(top, bottom), data.table(top, bottom), tibble(top, bottom))
#> Unit: microseconds
#>                     expr     min       lq      mean   median       uq       max   neval=100
#>  data.frame(top, bottom) 247.436 272.4865  346.8239 294.8015 327.1955  3598.218
#>  data.table(top, bottom) 195.566 227.3570  292.9348 242.6355 267.3560  1671.898
#>      tibble(top, bottom) 743.337 841.6940 1307.6296 911.5610 958.8015 38277.681

lsp4 <- as.list(data.frame(top, bottom))
# list -> x
microbenchmark::microbenchmark(as.data.frame(lsp4), as.data.table(lsp4), as_tibble(lsp4))
#> Unit: microseconds
#>                 expr     min       lq     mean   median       uq     max   neval=100
#>  as.data.frame(lsp4) 231.792 244.0465 257.8300 250.3375 260.6730 361.660  
#>  as.data.table(lsp4) 148.525 162.3775 178.7438 171.8700 179.1835 736.985  
#>      as_tibble(lsp4) 141.511 150.6820 162.7615 155.0220 160.4615 562.501  
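As a side note, here is a minimal sketch (not aqp code, just for illustration) of a by-reference alternative: data.table::setDT() and setDF() convert in place, so an on-the-fly conversion can skip the copy that the constructors benchmarked above pay for.

# hypothetical on-the-fly conversion, reusing top / bottom from above
h <- data.frame(top, bottom)
setDT(h)                          # data.frame -> data.table by reference (no copy)
h[, .(thickness = bottom - top)]  # data.table syntax available internally
setDF(h)                          # back to a plain data.frame before returning
class(h)
#> [1] "data.frame"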

@dylanbeaudette (Member)

What do you think about these three scenarios:

  • manual conversion to data.table required for the optimized code
  • optional argument to perform the on-the-fly conversion to / from data.table
  • silent conversion to / from data.table per some heuristic, user-defined option, or record count

I really like the idea of letting folks choose the tabular back-end, with the possibility of performing on-the-fly conversion to / from data.table when a significant performance gain is possible. In other words, I'm happy to use data.table to get something done, but not 100% ready to use it for all tasks.

@pierreroudier (Collaborator)

FWIW I agree with you @dylanbeaudette -- I think it's good to let advanced users decide to use a performance-oriented "tabular" backend (either tibble or data.table, or even something else like dbplyr in the future), but to have the default behaviour use something less advanced users might be more familiar with, namely, ye old data.frame.

@brownag (Member, Author) commented Dec 8, 2020

Given the way I have implemented data.table and tibble support, it is eminently possible to support some other things like dbplyr for "lazy" SoilProfileCollections. I actually tried this several months ago -- it would be worth revisiting given some more recent developments in that arena.

I think we are all in agreement -- the default for SPCs is data.frame -- and all you will ever see out of them are data.frames unless you manually change the data.frame class in the SPC and rebuild (as I do in the example above), or build with the desired object type, e.g. depths(some_tibble) <- id ~ top + bottom.
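For instance, a minimal sketch of that last pattern using the sp4 sample data (assuming tibble is installed):

library(aqp)
library(tibble)

data(sp4)
sp4_tbl <- as_tibble(sp4)
# promoting a tibble keeps the tabular slots of the SPC tibble-backed
depths(sp4_tbl) <- id ~ top + bottom
class(horizons(sp4_tbl))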

On-the-fly conversion is really not an issue. I suppose if you want to manually trigger a different output class, that could be done on a function-specific basis. Personally, I prefer to get results back in the same class as the slots of my SPC. So, forgive me for not thinking this is a big deal. We are already doing mostly scenarios 1 and 3 for several things I have implemented in aqp. For instance, all calls to combine (formerly union) use data.table::rbindlist. Likewise, j-index subsetting with [ uses data.table. The user is none the wiser -- they always get results back in the class that they created their object with.
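To make that concrete, here is a rough sketch of the internal pattern -- a hypothetical helper, not part of aqp, and it assumes aqp_df_class() returns one of "data.table", "tbl_df", or "data.frame":

library(aqp)
library(data.table)

# hypothetical helper: compute profile thickness with data.table internally,
# then return the result in whatever class the SPC was built with
profile_thickness <- function(object) {
  h <- as.data.table(horizons(object))
  hzd <- horizonDepths(object)
  res <- h[, .(thickness = max(get(hzd[2])) - min(get(hzd[1]))),
           by = c(idname(object))]
  switch(aqp_df_class(object),
         "data.table" = res,
         "tbl_df"     = tibble::as_tibble(res),
         as.data.frame(res))
}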

@dylanbeaudette (Member)

Not a big deal, but I'm trying to get a better idea of how you see these functions being invoked and what the expectations are. It seems like we are all on the same page. I've had similar conversations with myself about internal use of data.table in the slice re-write.

brownag added a commit that referenced this issue Jan 20, 2021
brownag added a commit that referenced this issue Jan 20, 2021
brownag added a commit that referenced this issue Jan 21, 2021
* checkHzDepthLogic: use data.table instead of profileApply #157 #174

* data.table: R CMD check-safe .() -> list() and define "undefined globals" as null locally #157 #174

* #190; looks like this got uncommented when I rebased onto latest changes

@brownag (Member, Author) commented Jan 22, 2021

I am going to close this issue since the new method I posted here was incorporated into checkHzDepthLogic.

My feelings on mutate_profile and profileApply are as follows:

  • mutate_profile: a major optimization is coming soon (I will post an issue when closer to having a properly generalized prototype)
  • profileApply: will remain unchanged
