data.table rewrite of logic functions, profileApply / mutate_profile #174

Closed

brownag opened this issue Dec 4, 2020 · 7 comments

@brownag (Member) commented Dec 4, 2020

This is a sub-issue related to #157 -- which is too "big" to tackle in one bite IMO.

A data.table rewrite of the logic functions, and soon of mutate_profile / profileApply, offers significant benefits in memory usage and elapsed time for profile-level evaluations. For this sp4-based demo, the data.table rewrite is ~50x faster in fast mode (the most commonly used output) and ~10x faster in slow mode (identical output).

Here is a demo showing a drop-in replacement for checkHzDepthLogic.

library(data.table)
library(aqp, warn.conflicts = FALSE)
#> This is aqp 1.26

data(sp4)
depths(sp4) <- id ~ top + bottom

aqp_df_class(sp4) <- "data.table"
sp4 <- rebuildSPC(sp4)
#> using `hzID` as a unique horizon ID

New function

#' Fast checks of horizon depth logic in entire SoilProfileCollection
#' @param object A SoilProfileCollection
#' @param fast If details about specific test results are not needed, the operation can run approximately 5x faster. Default: TRUE.
#' @return A data.table containing profile IDs, a validity boolean, and individual test results if `fast = FALSE`
#' @export

checkAllLogic <- function(object, fast = TRUE) {
  h <- horizons(object)
  hzd <- horizonDepths(object)
  
  # quote the ID and depth column references for evaluation inside data.table
  hby <- substitute(idname(object))
  top <- substitute(hzd[1])
  bottom <- substitute(hzd[2])
  
  if (!fast) {
    # run the full set of depth tests per profile, then summarize validity
    res <- h[, .(tests = list(tests = data.frame(t(hzDepthTests(eval(top), eval(bottom)))))), by = c(eval(hby))][
             , .(tests = tests, valid = all(!tests[[1]])), by = c(eval(hby))]
    # unnest the per-profile test results into columns alongside the validity flag
    res <- cbind(res, rbindlist(res$tests))
    res$tests <- NULL
    return(res)
  } else {
    # validity flag only: a profile is valid when no depth test fails
    res <- h[, all(!hzDepthTests(eval(top), eval(bottom))), by = c(eval(hby))]
    colnames(res) <- c(idname(object), "valid")
    return(res)
  }
}

"Fast" results

checkAllLogic(sp4)
#>                 id valid
#>  1:         colusa  TRUE
#>  2:          glenn  TRUE
#>  3:          kings  TRUE
#>  4:       mariposa  TRUE
#>  5:      mendocino  TRUE
#>  6:           napa  TRUE
#>  7:     san benito  TRUE
#>  8:         shasta  TRUE
#>  9: shasta-trinity  TRUE
#> 10:         tehama  TRUE

And the benchmarks

# raw data.table; just validity boolean
f1 <-  function() horizons(sp4)[, all(!hzDepthTests(top, bottom)), by = id]$V1 

# current aqp implementation
f2 <-  function() checkHzDepthLogic(sp4)$valid 

# fastish check of all logic; 
#   fast = TRUE: no individual test results;
#   fast = FALSE: same as checkHzDepthLogic output
f3 <-  function(fast = TRUE) checkAllLogic(sp4, fast)$valid 

bench::mark(x <- f1(),             # raw data.table
            x <- f2(),             # checkHzDepthLogic
            x <- f3(),             # checkAllLogic(fast=TRUE)
            x <- f3(fast = FALSE)) # checkAllLogic(fast=FALSE)

#> # A tibble: 4 x 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x <- f1()              843.5us  896.7us    1065.    48.62KB     13.3
#> 2 x <- f2()              46.82ms   50.1ms      20.0    3.23MB     79.8
#> 3 x <- f3()               1.22ms   1.31ms     741.    49.64KB     15.0
#> 4 x <- f3(fast = FALSE)   5.13ms   5.44ms     179.   377.66KB     16.3
@dylanbeaudette (Member)

Cool. Do you think that an on-the-fly conversion to data.table would be possible if starting from a data.frame? That is the approach I've started with in the re-write of slice; it seems to have minimal startup / tear-down costs.

@brownag (Member, Author) commented Dec 4, 2020

There is not very much overhead with converting to data.table. It is more "costly" than data.frame, but less than tibble in my testing.

EDIT:

library(data.table)
library(tibble)
library(aqp, warn.conflicts = FALSE)
#> This is aqp 1.26

data(sp4)
depths(sp4) <- id ~ top + bottom
top <- sp4[1, ][[horizonDepths(sp4)[1]]]
bottom <- sp4[1, ][[horizonDepths(sp4)[2]]]

# data.frame -> x
microbenchmark::microbenchmark(data.frame(top, bottom), data.table(top, bottom), tibble(top, bottom))
#> Unit: microseconds
#>                     expr     min       lq      mean   median       uq       max   neval=100
#>  data.frame(top, bottom) 247.436 272.4865  346.8239 294.8015 327.1955  3598.218
#>  data.table(top, bottom) 195.566 227.3570  292.9348 242.6355 267.3560  1671.898
#>      tibble(top, bottom) 743.337 841.6940 1307.6296 911.5610 958.8015 38277.681

lsp4 <- as.list(data.frame(top, bottom))
# list -> x
microbenchmark::microbenchmark(as.data.frame(lsp4), as.data.table(lsp4), as_tibble(lsp4))
#> Unit: microseconds
#>                 expr     min       lq     mean   median       uq     max   neval=100
#>  as.data.frame(lsp4) 231.792 244.0465 257.8300 250.3375 260.6730 361.660  
#>  as.data.table(lsp4) 148.525 162.3775 178.7438 171.8700 179.1835 736.985  
#>      as_tibble(lsp4) 141.511 150.6820 162.7615 155.0220 160.4615 562.501  
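As a side note, here is a minimal sketch (not aqp code, just for illustration) of a by-reference alternative: data.table::setDT() and setDF() convert in place, so an on-the-fly conversion can skip the copy that the constructors benchmarked above pay for.

# hypothetical on-the-fly conversion, reusing top / bottom from above
h <- data.frame(top, bottom)
setDT(h)                          # data.frame -> data.table by reference (no copy)
h[, .(thickness = bottom - top)]  # data.table syntax available internally
setDF(h)                          # back to a plain data.frame before returning
class(h)
#> [1] "data.frame"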

@dylanbeaudette (Member)

What do you think about these three scenarios:

  • manual conversion to data.table required for the optimized code
  • optional argument to perform the on-the-fly conversion to / from data.table
  • silent conversion to / from data.table per some heuristic, user-defined option, or record count

I really like the idea of letting folks choose the tabular back-end, with the possibility of performing on-the-fly conversion to / from data.table when a significant performance gain is possible. In other words, I'm happy to use data.table to get something done, but not 100% ready to use it for all tasks.

@pierreroudier (Collaborator)

FWIW I agree with you @dylanbeaudette -- I think it's good to let advanced users decide to use a performance-oriented "tabular" backend (either tibble or data.table, or even something else like dbplyr in the future), but to have the default behaviour use something less advanced users might be more familiar with, namely, ye old data.frame.

@brownag (Member, Author) commented Dec 8, 2020

Given the way I have implemented data.table and tibble support, it is eminently possible to support some other things like dbplyr for "lazy" SoilProfileCollections. I actually tried this several months ago -- it would be worth revisiting given some more recent developments in that arena.

I think we are all in agreement -- the default for SPCs is data.frame -- and all you will ever see out of them are data.frames unless you manually change the data.frame class in the SPC and rebuild (as I do in the example above), or build with the desired object type, e.g. depths(some_tibble) <- id ~ top + bottom.
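For instance, a minimal sketch of that last pattern using the sp4 sample data (assuming tibble is installed):

library(aqp)
library(tibble)

data(sp4)
sp4_tbl <- as_tibble(sp4)
# promoting a tibble keeps the tabular slots of the SPC tibble-backed
depths(sp4_tbl) <- id ~ top + bottom
class(horizons(sp4_tbl))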

On-the-fly conversion is really not an issue. I suppose if you want to manually trigger a different output class, that could be done on a function-specific basis. Personally, I prefer to get results back in the same class as the slots of my SPC. So, forgive me for not thinking this is a big deal. We are already doing mostly scenarios 1 and 3 for several things I have implemented in aqp. For instance, all calls to combine (formerly union) use data.table::rbindlist. Likewise, j-index subsetting with [ uses data.table. The user is none the wiser -- they always get results back in the class that they created their object with.
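To make that concrete, here is a rough sketch of the internal pattern -- a hypothetical helper, not part of aqp, and it assumes aqp_df_class() returns one of "data.table", "tbl_df", or "data.frame":

library(aqp)
library(data.table)

# hypothetical helper: compute profile thickness with data.table internally,
# then return the result in whatever class the SPC was built with
profile_thickness <- function(object) {
  h <- as.data.table(horizons(object))
  hzd <- horizonDepths(object)
  res <- h[, .(thickness = max(get(hzd[2])) - min(get(hzd[1]))),
           by = c(idname(object))]
  switch(aqp_df_class(object),
         "data.table" = res,
         "tbl_df"     = tibble::as_tibble(res),
         as.data.frame(res))
}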

@dylanbeaudette (Member)

Not a big deal, but I'm trying to get a better idea of how you see these functions being invoked and what the expectations are. It seems like we are all on the same page. I've had similar conversations with myself about internal use of data.table in the slice re-write.

brownag added a commit that referenced this issue Jan 20, 2021
brownag added a commit that referenced this issue Jan 20, 2021
brownag added a commit that referenced this issue Jan 21, 2021
* checkHzDepthLogic: use data.table instead of profileApply #157 #174

* data.table: R CMD check-safe .() -> list() and define "undefined globals" as null locally #157 #174

* #190; looks like this got uncommented when I rebased onto latest changes

@brownag (Member, Author) commented Jan 22, 2021

I am going to close this issue since the new method I posted here was incorporated into checkHzDepthLogic.

My feelings on mutate_profile and profileApply are as follows:

  • mutate_profile: a major optimization is coming soon (I will post an issue when closer to having a properly generalized prototype)
  • profileApply: will remain unchanged
