
data.table optimization of SPC j-index and adding k-keywords #201

Merged 10 commits into master from dtspcj on Feb 24, 2021

Conversation

@brownag (Member) commented Feb 15, 2021

This was inspired by discussion on #199; it continues the efforts in #157, prior optimizations of SPC operators, and the data.table SPC customizations in #155. This PR extends those benefits to all SPC data.frame subclasses.

EDIT:

  • We need to carefully weigh the set-up/teardown costs of data.table approaches: subset operations are applied disproportionately to smaller SPCs, so optimizing specifically for large SPCs may be the wrong target. Triggering the more performant code above a length(SPC) threshold (e.g. 10000) is an idea worth considering.

EDIT2:
Agreeing with Dylan below, I reverted this to draft for now.

Some more explanation: before this PR, the data.table SPC j-index code was only invoked when the aqp_df_class of the object was "data.table". This PR removes one of the few remaining code "forks" between data.table and base R, in favor of data.table.
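Schematically, the kind of fork being removed looks like this (a paraphrase with hypothetical helper names, not the actual aqp source):

# before: fork on the data.frame subclass of the object
if (aqp_df_class(object) == "data.table") {
  idx <- .spc_j_DT(h, j)    # hypothetical data.table code path
} else {
  idx <- .spc_j_base(h, j)  # hypothetical base R code path
}

# after: always route through data.table, converting if needed
idx <- .spc_j_DT(data.table::as.data.table(h), j)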

We could more formally "archive" the base R equivalent code by using it in tests where results are expected to be identical. For now, I left mine commented out; there are also several new realizations of the base code in this PR worth trying out.
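For example, an equivalence test along these lines could pin the implementations together (a sketch; spc_j_DT_2 and spc_j_base_3 are the functions from jindex-benchmarks.R, assumed to be available to the test):

library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

testthat::test_that("data.table j-index matches archived base code", {
  testthat::expect_equal(spc_j_DT_2(sp4, 1:10, 2:3),
                         spc_j_base_3(sp4, 1:10, 2:3))
})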

In some ways I see git/PR/issue history as documentation of decisions and options, but that isn't ideal. The current state of master reflects my best judgment [at the time] about when to invoke data.table -- usually based on fairly thoughtful benchmarks, but imperfect and a product of their time, as this PR reveals.

I think I would be willing to accept a small performance hit for small SPCs in exchange for better scaling. That hit might shrink if we adopted data.table as a paradigm throughout; it is unclear to me at this point where/when exactly it occurs.

There may be plenty of cases where using data.table rather than a base approach is unnecessary (even if the object is a data.table). Please check out the jindex-benchmarks.R file on this dtspcj branch if interested; the small-SPC benchmark "concerning" me is this one:

> microbenchmark::microbenchmark(spc_j_base(sp4, 1:10, 2:3),
+                                spc_j_base_2(sp4, 1:10, 2:3),
+                                spc_j_base_3(sp4, 1:10, 2:3),
+                                spc_j_DT(sp4, 1:10, 2:3),
+                                spc_j_DT_2(sp4, 1:10, 2:3))
Unit: microseconds
                         expr      min        lq      mean    median       uq      max neval
   spc_j_base(sp4, 1:10, 2:3)  990.893 1049.6240 1344.2467 1095.1280 1146.255 8960.763   100
 spc_j_base_2(sp4, 1:10, 2:3)  885.724  931.8925  973.7731  955.7965 1003.316 1247.136   100
 spc_j_base_3(sp4, 1:10, 2:3)  881.381  920.3695  989.1439  956.7870 1011.452 1380.955   100
     spc_j_DT(sp4, 1:10, 2:3) 1573.848 1683.0550 1761.5464 1722.8060 1819.496 2224.064   100
   spc_j_DT_2(sp4, 1:10, 2:3)  938.231  987.7575 1039.5228 1024.0705 1062.644 1584.036   100

The differences between base 2, base 3, and data.table 2 are pretty marginal at this scale. Overall, data.table appears about 5-10% slower by all stats for SPCs on the order of 1 to 10 profiles. By 100 profiles, data.table is ahead.
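For reference, the two strategies differ roughly as follows (toy stand-ins, not the actual spc_j_* implementations; h is a horizon data.frame with profile ID column 'id'):

library(data.table)

# per-profile row selection with base split()/lapply()
toy_j_base <- function(h, jdx) {
  unlist(lapply(split(seq_len(nrow(h)), h$id), function(i) i[jdx]),
         use.names = FALSE)
}

# the same selection as a grouped .I lookup in data.table
toy_j_DT <- function(h, jdx) {
  dt <- as.data.table(h)
  dt[, .I[jdx], by = id]$V1
}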

Another thing to consider is that data.table can probably (?) make better use of, e.g., multiple cores. But is that "overkill" for tiny operations invoked behind the scenes that can't be (or aren't) effectively parallelized? I need to play around more with this.

Finally, are Windows machines able to realize the full benefits of data.table? How about ones with crappy real-time scanning? Can we set up a matrix comparison of architectures, numbers of cores, etc.?
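One concrete knob for such a comparison is data.table's thread count, which can be queried and pinned (real data.table functions; the effect on tiny SPC subsets is the thing to measure, not something claimed here):

library(data.table)
getDTthreads()   # number of threads data.table will currently use
setDTthreads(1)  # pin to one thread, e.g. when benchmarking small SPCs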


Original post
I realized the j-index extraction could be made more efficient than base, allowing me to use the data.table code even when aqp_df_class(object) != "data.table". There was an unneeded, costly operation in there that made the difference between outperforming base or not on smaller collections.

I added a demo in the /misc/aqp2 folder, where many of my "big SPC" optimizations have been tested. It shows a significant move towards more performant SPC operations, the result of a better understanding of memory management and of data.table that I have been developing over the past several months.

The following benchmark compares the base (aggregate-based) code that I optimized several months ago, the "optional" less-efficient DT code, and the new, better DT code on a random 100k-profile collection (see the demo .R file).

NOTE: these benchmarks do not include any i-index subsetting, despite 1:10 being passed as the i argument. i-index subsetting is orders of magnitude faster and can readily be accomplished with base R.

Unit: milliseconds
                                   expr       min        lq      mean    median        uq       max neval
 spc_j_base(hundredthousand, 1:10, 2:3) 1188.0499 1229.9885 1262.2579 1254.0748 1280.6622 1358.5141     5
   spc_j_DT(hundredthousand, 1:10, 2:3)  786.5234  795.5842  810.3728  797.6119  814.2999  857.8445     5
 spc_j_DT_2(hundredthousand, 1:10, 2:3)  389.5375  429.3983  525.6855  432.5545  436.3224  940.6149     5
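
To make the i-index / j-index distinction from the NOTE concrete (standard aqp usage):

library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

sp4[1:10, ]  # i-index: profile (site-level) selection -- cheap row indexing
sp4[, 2:3]   # j-index: horizon selection within each profile -- grouped work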

@brownag (Member, Author) commented Feb 15, 2021

For what it is worth, I tried several more optimizations on the spc_j_base function.

The aggregate FUN can return a logical vector rather than a list, for a minor benefit. do.call('c', ...) may sometimes be faster or safer than unlist -- though I think the latter was benchmarking faster; I'm not sure about that. I don't think vapply(..., FUN.VALUE = logical(1)) helps much.
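The unlist vs. do.call('c', ...) question is easy to check in isolation (a toy list, not the actual aggregate output):

x <- split(seq_len(3e4), rep(1:1e4, each = 3))  # toy list of grouped indices
microbenchmark::microbenchmark(
  unlist(x, use.names = FALSE),
  do.call('c', x)
)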

EDIT: show benchmark with minimum iterations = 10

# A tibble: 4 x 13
  expression                                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 spc_j_base(hundredthousand, 1:10, 2:3)      1.25s    1.33s     0.747        NA     1.87    10    25     13.38s
2 spc_j_base_2(hundredthousand, 1:10, 2:3)    1.16s     1.2s     0.821        NA     1.89    10    23     12.18s
3 spc_j_DT(hundredthousand, 1:10, 2:3)     797.17ms 828.09ms     1.18         NA     2.70    10    23      8.51s
4 spc_j_DT_2(hundredthousand, 1:10, 2:3)   416.67ms 437.63ms     2.15         NA     2.15    10    10      4.64s

@brownag (Member, Author) commented Feb 15, 2021

I really didn't intend to work on this over the long weekend, but got excited when I saw these opportunities.

We need to carefully weigh the set-up/teardown costs of data.table approaches: subset operations are applied disproportionately to smaller SPCs, so optimizing specifically for large SPCs may be the wrong target. Triggering the more performant code above a length(SPC) threshold (e.g. 10000) is an idea worth considering.

@dylanbeaudette (Member) commented:

> We need to carefully weigh the set-up/teardown costs of data.table approaches: subset operations are applied disproportionately to smaller SPCs, so optimizing specifically for large SPCs may be the wrong target. Triggering the more performant code above a length(SPC) threshold (e.g. 10000) is an idea worth considering.

Thanks for putting this together.

This is an important consideration, thanks for suggesting it. I was starting to wonder about where the relative performance "curves" for base vs. DT intersect. We should outline a strategy (affected functions, threshold for switching implementation) before bolting on a parallel implementation. Also, it would be ideal if switching from base to DT were constrained to the lowest level of abstraction possible -- I don't look forward to maintaining/debugging forking paths in every function. I'm going to make a similar suggestion in #199.
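For example, the fork could live in a single low-level helper (hypothetical names, just sketching the threshold idea above):

.spc_j <- function(h, idcol, jdx, threshold = 10000) {
  # fork once, in one low-level helper, based on collection size
  if (length(unique(h[[idcol]])) >= threshold) {
    .spc_j_DT(h, idcol, jdx)    # hypothetical data.table implementation
  } else {
    .spc_j_base(h, idcol, jdx)  # hypothetical base implementation
  }
}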

@brownag (Member, Author) commented Feb 16, 2021

It would be cool if the SPC j-index (or a new k-index?) supported .N or a similar type of keyword/argument/NSE -- which could make the getLastHorizonID function discussed in #199 essentially a wrapper around [,SoilProfileCollection-method with special syntax for the "last horizon" case.
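For reference, .N is data.table's symbol for the number of rows in the current group, so the "last horizon" case is a one-liner in plain data.table (toy table, not SPC syntax):

library(data.table)
h <- data.table(id = rep(c("a", "b"), c(3, 2)), hzID = 1:5)
h[, .SD[.N], by = id]  # last row (horizon) within each profile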

@dylanbeaudette (Member) commented:

> It would be cool if the SPC j-index (or a new k-index?) supported .N or a similar type of keyword/argument/NSE -- which could make the getLastHorizonID function discussed in #199 essentially a wrapper around [,SoilProfileCollection-method with special syntax for the "last horizon" case.

This is a neat idea. I've had thoughts about making something like this in the form of a "slice" index: spc[, , 45] would be the same as slice(spc, 45 ~ .). You can use the j-index with sliced data, but that is an extra step. Anyway, your idea makes more sense in that there is frequently a need to get the first and last horizons -- great justification for a special addition to the [ method. I have no current plans to implement the "slice index"; I'm focusing on the slice, slab, and profile_compare re-writes.
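For context, the current two-step route looks like this (standard aqp syntax; the slice index in the last comment is hypothetical):

library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom

s <- slice(sp4, 0:50 ~ .)  # resample to 1 cm slices over 0-50 cm
s[, 46]                    # j-index into sliced data: the 45-46 cm slice
# a slice index would collapse this to something like: sp4[, , 45]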

@brownag (Member, Author) commented Feb 17, 2021

So, I got SPC[, , .LAST] working... and committed it to this draft PR.

# DEMO: base non-standard eval of keyword in ... "k-index"

library(aqp, warn = FALSE)
#> This is aqp 1.27

data(sp4)
depths(sp4) <- id ~ top + bottom

# .LAST: last horizon in profiles 5 thru 10
sp4[5:10, , .LAST]
#> SoilProfileCollection with 6 profiles and 6 horizons
#> profile ID: id  |  horizon ID: hzID 
#> Depth range: 16 - 40 cm
#> 
#> ----- Horizons (6 / 6 rows  |  10 / 14 columns) -----
#>              id hzID top bottom name   K   Mg   Ca CEC_7 ex_Ca_to_Mg
#>       mendocino   16   8     30  Bt2 0.2 30.5  3.7  22.9        0.12
#>            napa   18   6     20   Bt 0.1 16.2 21.5  27.9        1.32
#>      san benito   20   8     20   Bt 0.0  0.1  5.6   5.6        0.11
#>          shasta   22   3     40   Bt 0.2 10.1  2.0  12.2        0.20
#>  shasta-trinity   27  23     40  Bt2 0.1 64.9  0.8  48.7        0.01
#>          tehama   30   7     16  Bt2 0.2 27.7 13.7  30.0        0.50
#> 
#> ----- Sites (6 / 6 rows  |  1 / 1 columns) -----
#>              id
#>       mendocino
#>            napa
#>      san benito
#>          shasta
#>  shasta-trinity
#>          tehama
#> 
#> Spatial Data: [EMPTY]

# use it to implement #199
getLastHorizonID <- function(x) {
  res <-  hzID(x[, , .LAST])
  names(res) <- profile_id(x)
  return(res)
}

getLastHorizonID(sp4)
#>         colusa          glenn          kings       mariposa      mendocino 
#>            "4"            "6"            "9"           "13"           "16" 
#>           napa     san benito         shasta shasta-trinity         tehama 
#>           "18"           "20"           "22"           "27"           "30"

I think f109923 lays down a pretty extensible framework for other types of operations via the ... argument I added, but I am not in a huge rush to open the floodgates. I need to tend to a few other things myself.

But, for instance, I could implement SPC[, , {expr}, {expr}], where if {expr} is not a special "trigger" expression like .LAST, it is passed to a data.table call to do site/horizon data crunching (on the subset data, grouped by profile/horizon) and add new column(s) based on some calculation. These results would be passed through to the resulting new SPC. I am not sure whether this could improve on some profileApply-based workflows, but it is a thought, especially for cases that rely primarily on horizon data.
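A speculative sketch of how a non-keyword expression could be forwarded to data.table (hypothetical helper, not part of this PR):

library(data.table)

# evaluate a captured expression per profile on the horizon table
evalKExpr <- function(h, idcol, expr) {
  dt <- as.data.table(h)
  dt[, eval(expr), by = c(idcol)]
}

# e.g. horizon thickness sums, grouped by profile ID:
# evalKExpr(horizons(sp4), 'id', quote(.(thk = sum(bottom - top))))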

I think it makes sense to do this in [,SoilProfileCollection-method, as it is one of the few methods that uses the SoilProfileCollection() constructor and thus builds an object anew. It has all of the pieces extracted and subsetted, so it is a good place to take other slots into account -- spatial, diagnostics, etc. -- and somehow make them "available" to these expressions for building a new object.

All that said, I am not super attached to the idea, just showing how it could be implemented.

@dylanbeaudette (Member) commented:

> I think it makes sense to do this in [,SoilProfileCollection-method, as it is one of the few methods that uses the SoilProfileCollection() constructor and thus builds an object anew. It has all of the pieces extracted and subsetted, so it is a good place to take other slots into account -- spatial, diagnostics, etc. -- and somehow make them "available" to these expressions for building a new object.

I do like the idea of extended [ syntax to avoid a large number of helper functions. I can't offer much else right now, other than that I think this is a really nice idea that will make for very compact syntax in the future. I'm not convinced we need to go quite as compact as data.table, but we can definitely build on what they have done.

Nice work.

@brownag (Member, Author) commented Feb 21, 2021

More tinkering with some new k keywords, following on and cleaning up the ideas from implementing .LAST above.

  • .FIRST (first horizon; identical to j = 1)

  • .HZID triggers a more efficient short circuit, returning horizon index positions (in the source SoilProfileCollection) rather than a new SPC.

EDIT: fix for i+j+k indexing combined

# DEMO: base non-standard eval of keyword in ... "k-index": SPC[i, j, ...]
# version 1: add support for .LAST
# version 2: add support for .FIRST, .HZID and multiple special keywords

library(aqp, warn = FALSE)
#> This is aqp 1.27

data(sp4)
depths(sp4) <- id ~ top + bottom

# .LAST
sp4[, , .LAST]
#> SoilProfileCollection with 10 profiles and 10 horizons
#> profile ID: id  |  horizon ID: hzID 
#> Depth range: 16 - 49 cm
#> 
#> ----- Horizons (6 / 10 rows  |  10 / 14 columns) -----
#>         id hzID top bottom name   K   Mg   Ca CEC_7 ex_Ca_to_Mg
#>     colusa    4  30     42  Bt2 0.1 44.3  0.3  43.0        0.01
#>      glenn    6   9     34   Bt 0.3 18.9  4.5  27.5        0.20
#>      kings    9  13     40  Bt2 0.8 17.7  4.4  20.0        0.25
#>   mariposa   13  34     49  Bt3 0.1 78.2  4.4  43.6        0.06
#>  mendocino   16   8     30  Bt2 0.2 30.5  3.7  22.9        0.12
#>       napa   18   6     20   Bt 0.1 16.2 21.5  27.9        1.32
#> [... more horizons ...]
#> 
#> ----- Sites (6 / 10 rows  |  1 / 1 columns) -----
#>         id
#>     colusa
#>      glenn
#>      kings
#>   mariposa
#>  mendocino
#>       napa
#> [... more sites ...]
#> 
#> Spatial Data: [EMPTY]
sp4[, , .HZID]
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30
sp4[, , .LAST, .HZID]
#>  [1]  4  6  9 13 16 18 20 22 27 30
sp4[, , .FIRST, .HZID] # this just sets j <- 1
#>  [1]  1  5  7 10 14 17 19 21 23 28
sp4[, 1, , .HZID]
#>  [1]  1  5  7 10 14 17 19 21 23 28
sp4[, 1000, .FIRST, .HZID] # this sets j <- 1; ignores j input if given
#>  [1]  1  5  7 10 14 17 19 21 23 28
sp4[, 1000, .LAST, .HZID] # this ignores j input if given
#>  [1]  4  6  9 13 16 18 20 22 27 30

# horizon index of 2nd horizon in each profile
sp4[5:10, 2, .HZID]
#> [1] 15 18 20 22 24 29

# this is more realistic, perhaps:
fivetoten <- sp4[5:10,] # or some more complicated subset()

# third horizon ID, index to horizon data.frame
horizons(fivetoten)[fivetoten[,3,,.HZID],]
#>                id name top bottom   K   Mg   Ca CEC_7 ex_Ca_to_Mg sand silt
#> 3       mendocino  Bt2   8     30 0.2 30.5  3.7  22.9        0.12   51   26
#> 12 shasta-trinity   AB   5     12 0.3 29.3  3.5  29.6        0.12   24   36
#> 17         tehama  Bt2   7     16 0.2 27.7 13.7  30.0        0.50   51   17
#>    clay   CF hzID
#> 3    23 0.80   16
#> 12   40 0.09   25
#> 17   32 0.34   30

# use it to implement #199
getLastHorizonID_v1 <- function(x) {
  res <-  hzID(x[, , .LAST])
  names(res) <- profile_id(x)
  return(res)
}

# ~3x more efficient using the .HZID short circuit
getLastHorizonID_v2 <- function(x) {
  res <-  hzID(x)[x[, , .LAST, .HZID]]
  names(res) <- profile_id(x)
  return(res)
}

bench::mark(getLastHorizonID_v1(sp4),
            getLastHorizonID_v2(sp4))
#> # A tibble: 2 x 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 getLastHorizonID_v1(sp4)   3.04ms   3.51ms      271.        NA     10.7
#> 2 getLastHorizonID_v2(sp4) 957.26µs   1.07ms      867.        NA     12.8

@dylanbeaudette (Member) commented:

Really excited about this PR!! I've got a couple of new uses. Will try to finish review tonight. Nice!

Review comment on the k-keyword handling code (excerpt):

# handle special keywords in "k" index
for (k in ksub) {
  if (!is.na(k)) {
    switch(k,

@dylanbeaudette (Member): This is very clever, good idea.

@brownag brownag marked this pull request as ready for review February 24, 2021 16:36
@brownag brownag changed the title Further optimization of SPC j-index Further optimization of SPC j-index and adding k-keywords Feb 24, 2021
@brownag brownag changed the title Further optimization of SPC j-index and adding k-keywords data.table optimization of SPC j-index and adding k-keywords Feb 24, 2021
@brownag brownag merged commit 9a53405 into master Feb 24, 2021
@dylanbeaudette dylanbeaudette mentioned this pull request Feb 24, 2021
@brownag brownag deleted the dtspcj branch March 3, 2021 16:52