data.table optimization of SPC j-index and adding k-keywords #201
Conversation
For what it is worth, I tried several more optimizations on the `aggregate`-based code.

EDIT: show benchmark with minimum iterations = 10
I really didn't intend to work on this over the long weekend, but got excited when I saw these opportunities. We need to strongly consider the set-up/teardown tradeoffs of data.table approaches: since subset operations are disproportionately applied to smaller SPCs, optimizing specifically for large SPCs may be the wrong target. Triggering more performant code at a `length(SPC)` threshold (e.g. 10000) may be an idea to consider.
Thanks for putting this together. This is an important consideration, thanks for suggesting it. I was starting to wonder about the intersection of the relative performance "curves" for base vs. DT. We should outline a strategy (affected functions, threshold for switching implementation) before bolting on a parallel implementation. Also, it would be ideal if switching from base to DT were constrained to the lowest level of abstraction possible; I don't look forward to maintaining / debugging forking paths in every function. I'm going to make a similar suggestion in #199.
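The threshold idea above could be kept to a single low-level helper rather than forking every user-facing function. A minimal sketch, assuming a plain data.frame of horizons; `.extract_rows` and the `threshold` default are hypothetical, not aqp code:

```r
# Hypothetical sketch (not aqp code): switch implementations at a size
# threshold, keeping the base/data.table fork in one low-level helper.
.extract_rows <- function(d, idx, threshold = 10000) {
  if (nrow(d) >= threshold && requireNamespace("data.table", quietly = TRUE)) {
    # large table: data.table subsetting scales better
    dt <- data.table::as.data.table(d)
    as.data.frame(dt[idx])
  } else {
    # small table: base subsetting avoids data.table set-up cost
    d[idx, , drop = FALSE]
  }
}

d <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
.extract_rows(d, c(2, 4))  # small input takes the base path
```

Centralizing the fork like this would also make it easy to tune (or disable) the threshold in one place once benchmarks settle the crossover point.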
It would be cool if the SPC j-index (or a new k-index?) supported special keywords.
This is a neat idea. I've had thoughts about making something like this, but in the form of a "slice" index.
So, I got a working demo:

# DEMO: base non-standard eval of keyword in ... "k-index"
library(aqp, warn = FALSE)
#> This is aqp 1.27
data(sp4)
depths(sp4) <- id ~ top + bottom
# .LAST: last horizon in profiles 5 thru 10
sp4[5:10, , .LAST]
#> SoilProfileCollection with 6 profiles and 6 horizons
#> profile ID: id | horizon ID: hzID
#> Depth range: 16 - 40 cm
#>
#> ----- Horizons (6 / 6 rows | 10 / 14 columns) -----
#> id hzID top bottom name K Mg Ca CEC_7 ex_Ca_to_Mg
#> mendocino 16 8 30 Bt2 0.2 30.5 3.7 22.9 0.12
#> napa 18 6 20 Bt 0.1 16.2 21.5 27.9 1.32
#> san benito 20 8 20 Bt 0.0 0.1 5.6 5.6 0.11
#> shasta 22 3 40 Bt 0.2 10.1 2.0 12.2 0.20
#> shasta-trinity 27 23 40 Bt2 0.1 64.9 0.8 48.7 0.01
#> tehama 30 7 16 Bt2 0.2 27.7 13.7 30.0 0.50
#>
#> ----- Sites (6 / 6 rows | 1 / 1 columns) -----
#> id
#> mendocino
#> napa
#> san benito
#> shasta
#> shasta-trinity
#> tehama
#>
#> Spatial Data: [EMPTY]
# use it to implement #199
getLastHorizonID <- function(x) {
res <- hzID(x[, , .LAST])
names(res) <- profile_id(x)
return(res)
}
getLastHorizonID(sp4)
#> colusa glenn kings mariposa mendocino
#> "4" "6" "9" "13" "16"
#> napa san benito shasta shasta-trinity tehama
#>           "18"           "20"           "22"           "27"           "30"

I think f109923 lays down a pretty extensible framework for other types of operations in this style. But, for instance, I could implement other keywords the same way. I think it makes sense to do this in the extraction method itself. All that said, I am not super attached to the idea, just showing how it could be implemented.
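The keyword mechanism above boils down to non-standard evaluation of `...`. A minimal base-R sketch of that idea, not the aqp implementation; `.keywords` and `last_index` are made-up names:

```r
# Minimal base-R sketch (not the aqp code) of detecting special keyword
# symbols such as .LAST passed via ... without evaluating them.
.keywords <- function(...) {
  dots <- as.list(substitute(list(...)))[-1]  # unevaluated arguments
  vapply(dots, deparse, character(1))         # symbol names as strings
}

# index of the last element within each run of group labels
last_index <- function(groups, ...) {
  if (".LAST" %in% .keywords(...)) {
    cumsum(rle(groups)$lengths)
  } else {
    seq_along(groups)
  }
}

g <- c("a", "a", "b", "b", "b", "c")
last_index(g, .LAST)  # 2 5 6
```

Because `.LAST` is captured by `substitute()` and never evaluated, it does not need to exist as an object; it acts purely as a flag.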
I do like the idea of extended keyword functionality. Nice work.
More tinkering with some new keywords.

EDIT: fix for i+j+k indexing combined

# DEMO: base non-standard eval of keyword in ... "k-index": SPC[i, j, ...]
# version 1: add support for .LAST
# version 2: add support for .FIRST, .HZID and multiple special keywords
library(aqp, warn = FALSE)
#> This is aqp 1.27
data(sp4)
depths(sp4) <- id ~ top + bottom
# .LAST
sp4[, , .LAST]
#> SoilProfileCollection with 10 profiles and 10 horizons
#> profile ID: id | horizon ID: hzID
#> Depth range: 16 - 49 cm
#>
#> ----- Horizons (6 / 10 rows | 10 / 14 columns) -----
#> id hzID top bottom name K Mg Ca CEC_7 ex_Ca_to_Mg
#> colusa 4 30 42 Bt2 0.1 44.3 0.3 43.0 0.01
#> glenn 6 9 34 Bt 0.3 18.9 4.5 27.5 0.20
#> kings 9 13 40 Bt2 0.8 17.7 4.4 20.0 0.25
#> mariposa 13 34 49 Bt3 0.1 78.2 4.4 43.6 0.06
#> mendocino 16 8 30 Bt2 0.2 30.5 3.7 22.9 0.12
#> napa 18 6 20 Bt 0.1 16.2 21.5 27.9 1.32
#> [... more horizons ...]
#>
#> ----- Sites (6 / 10 rows | 1 / 1 columns) -----
#> id
#> colusa
#> glenn
#> kings
#> mariposa
#> mendocino
#> napa
#> [... more sites ...]
#>
#> Spatial Data: [EMPTY]
sp4[, , .HZID]
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30
sp4[, , .LAST, .HZID]
#> [1] 4 6 9 13 16 18 20 22 27 30
sp4[, , .FIRST, .HZID] # this just sets j <- 1
#> [1] 1 5 7 10 14 17 19 21 23 28
sp4[, 1, , .HZID]
#> [1] 1 5 7 10 14 17 19 21 23 28
sp4[, 1000, .FIRST, .HZID] # this sets j <- 1; ignores j input if given
#> [1] 1 5 7 10 14 17 19 21 23 28
sp4[, 1000, .LAST, .HZID] # this ignores j input if given
#> [1] 4 6 9 13 16 18 20 22 27 30
# horizon index of 2nd horizon in each profile
sp4[5:10, 2, .HZID]
#> [1] 15 18 20 22 24 29
# this is more realistic, perhaps:
fivetoten <- sp4[5:10,] # or some more complicated subset()
# third horizon ID, index to horizon data.frame
horizons(fivetoten)[fivetoten[,3,,.HZID],]
#> id name top bottom K Mg Ca CEC_7 ex_Ca_to_Mg sand silt
#> 3 mendocino Bt2 8 30 0.2 30.5 3.7 22.9 0.12 51 26
#> 12 shasta-trinity AB 5 12 0.3 29.3 3.5 29.6 0.12 24 36
#> 17 tehama Bt2 7 16 0.2 27.7 13.7 30.0 0.50 51 17
#> clay CF hzID
#> 3 23 0.80 16
#> 12 40 0.09 25
#> 17 32 0.34 30
# use it to implement #199
getLastHorizonID_v1 <- function(x) {
res <- hzID(x[, , .LAST])
names(res) <- profile_id(x)
return(res)
}
# ~3x more efficient using j-index shortcut
getLastHorizonID_v2 <- function(x) {
res <- hzID(x)[x[, , .LAST, .HZID]]
names(res) <- profile_id(x)
return(res)
}
bench::mark(getLastHorizonID_v1(sp4),
getLastHorizonID_v2(sp4))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 getLastHorizonID_v1(sp4) 3.04ms 3.51ms 271. NA 10.7
#> 2 getLastHorizonID_v2(sp4) 957.26µs 1.07ms 867. NA 12.8
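The ~3x gap between v1 and v2 comes from what each returns: v1 builds a whole subset object just to read its IDs, while v2 only computes row indices and indexes one vector. A base-R illustration with hypothetical data (not aqp internals):

```r
# Base-R illustration of why returning indices (v2-style) beats
# subsetting the whole object (v1-style). Toy data, made-up names.
hz <- data.frame(id   = rep(c("a", "b", "c"), times = c(2, 3, 1)),
                 hzID = as.character(1:6),
                 stringsAsFactors = FALSE)
last_row <- function(d) cumsum(rle(d$id)$lengths)

# v1-style: subset every column, then read hzID
last_v1 <- function(d) d[last_row(d), ]$hzID

# v2-style: index the single column directly
last_v2 <- function(d) d$hzID[last_row(d)]

identical(last_v1(hz), last_v2(hz))  # TRUE
```

The cost difference grows with the number of columns, since v1 copies all of them on every call.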
Really excited about this PR!! I've got a couple of new uses. Will try to finish review tonight. Nice!
# handle special keywords in "k" index
for (k in ksub) {
  if (!is.na(k)) {
    switch(k,
This is very clever, good idea.
This was inspired by discussion on #199, a continuation of the efforts in #157, prior optimizations of SPC operators, and the customizations for data.table SPCs in #155. This PR extends the benefits to all SPC data.frame subclasses.
EDIT: Triggering more performant code at a `length(SPC)` threshold (e.g. 10000) may be an idea to consider.

EDIT2: Agreeing with Dylan below. I reverted this to draft for now.
Some more explanation. Before this PR, the data.table SPC j-index would only be invoked if the `aqp_df_class` of the object is `"data.table"`. This PR removes one of the few code "forks" between data.table and base in favor of data.table. We could more formally "archive" the base R equivalent code in tests where results should be expected to be the same. For now, I left mine commented, and there are several new realizations of the base code in this PR worth trying out.
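One way to "archive" the base R equivalent is to keep it alive as the reference implementation inside a test, so the data.table path must always agree with it. A hypothetical sketch (function names and data are made up, not aqp code):

```r
# Hypothetical sketch: retired base code survives as the reference
# that the data.table version is checked against.
base_last_per_group <- function(id) cumsum(rle(id)$lengths)

dt_last_per_group <- function(id) {
  dt <- data.table::data.table(id = id, row = seq_along(id))
  dt[, list(last = row[.N]), by = id]$last
}

testthat::test_that("data.table result matches archived base code", {
  id <- rep(letters[1:3], times = c(2, 3, 1))
  testthat::expect_identical(dt_last_per_group(id), base_last_per_group(id))
})
```

This keeps the old behavior documented and executable, rather than buried in git history or commented-out blocks.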
In some ways I see git/PR/issue history as documentation of these decisions and options, but that isn't ideal. The current state of things in master reflects my best judgment [at the time] about when to invoke data.table, usually based on some fairly thoughtful benchmarks, but imperfect and a product of their time to be sure, as this PR reveals.

I think I would be willing to accept a small performance hit for small SPCs in exchange for better scaling. The hit might be smaller still if we adopted data.table as a paradigm; it is unclear to me at this point where/when exactly the performance hit happens.
There may be plenty of cases where using data.table rather than a base approach is not necessary (even if the object is a data.table). Please check out the jindex-benchmarks.R file on this `dtspcj` branch if interested, but the small-SPC benchmark "concerning" me is this one.

The differences between base 2, base 3 and data.table 2 are pretty marginal at this scale. Overall, data.table appears about 5-10% slower by all stats for SPCs on the order of 10^0 / 10^1 profiles. By 100 profiles, data.table is ahead.
Another thing to consider is that data.table can probably (?) make better use of e.g. multiple cores. But is that "overkill" for tiny operations invoked behind the scenes that can't be (or aren't) effectively parallelized? I need to play around more with this.
Finally, are Windows machines able to realize full benefits of data.table? How about ones with crappy real-time scanning? Can we set up a matrix comparison of architectures, number of cores, etc?
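For the multi-core question, data.table's thread count can be controlled explicitly, which gives one way to measure whether parallelism helps or hurts the tiny hidden operations. A short sketch using data.table's `setDTthreads()` / `getDTthreads()`:

```r
# Pin data.table to a single thread to isolate parallel start-up
# overhead in benchmarks, then restore the previous setting.
library(data.table)

old <- setDTthreads(1)   # force single-threaded; returns prior setting
getDTthreads()           # now 1
setDTthreads(old)        # restore
```

Benchmarking the small-SPC operations at 1 thread vs. the default would show whether thread start-up is part of the 5-10% gap seen above.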
Original post
I realized the j-index extraction could be made more efficient than base, thus allowing me to use the data.table code even when `aqp_df_class(object) != "data.table"`. There was an extra costly operation in there that was not needed, and it was the difference between outperforming base or not on smaller collections.

I added a demo in the /misc/aqp2 folder where many of my "big SPC" optimizations were being tested. This shows a significant move towards more performant SPC operations, coming from a better understanding of memory management and of data.table that I have built up over the past several months.
The following benchmark compares the base (`aggregate`-based) code that I optimized several months ago, the "optional" less efficient DT code, and the new, better DT code on a random 100k profile collection (see demo .R file).

NOTE: these benchmarks do not include any i-index subsetting, despite `1:10` as the i argument. i-index subsetting is orders of magnitude faster and can be readily accomplished with base.