first cut at abstracting the fixing of NA bottom depths, safely constrained to the deepest horizon per profile #199
Conversation
Have a look, let me know what you think as time permits. I've tried to make the abstraction safer than the original implementation so that we don't accidentally create "worse off" profiles in the presence of NA depths above the deepest horizon of a profile in a collection. This required a new function …
This looks good to me. It will be handy to have a function to do this operation for a variety of reasons. Here is an alternate approach using {data.table}, for your consideration. It passes the current tests.
getLastHorizonID <- function(x) {
# with data.table
hztb <- horizonDepths(x)
hzidn <- hzidname(x)
idn <- idname(x)
# iterate over profile horizon subset data.frame; return deepest horizon IDs
.SD <- NULL
.N <- NULL
h <- data.table::as.data.table(horizons(x))
h.sub <- h[, .SD[.N,], by = list(h[[idn]])]
res <- h.sub[[hzidn]]
names(res) <- h.sub[[idn]]
# a named vector, in the same order as profile_id(x)
return(res)
}
Some {data.table} notes: I have not come up with a data.table style guide preference, but have started to ponder it, at least in terms of the stuff I have put in aqp. Some of my existing code could certainly be cleaned up stylistically. In this case, using … One can …
This is great, thanks for the review and alternative. I'll move the …
Marking as approved. Do what ye will with data.table. I support moving the order test outside the function -- as in my simple understanding there shouldn't be a way for that thing to trigger. May need some tests with severely corrupted input SPCs.
This would be a fine addition: I'm still not 100% certain about some of my recent {plyr} and {reshape} replacements.
cannot account for every possible issue here and resisting the temptation to automatically invoke repairMissingHzDepths()
Sorry I'm late to this discussion, but I have an alternative suggestion. I tested it on my snapshot of the NASIS pedons, all 569,661, and compared it to Andrew's data.table example above. It appears to be significantly faster (6 vs 60 sec), even if the data needs to be ordered first. The results are the same for both methods. I provided 2 examples of how to answer this question using aggregate().
Thanks Stephen for pushing further on benchmarks. I suggested data.table offhand and, like I said, I didn't benchmark it. My code is actually slightly slower than Dylan's code! I can look more at this during the week, but I have provided a more "competitive" data.table solution; can you try it on the snapshot and see if it at least scales similarly to base? If it is not as good, then I would favor keeping a base solution. Your aggregate "Option 2" would be my preference based on benchmarks, I think! We have run into similar cases, like where this is applied in …
Edit: changed the example slightly after the original post; data.table was returning the row index (hzID in the test set) but the original function is supposed to get the …
# dylans original code
getLastHorizonID_sapply <- function(x) {
hztb <- horizonDepths(x)
hzidn <- hzidname(x)
idn <- idname(x)
# basic idea: iterate over profiles, but only within horizon data
# return deepest horizon IDs
h <- horizons(x)
h <- split(h, h[[idn]])
# this is the safe set of horizons which can be repaired
res <- sapply(h, function(i) {
bottom.idx <- which.max(i[[hztb[1]]])
res <- i[[hzidn]][bottom.idx]
return(res)
})
## TODO: this should never happen, including until tests are complete
# just in case, ensure that profile order has not been compromised
if(! all(profile_id(x) == names(res))) {
stop('results out of order, how did this happen?', call. = FALSE)
}
# a named vector, in the same order as profile_id(x)
return(res)
}
# my original suggested code (not actually faster!)
getLastHorizonID_DT_original <- function(x) {
# with data.table
hztb <- horizonDepths(x)
hzidn <- hzidname(x)
idn <- idname(x)
# iterate over profile horizon subset data.frame; return deepest horizon IDs
.SD <- NULL
.N <- NULL
h <- data.table::as.data.table(horizons(x))
h.sub <- h[, .SD[.N,], by = list(h[[idn]])]
res <- h.sub[[hzidn]]
names(res) <- h.sub[[idn]]
# a named vector, in the same order as profile_id(x)
return(res)
}
# stephen suggestion #1
getLastHorizonID_aggregate1 <- function(spc, order = FALSE) {
hztb <- horizonDepths(spc)
hzidn <- hzidname(spc)
id <- idname(spc)
h <- horizons(spc)
if (order == TRUE) {
h <- h[order(h[[id]], h[[hztb[1]]], h[[hztb[2]]]), ]
}
# option 1: count horizons per profile; the cumulative sum of those counts gives the row index of the deepest horizon in each profile
maxdep <- aggregate(x = h[hztb],
by = list(id = h[[id]]),
FUN = function(x) length(x)#,
# na.action = na.omit
)
idx_h <- cumsum(maxdep[[hztb[2]]])
option_1 <- h[idx_h, hzidn]
# add names
names(option_1) <- profile_id(spc)
return(option_1)
}
# stephen suggestion #2
getLastHorizonID_aggregate2 <- function(spc, order = FALSE) {
hztb <- horizonDepths(spc)
hzidn <- hzidname(spc)
id <- idname(spc)
h <- horizons(spc)
if (order == TRUE) {
h <- h[order(h[[id]], h[[hztb[1]]], h[[hztb[2]]]), ]
}
# option 2: take the largest row number within each profile; with horizons sorted by profile and top depth, that row is the deepest horizon
h$rn <- 1:nrow(h)
maxdep <- aggregate(x = h[["rn"]],
by = list(id = h[[id]]),
FUN = function(x) x[which.max(x)] #,
# na.action = na.omit
)
option_2 <- h[maxdep$x, hzidn]
names(option_2) <- profile_id(spc)
return(option_2)
}
# new data.table code
getLastHorizonID_DT_new <- function(x) {
# with data.table
hztb <- horizonDepths(x)
hzidn <- hzidname(x)
idn <- idname(x)
# use .I to get the row index of the maximum bottom depth within each profile; return deepest horizon IDs
.SD <- NULL
.N <- NULL
h <- data.table::as.data.table(horizons(x))
res <- h[, .I[which.max(.SD[[hztb[2]]])],
by = list(h[[idn]]),
.SDcols = c(hztb[2])]$V1
res <- hzID(x)[res]
names(res) <- profile_id(x)
# a named vector, in the same order as profile_id(x)
return(res)
}
# benchmark with sp4
library(aqp, warn.conflicts=FALSE)
#> This is aqp 1.27
data(sp4)
depths(sp4) <- id ~ top + bottom
bench::mark(
getLastHorizonID_sapply(sp4),
getLastHorizonID_DT_original(sp4),
getLastHorizonID_DT_new(sp4),
getLastHorizonID_aggregate1(sp4),
getLastHorizonID_aggregate2(sp4)
)
#> # A tibble: 5 x 6
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 getLastHorizonID_sapply(sp4) 1.89ms 2.02ms 492. NA
#> 2 getLastHorizonID_DT_original(sp4) 2.07ms 2.39ms 418. NA
#> 3 getLastHorizonID_DT_new(sp4) 909.07µs 1.01ms 980. NA
#> 4 getLastHorizonID_aggregate1(sp4) 945.92µs 1.05ms 939. NA
#> 5 getLastHorizonID_aggregate2(sp4) 867.38µs 956.44µs 999. NA
#> # … with 1 more variable: `gc/sec` <dbl>
Created on 2021-02-14 by the reprex package (v0.3.0)
Below are the results from my snapshot, reusing your above example. Your new DT example is the fastest, by a 10th of a second. I excluded Dylan's example because I got tired of waiting (split does not appear to scale well). I'm sure there are some use cases where DT will be far and away the fastest, but in general my preference would be to defer to base when the results are comparable. Interestingly, the aggregate2 option uses 0.5 GB less memory than the DT_new example.
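For readers without access to the NASIS snapshot, a rough way to approximate the scaling comparison is to benchmark the two fastest candidates from this thread on a larger synthetic SPC. This is only a sketch: the profile count and depth template below are arbitrary assumptions, not the snapshot used above, and the functions referenced are the ones defined earlier in this conversation.
# hedged sketch: scaling comparison on a synthetic SPC (not the NASIS snapshot)
library(aqp)
big <- data.frame(
  id = rep(sprintf("p%06d", 1:50000), each = 4),
  top = rep(c(0, 10, 20, 30), times = 50000),
  bottom = rep(c(10, 20, 30, 40), times = 50000)
)
# promote the data.frame to a SoilProfileCollection
depths(big) <- id ~ top + bottom
bench::mark(
  getLastHorizonID_DT_new(big),
  getLastHorizonID_aggregate2(big),
  check = FALSE  # equality of results is demonstrated elsewhere in this thread
)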
A lower-memory and faster optimization of the data.table code can be done if one is willing to forgo …
And, if I am not mistaken, the …
getLastHorizonID_DT_new3 <- function(x) {
# with data.table
hztb <- horizonDepths(x)
hzidn <- hzidname(x)
idn <- idname(x)
# take the last row (.N) within each profile; relies on horizons being sorted by profile and top depth
.SD <- NULL
.N <- NULL
.I <- NULL
h <- data.table::as.data.table(horizons(x))
idx <- h[, .I[.N], by = list(h[[idn]])]$V1
res <- hzID(x)[idx]
names(res) <- profile_id(x)
# a named vector, in the same order as profile_id(x)
return(res)
}
Nice work @smroecker and @brownag, you guys pick your favorite implementation: fast and correct are my only criteria. Good thinking to keep the memory footprint low. Merge when happy.
On second thought: I'm going to merge and then splice in the …
Why would we not sort that out here?
Yeah, ok. I'm going to leave this to you and Stephen to agree on something. I put forth the first draft to get the ball rolling and you two have extended the performance considerably. I would like all of us to consider the following while optimizing code: …
With that in mind, pick a solution that works for the medium term, merge, and then continue to optimize as time permits. As fun as it can be to play code golf, we have to balance that tendency with getting reliable code into master. Thanks for all of the input and have fun finishing this PR.
Those considerations sound fine to me. I would like to review the tests for these functions during the work week, but can do that ASAP.
As a comment on performance and accuracy combined, I think we should in this case be relying on something like …
Sorry that I did not provide a better explanation of the specific DT keywords. Note that the only thing that changed between my original suggestion and my most recent suggestion is swapping …
I am pretty sure that the last data.table code I provided is as performant as, if not more so than, base in this case, for small and large collections. The combination of … That package-performance breakdown may not be universally true, and clearly there are many ways to do the same thing. I leaned towards .SD because your original solution works over data.frames, but as Stephen rightly observed in his suggestion, this is a simpler vector operation. A couple of points: …
Thanks for putting all of these issues together.
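For reference, a minimal standalone illustration (toy data, not aqp objects) of the two data.table idioms being contrasted in this thread: .SD[.N] materializes the last row of every group, while .I[.N] only returns the global row numbers of those rows, which is what the newer, lighter version exploits.
# toy illustration of .SD[.N] vs .I[.N]; assumes only the data.table package
library(data.table)
d <- data.table(
  id     = c("a", "a", "b", "b", "b"),
  bottom = c(10, 25, 5, 30, 46),
  hzID   = as.character(1:5)
)
# .SD[.N]: copies the last row of each group into a new table
d[, .SD[.N], by = id]
# .I[.N]: returns only the global row numbers of the last row per group
idx <- d[, .I[.N], by = id]$V1
d$hzID[idx]   # same horizons, obtained without copying whole rows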
I updated the function that I suggested and that was added to this branch by Dylan. With my latest suggestion using … If @smroecker is OK with it, we can merge this at @dylanbeaudette's convenience and discuss the merits of base implementations of various methods in #157 or a new issue. In #201 I highlight some related concepts that probably warrant discussion there. Upon reflection, that discussion would be easier with all the latest code in master. We probably cannot settle the base vs. data.table issue in this PR; it is pretty philosophical. Despite being in support of the long-term goals of data.table conversion, I for one can see the value of retaining base behavior for nearly every aqp feature. I know that we do not prefer branching of base and DT options, but I still think there is value in preserving no-compiled-code behavior of key functions, even if just for posterity/backwards compatibility. My original point was just to consider data.table, not really to get into a benchmarking war that goes to the core of the SPC... or to dredge up my unresolved old data.table issues. I know @dylanbeaudette said he wanted to use this new pair of functions, and the original probably worked fine for your case.
The new results are impressive. I support adding Andrew's new version. I will have to become more familiar with the data.table syntax moving forward.
It is cool to see that below 1 second! I think we could have the base implementation Stephen suggested incorporated as part of a new test of the data.table-powered function, as a sort of formal way of recognizing equality, more so than a need for that test per se. We never really know for sure what may happen to data.table syntax in the future, for instance. I.e., have the function powered by data.table do its thing on a sample SPC, then do it independently with base on …
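A hedged sketch of what such a test might look like, reusing the getLastHorizonID_DT_new() and getLastHorizonID_sapply() definitions from earlier in this thread rather than whatever name the function ends up with in aqp:
# sketch of an equality test between the data.table and base implementations
# (uses the function definitions posted earlier in this conversation)
library(aqp)
library(testthat)
test_that("data.table and base implementations agree on sp4", {
  data(sp4)
  depths(sp4) <- id ~ top + bottom
  res_dt   <- getLastHorizonID_DT_new(sp4)
  res_base <- getLastHorizonID_sapply(sp4)
  # both are named vectors of horizon IDs, in profile_id() order
  expect_equal(res_dt, res_base)
})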
I'll be sure to copy important notes from this PR into a more visible location. The trade-offs in development time / complexity will be important considerations on into the future.
I added a case to the misc/sandbox/find-fix-NA-depths.R file for an edge case with multiple NA top depths. I am not sure what the expected/desired behavior is, so I have not yet written any new test. The current behavior (not doing anything) may be what is desired, so we could simply write a test that acknowledges that. I provide an example where, in a source data.frame, two top depths in a profile are NA. This is unlikely to occur (at least from our common data sources), but it illustrates the basic concept that ID + top-depth sorting for the SPC will not adequately account for multiple NA top depths. In my mind -- the … I think …
library(aqp, warn.conflicts=FALSE)
# extra bad case -- when two top depths are NA we cannot know which is the "bottom most"
data(sp4)
x2 <- sp4
x2$top[3:4] <- NA
x2$bottom[3:4] <- x2$bottom[4:3]
# we get a warning on build -- top depths with NA are extra bad news
depths(x2) <- id ~ top + bottom
#> Warning: Horizon top depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()
# this is actually technically possible, tho unlikely as bottom depth is ignored by SPC sort
# invalid depth logic in profile 1 (colusa)
checkHzDepthLogic(x2)
#> id valid depthLogic sameDepth missingDepth overlapOrGap
#> 1 colusa FALSE TRUE FALSE TRUE FALSE
#> 2 glenn TRUE FALSE FALSE FALSE FALSE
#> 3 kings TRUE FALSE FALSE FALSE FALSE
#> 4 mariposa TRUE FALSE FALSE FALSE FALSE
#> 5 mendocino TRUE FALSE FALSE FALSE FALSE
#> 6 napa TRUE FALSE FALSE FALSE FALSE
#> 7 san benito TRUE FALSE FALSE FALSE FALSE
#> 8 shasta TRUE FALSE FALSE FALSE FALSE
#> 9 shasta-trinity TRUE FALSE FALSE FALSE FALSE
#> 10 tehama TRUE FALSE FALSE FALSE FALSE
# plotSPC
# NB: shows the horizons even though they have no top depth
plotSPC(x2)
# inspect
horizons(x2[1,])
#> id name top bottom K Mg Ca CEC_7 ex_Ca_to_Mg sand silt clay CF
#> 1 colusa A 0 3 0.3 25.7 9.0 23.0 0.35 46 33 21 0.12
#> 2 colusa ABt 3 8 0.2 23.7 5.6 21.4 0.23 42 31 27 0.27
#> 3 colusa Bt1 NA 42 0.1 23.2 1.9 23.7 0.08 40 28 32 0.27
#> 4 colusa Bt2 NA 30 0.1 44.3 0.3 43.0 0.01 27 18 55 0.16
#> hzID
#> 1 1
#> 2 2
#> 3 3
#> 4 4
# do the repair
z2 <- repairMissingHzDepths(x2)
plot(z2) # result is unaltered -- is this what we expect? / can we _do_ anything w/ NA top depths?
# in the absence of top depth, order cannot be guaranteed
# the most common sources of NA top depths will also have bottom be NA -- e.g. empty NASIS record
horizons(z2[1,])
#> id name top bottom K Mg Ca CEC_7 ex_Ca_to_Mg sand silt clay CF
#> 1 colusa A 0 3 0.3 25.7 9.0 23.0 0.35 46 33 21 0.12
#> 2 colusa ABt 3 8 0.2 23.7 5.6 21.4 0.23 42 31 27 0.27
#> 3 colusa Bt1 NA 42 0.1 23.2 1.9 23.7 0.08 40 28 32 0.27
#> 4 colusa Bt2 NA 30 0.1 44.3 0.3 43.0 0.01 27 18 55 0.16
#> hzID .repaired
#> 1 1 FALSE
#> 2 2 FALSE
#> 3 3 FALSE
#> 4 4 FALSE
And one point on the all …
I leave it up to you all to decide whether the effects of NA top depths on sorting, and on repairing missing depths in general, are under the purview of this PR.
That is something left to another issue. I think that the "missing bottom depth of the deepest horizon" case is a very specific example of missing horizon depths that we can actually do something (reasonable) about. Missing top depths are the realm of corrupt data, and "fixing" them becomes much more involved.
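For contrast with the NA top-depth example above, a minimal sketch of the case this PR actually targets: only the bottom depth of the deepest horizon is missing. The exact fill value chosen by repairMissingHzDepths() depends on its defaults, so no particular output is asserted here.
# sketch: the repairable case -- NA bottom depth on the deepest horizon only
library(aqp)
data(sp4)
x <- sp4
# drop the bottom depth of the deepest horizon of the first profile (colusa, Bt2)
x$bottom[4] <- NA
depths(x) <- id ~ top + bottom
# the missing bottom depth of the deepest horizon should be filled;
# all other horizons are untouched, and a .repaired flag column is added
z <- repairMissingHzDepths(x)
horizons(z[1, ])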
And again--thanks for all of the hard work on this one.
* Use data.table for all j-index subsetting #157
* Significant optimization for j-index subsetting of SPC
* Add more competitive version of base function for j-index
* spc_j_base_2: 100k benchmark ~0.1s faster than original, nearly 3x slower than new DT
* spc_j_base_3: more competitive for small SPCs
* add demo of SPC non-standard eval "k-index" .LAST keyword #199 #201
* add experimental support for .FIRST, .HZID and multiple special keywords
* better logic for i+j+k
* Convert demo to tests
* Docs
addressing #198
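As a rough illustration of the "k-index" keywords mentioned in the commit messages above (experimental at the time; the syntax shown reflects later aqp releases and may differ from the version this PR merged into, so treat it as an assumption-laden sketch):
# hedged sketch of the experimental SPC "k-index" non-standard evaluation keywords
library(aqp)
data(sp4)
depths(sp4) <- id ~ top + bottom
# keep only the deepest horizon of each profile
sp4[, , .LAST]
# .FIRST and .HZID are the other special keywords mentioned above;
# .HZID returns horizon row indices rather than an SPC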