
first cut at abstracting the fixing of NA bottom depths, safely constrained to the deepest horizon per profile #199

Merged
merged 6 commits into master from na-depth-fixes
Feb 18, 2021

Conversation

dylanbeaudette
Member

addressing #198

@dylanbeaudette
Member Author

Have a look and let me know what you think as time permits. I've tried to make the abstraction safer than the original implementation so that we don't accidentally create "worse off" profiles when NA bottom depths occur above the deepest horizon of a profile in a collection. This required a new function, getLastHorizonID; I'd be happy to know if there is a more elegant approach.
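To illustrate the intended workflow, a minimal sketch using sp4 (repairMissingHzDepths() is exercised later in this thread; row 4 of sp4 is the deepest horizon of 'colusa', so this simulates a truncated record):

library(aqp)
data(sp4)

# simulate a missing bottom depth on the deepest horizon of 'colusa'
sp4$bottom[4] <- NA
depths(sp4) <- id ~ top + bottom

# the repair is constrained to the deepest horizon of each profile
z <- repairMissingHzDepths(sp4)

# .repaired marks horizons adjusted by the repair
horizons(z[1, ])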

@brownag
Member

brownag commented Feb 12, 2021

This looks good to me. It will be handy to have a function to do this operation for a variety of reasons.

Here is an alternate approach using {data.table}--for your consideration. It passes the current tests.
I didn't benchmark it, but it should be faster than iterating over a split() list of data.frames.

getLastHorizonID <- function(x) {
  # with data.table
  hztb <- horizonDepths(x)
  hzidn <- hzidname(x)
  idn <- idname(x)
  
  # iterate over profile horizon subset data.frame; return deepest horizon IDs
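  # NULL assignments avoid "no visible binding" notes from R CMD check
  # for data.table's non-standard evaluation symbols (.SD, .N)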
  .SD <- NULL
  .N <- NULL
  h <- data.table::as.data.table(horizons(x))
  h.sub <- h[, .SD[.N,], by = list(h[[idn]])]
  res <- h.sub[[hzidn]]
  names(res) <- h.sub[[idn]]
  
  # a named vector, in the same order as profile_id(x)
  return(res)
}

Some {data.table} notes: I have not come up with a data.table style guide preference, but have started to ponder it, at least in terms of stuff I have put in aqp. Some of my existing code could certainly be cleaned up stylistically. In this case, using .SD/.N makes sense for your last-horizon case -- and there are a couple of ways to supply idname to by=.

One can use substitute(idname(x)) and then by=c(eval(idn)), or one can just use by=list(h[[idn]]). I think the latter, which I do here, is easier to think about and less likely to go haywire.
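To make those two forms concrete, a sketch (assuming x is a SoilProfileCollection, with idn <- idname(x) and h <- data.table::as.data.table(horizons(x)), and a trivial horizon count as the j expression):

# form 1: group by the ID column name (character), via substitute()/eval()
h[, .N, by = c(eval(substitute(idn)))]

# form 2: group by the ID vector itself (the form used in the function above)
h[, .N, by = list(h[[idn]])]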

@dylanbeaudette
Member Author

This is great, thanks for the review and alternative. I'll move the all(profile_id(x) == names(res)) check to the actual tests.
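For what it's worth, a rough sketch of that test (assuming the function keeps the name getLastHorizonID and sp4 as a test fixture):

library(aqp)
library(testthat)

data(sp4)
depths(sp4) <- id ~ top + bottom

# one horizon ID per profile, named by and ordered like profile_id()
res <- getLastHorizonID(sp4)
expect_length(res, length(sp4))
expect_true(all(profile_id(sp4) == names(res)))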

Member

@brownag brownag left a comment

Marking as approved. Do what ye will with data.table. I support moving the order test outside the function -- as in my simple understanding there shouldn't be a way for that thing to trigger. May need some tests with severely corrupted input SPCs.

@dylanbeaudette
Member Author

Some {data.table} notes: I have not come up with a data.table style guide preference, but have started to ponder it, at least in terms of stuff I have put in aqp. Some of my existing code could certainly be cleaned up stylistically. In this case, using .SD/.N makes sense for your last-horizon case -- and there are a couple of ways to supply idname to by=.

One can use substitute(idname(x)) and then by=c(eval(idn)), or one can just use by=list(h[[idn]]). I think the latter, which I do here, is easier to think about and less likely to go haywire.

This would be a fine addition: I'm still not 100% certain about some of my recent {plyr} and {reshape} replacements.

cannot account for every possible issue here and resisting the temptation to automatically invoke repairMissingHzDepths()
@smroecker
Member

Sorry I'm late to this discussion, but I have an alternative suggestion. I tested it on my snapshot of the NASIS pedons, all 569,661, and compared it to Andrew's data.table example above. It appears to be significantly faster (6 vs 60 sec), even if the data needs to be ordered first. The results are the same for both methods. I provided 2 examples of how to answer this question using aggregate().

getLastHorizonID <- function(spc, order = FALSE) {
  
  hztb  <- horizonDepths(spc)
  hzidn <- hzidname(spc)
  id    <- idname(spc)
  
  h <- horizons(spc)
  
  if (order == TRUE) {
     h <- h[order(h[[id]], h[[hztb[1]]], h[[hztb[2]]]), ]
  }
  
  # option 1: per-profile horizon counts; the cumulative sum gives the
  # row index of the last horizon in each profile
  maxdep <- aggregate(x   = h[hztb],
                      by  = list(id = h[[id]]),
                      FUN = length)
  idx_h <- cumsum(maxdep[[hztb[1]]])
  option_1 <- h[idx_h, hzidn]
  
  
  # option 2: the per-profile maximum row number marks the last horizon
  h$rn <- 1:nrow(h)
  maxdep <- aggregate(x   = h[["rn"]],
                      by  = list(id = h[[id]]),
                      FUN = function(x) x[which.max(x)])
  option_2 <- h[maxdep$x, hzidn]
  
  return(data.frame(option_1, option_2))
  
}

@brownag
Member

brownag commented Feb 14, 2021

Thanks, Stephen, for pushing further on benchmarks. I suggested data.table offhand and, like I said, I didn't benchmark it. My code is actually slightly slower than Dylan's!

I can look more at this during the week, but I have provided a more "competitive" data.table solution; can you try it on the snapshot and see if it at least scales similarly to base? If it is not as good, then I would favor keeping a base solution.

Your aggregate "Option 2" would be my preference based on benchmarks, I think! We have run into similar cases--like where this is applied in min(SPC) / max(SPC), where data.table does not outperform tapply.

Edit: I changed the example slightly after the original post; data.table was returning the row index (hzID in the test set) but the original function is supposed to get the hzidcol. This makes my solution a bit slower.
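As an aside on that point, a minimal tapply-based sketch of the same last-horizon lookup (not part of the benchmarks below; assumes horizons are sorted by profile ID and top depth, as in an SPC):

getLastHorizonID_tapply <- function(x) {
  h <- horizons(x)
  idn <- idname(x)
  # the per-profile maximum row number marks the deepest horizon,
  # given id + top depth sort order
  idx <- tapply(seq_len(nrow(h)), h[[idn]], max)
  res <- h[[hzidname(x)]][idx]
  # note: tapply orders groups by sorted group values, which may not
  # match profile_id(x) order
  names(res) <- names(idx)
  return(res)
}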

# dylans original code
getLastHorizonID_sapply <- function(x) {
  
  hztb <- horizonDepths(x)
  hzidn <- hzidname(x)
  idn <- idname(x)
  
  # basic idea: iterate over profiles, but only within horizon data
  # return deepest horizon IDs
  h <- horizons(x)
  h <- split(h, h[[idn]])
  
  # this is the safe set of horizons which can be repaired
  res <- sapply(h, function(i) {
    bottom.idx <- which.max(i[[hztb[1]]])
    res <- i[[hzidn]][bottom.idx]
    return(res)
  })
  
  ## TODO: this should never happen, including until tests are complete
  # just in case, ensure that profile order has not been compromised
  if(! all(profile_id(x) == names(res))) {
    stop('results out of order, how did this happen?', call. = FALSE)
  }
  
  # a named vector, in the same order as profile_id(x)
  return(res)
}

# my original suggested code (not actually faster!)
getLastHorizonID_DT_original <- function(x) {
  # with data.table
  hztb <- horizonDepths(x)
  hzidn <- hzidname(x)
  idn <- idname(x)
  
  # iterate over profile horizon subset data.frame; return deepest horizon IDs
  .SD <- NULL
  .N <- NULL
  h <- data.table::as.data.table(horizons(x))
  h.sub <- h[, .SD[.N,], by = list(h[[idn]])]
  res <- h.sub[[hzidn]]
  names(res) <- h.sub[[idn]]
  
  # a named vector, in the same order as profile_id(x)
  return(res)
}

# stephen suggestion #1
getLastHorizonID_aggregate1 <- function(spc, order = FALSE) {
  
  hztb  <- horizonDepths(spc)
  hzidn <- hzidname(spc)
  id    <- idname(spc)
  
  h <- horizons(spc)
  
  if (order == TRUE) {
    h <- h[order(h[[id]], h[[hztb[1]]], h[[hztb[2]]]), ]
  }
  
  # option 1
  maxdep <- aggregate(x   = h[hztb], 
                      by  = list(id = h[[id]]),
                      FUN = function(x) length(x)#,
                      #                      na.action = na.omit
  )
  idx_h <- cumsum(maxdep[[hztb[2]]])
  option_1 <- h[idx_h, hzidn]
  
  # add names
  names(option_1) <- profile_id(spc)
  
  return(option_1)
  
}

# stephen suggestion #2
getLastHorizonID_aggregate2 <- function(spc, order = FALSE) {
  
  hztb  <- horizonDepths(spc)
  hzidn <- hzidname(spc)
  id    <- idname(spc)
  
  h <- horizons(spc)
  
  if (order == TRUE) {
    h <- h[order(h[[id]], h[[hztb[1]]], h[[hztb[2]]]), ]
  }
  
  # option 2
  h$rn <- 1:nrow(h)
  maxdep <- aggregate(x   = h[["rn"]], 
                      by  = list(id = h[[id]]),
                      FUN = function(x) x[which.max(x)] #,
                      #                      na.action = na.omit
  )
  option_2 <- h[maxdep$x, hzidn]
  names(option_2) <- profile_id(spc)
  return(option_2)
}

# new data.table code
getLastHorizonID_DT_new <- function(x) {
  # with data.table
  hztb <- horizonDepths(x)
  hzidn <- hzidname(x)
  idn <- idname(x)
  
  # iterate over profile horizon subset data.frame; return deepest horizon IDs
  .SD <- NULL
  .N <- NULL
  
  h <- data.table::as.data.table(horizons(x))
  
  res <- h[, .I[which.max(.SD[[hztb[2]]])], 
           by = list(h[[idn]]), 
           .SDcols = c(hztb[2])]$V1
  res <- hzID(x)[res]
  names(res) <- profile_id(x)
  
  # a named vector, in the same order as profile_id(x)
  return(res)
}

# benchmark with sp4
library(aqp, warn.conflicts=FALSE)
#> This is aqp 1.27

data(sp4)
depths(sp4) <- id ~ top + bottom

bench::mark(
  getLastHorizonID_sapply(sp4),
  getLastHorizonID_DT_original(sp4),
  getLastHorizonID_DT_new(sp4),
  getLastHorizonID_aggregate1(sp4),
  getLastHorizonID_aggregate2(sp4)
)
#> # A tibble: 5 x 6
#>   expression                             min   median `itr/sec` mem_alloc
#>   <bch:expr>                        <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 getLastHorizonID_sapply(sp4)        1.89ms   2.02ms      492.        NA
#> 2 getLastHorizonID_DT_original(sp4)   2.07ms   2.39ms      418.        NA
#> 3 getLastHorizonID_DT_new(sp4)      909.07µs   1.01ms      980.        NA
#> 4 getLastHorizonID_aggregate1(sp4)  945.92µs   1.05ms      939.        NA
#> 5 getLastHorizonID_aggregate2(sp4)  867.38µs 956.44µs      999.        NA
#> # … with 1 more variable: `gc/sec` <dbl>

Created on 2021-02-14 by the reprex package (v0.3.0)

@smroecker
Member

Below are the results from my snapshot, reusing your above example. Your new DT example is the fastest, by a 10th of a second. I excluded Dylan's example because I got tired of waiting (split does not appear to scale well). I'm sure there are some use cases where DT will be far and away the fastest, but in general my preference would be to defer to base when the results are comparable. Interestingly, the aggregate2 option uses 0.5GB less memory than the DT_new example.

> bench::mark(
+   # getLastHorizonID_sapply(spc),
+   getLastHorizonID_DT_original(spc),
+   getLastHorizonID_DT_new(spc),
+   getLastHorizonID_aggregate1(spc),
+   getLastHorizonID_aggregate2(spc),
+   max_iterations = 10
+ )
# A tibble: 4 x 13
  expression                          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                        <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 getLastHorizonID_DT_original(spc)    1m     1m    0.0166   10.91GB    0.282     1    17         1m
2 getLastHorizonID_DT_new(spc)      4.36s  4.36s    0.229     1.49GB    0.229     1     1      4.36s
3 getLastHorizonID_aggregate1(spc)  6.41s  6.41s    0.156     1.02GB    0.156     1     1      6.41s
4 getLastHorizonID_aggregate2(spc)   4.5s   4.5s    0.222   826.25MB    0         1     0       4.5s
# ... with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
> 

@brownag
Member

brownag commented Feb 14, 2021

A faster, lower-memory optimization of the data.table code can be done if one is willing to forgo .SD and the check that makes it robust to whether or not h is in id + top depth order -- which we can do if it comes from an SPC [under current rules, and assuming it is not corrupted]:

  res <- h[, .I[.N], by = c(eval(substitute(idn)))]$V1
  res <- hzID(x)[res]
  names(res) <- profile_id(x)
# A tibble: 2 x 13
  expression                         min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                       <bch> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 getLastHorizonID_DT_new(sp4)     843µs 925.86µs     1056.        NA     4.14
2 getLastHorizonID_aggregate2(sp4) 942µs   1.04ms      922.        NA     6.30
# … with 7 more variables: n_itr <int>, n_gc <dbl>, total_time <bch:tm>,
#   result <list>, memory <list>, time <list>, gc <list>

@brownag
Member

brownag commented Feb 14, 2021

And, if I am not mistaken, the by = list(h[[idn]]) syntax I suggested at the top of the thread (but not in the above post) may make it faster and use less memory still!

getLastHorizonID_DT_new3 <- function(x) {
  # with data.table
  hztb <- horizonDepths(x)
  hzidn <- hzidname(x)
  idn <- idname(x)
  
  # iterate over profile horizon subset data.frame; return deepest horizon IDs
  .SD <- NULL
  .N <- NULL
  .I <- NULL

  h <- data.table::as.data.table(horizons(x))
  
  idx <- h[, .I[.N], by = list(h[[idn]])]$V1
  res <- hzID(x)[idx]
  names(res) <- profile_id(x)
  
  # a named vector, in the same order as profile_id(x)
  return(res)
}

@dylanbeaudette
Member Author

Nice work @smroecker and @brownag, you guys pick your favorite implementation: fast and correct are my only criteria. Good thinking to keep memory footprint low. Merge when happy.

@dylanbeaudette
Member Author

Nice work @smroecker and @brownag, you guys pick your favorite implementation: fast and correct are my only criteria. Good thinking to keep memory footprint low. Merge when happy.

On second thought: I'm going to merge and then splice-in the getLastHorizonID we are happy with.

@brownag
Member

brownag commented Feb 15, 2021

Nice work @smroecker and @brownag, you guys pick your favorite implementation: fast and correct are my only criteria. Good thinking to keep memory footprint low. Merge when happy.

On second thought: I'm going to merge and then splice-in the getLastHorizonID we are happy with.

Why would we not sort that out here?

@brownag brownag self-requested a review February 15, 2021 17:05
@dylanbeaudette
Member Author

Nice work @smroecker and @brownag, you guys pick your favorite implementation: fast and correct are my only criteria. Good thinking to keep memory footprint low. Merge when happy.

On second thought: I'm going to merge and then splice-in the getLastHorizonID we are happy with.

Why would we not sort that out here?

Yeah, ok. I'm going to leave this to you and Stephen to agree on something. I put forth the first draft to get the ball rolling, and you two have improved the performance considerably.

I would like all of us to consider the following while optimizing code:

  • ensure accuracy, in the presence of NA and other corner cases
  • efficiency for small and large collections
  • cognitive overhead: how much effort will it take to understand / debug later on
  • for those cases where ultra-compact and efficient code is appropriate, please explain in comments (I find that this is esp. important for me with some of the data.table syntax)

With that in mind, pick a solution that works for the medium-term, merge, and then continue to optimize as time permits. As fun as it can be to play code golf, we have to balance that tendency with getting reliable code into master.

Thanks for all of the input and have fun finishing this PR.

@brownag
Member

brownag commented Feb 15, 2021

Those considerations sound fine to me. I would like to review the tests for these functions during the work week, but I can do that ASAP.

ensure accuracy, in the presence of NA and other corner cases

As a comment on performance and accuracy combined, I think we should in this case be relying on something like nrow()/length()/.N of horizon data within profile IDs. That way we are not affected by any specific errors that may be occurring in the top or bottom depths. If we assume or enforce id + top depth sort, as is customary for an SPC, then I think we can safely do this (and get the performance benefits thereof).
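A minimal base sketch of that row-count idea (assuming x is a SoilProfileCollection whose horizons are sorted by profile ID and top depth, so the last row of each ID block is the deepest horizon):

h <- horizons(x)
idn <- idname(x)

# cumulative horizon counts per profile give the row index of each
# profile's last horizon, without touching the depth values themselves
idx <- cumsum(rle(as.character(h[[idn]]))$lengths)
res <- h[[hzidname(x)]][idx]
names(res) <- h[[idn]][idx]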

for those cases where ultra-compact and efficient code is appropriate, please explain in comments

Sorry that I did not provide a better explanation of the specific DT keywords. Note that the only thing that changed between my original suggestion and the most recent suggestion is swapping .SD (the subset data.frame for each group) for .I, which is a vector representing the row index. This is vastly superior in terms of memory, and can be used directly with aqp::hzID() to get the hzID as needed. My intermediate solution is a bit different and is probably not always right; I think it uses bottom depth rather than just a row index. And it is also not the fastest.

With that in mind, pick a solution that works for the medium-term, merge, and then continue to optimize as time permits. As fun as it can be to play code golf, we have to balance that tendency with getting reliable code into master.

I am pretty sure that the last data.table code I provided is as performant as, if not more so than, base in this case, for both small and large collections. The combination of .I and .N in a grouped data.table is probably hard to beat.

That package performance breakdown may not be universally true, and clearly there are many ways to do the same thing. I leaned towards .SD because your original solution worked over data.frames, but as Stephen rightly observed in his suggestion, this is a simpler vector operation.

A couple points:

  • If we add support for ad hoc data.frames (rather than from horizons(SPC)), we may need some pre-ordering code for ID/top depth, like Stephen had in his suggestion.

  • Multiple NA top depths in a single profile with unknown bottom depth could be an issue, but that is a fairly intractable case. Again, open to counterexamples, and I think I want to review the tests again.

Thanks for all of the input and have fun finishing this PR.

Thanks for putting all of these issues together.

@brownag
Member

brownag commented Feb 15, 2021

I updated the function that I suggested (and that was added to this branch by Dylan) with my latest suggestion using .I[.N], since the branch was still reflecting the slower option using .SD. It passes current tests.

If @smroecker is OK with it, we can merge this at @dylanbeaudette's convenience and discuss the merits of base implementations of various methods in #157 or a new issue. In #201 I highlight some related concepts that probably warrant discussion there.

Upon reflection, that discussion would be easier with all the latest code in master.

We probably cannot settle the base vs. data.table issue in this PR; it is pretty philosophical. Despite being in support of the long-term goals of data.table conversion, I for one can see the value of retaining base behavior for nearly every aqp feature. I know that we do not prefer branching of base and DT options, but I still think there is value in preserving no-compiled-code behavior of key functions, even if just for posterity/backwards compatibility.

My original point was to just consider data.table, not really to get into a benchmarking war that goes to the core of the SPC... or to dredge up my unresolved old data.table issues. I know @dylanbeaudette said he wanted to use this new pair of functions, and the original probably worked fine for your case.

@smroecker
Member

The new results are impressive. I support adding Andrew's new version. I will have to become more familiar with the data.table syntax moving forward.

> bench::mark(
+   # getLastHorizonID_sapply(spc),
+   getLastHorizonID_DT_original(spc),
+   getLastHorizonID_DT_new3(spc),
+   getLastHorizonID_aggregate1(spc),
+   getLastHorizonID_aggregate2(spc),
+   max_iterations = 10
+ )
# A tibble: 4 x 13
  expression                             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>                        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 getLastHorizonID_DT_original(spc)    1.02m    1.02m    0.0163   10.92GB    0.391     1    24
2 getLastHorizonID_DT_new3(spc)     915.04ms 915.04ms    1.09      1.25GB    0         1     0
3 getLastHorizonID_aggregate1(spc)     6.46s    6.46s    0.155     1.02GB    0.155     1     1
4 getLastHorizonID_aggregate2(spc)      4.4s     4.4s    0.227   826.25MB    0         1     0
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
#   gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 

@brownag
Member

brownag commented Feb 15, 2021

It is cool to see that come in below 1 second!

I think we could have the base implementation Stephen suggested incorporated as part of a new test of the data.table-powered function -- as a sort of formal way of recognizing equality, more so than out of a need for that test per se. We never really know for sure what may happen to data.table syntax in the future, for instance.

I.e., have the function powered by data.table do its thing on a sample SPC, then do it independently with base on horizons(SPC), and compare with testthat::expect_equal.
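Roughly, using the helper functions defined earlier in this thread (a sketch; the exported function names in aqp may differ):

library(aqp)
library(testthat)

data(sp4)
depths(sp4) <- id ~ top + bottom

# data.table-powered result vs. an independent base R result
res.dt   <- getLastHorizonID_DT_new3(sp4)
res.base <- getLastHorizonID_aggregate2(sp4)

expect_equal(res.dt, res.base)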

@dylanbeaudette
Member Author

I'll be sure to copy important notes from this PR into a more visible location. The trade-offs in development time / complexity will be important considerations on into the future.

@brownag
Member

brownag commented Feb 17, 2021

I added a case to the misc/sandbox/find-fix-NA-depths.R file for an edge case with multiple NA top depths.

I am not sure what the expected/desired behavior is so I have not yet written any new test. The current behavior (not doing anything) may be what is desired, so we could simply write a test that acknowledges that.

I provide an example where in a source data.frame two top depths in a profile are NA, but the bottom depths are present.

This is unlikely to occur (at least from our common data sources), but it illustrates the basic concept that ID + top depth sorting for the SPC will not adequately account for multiple NA in top depth -- at least for purposes of "fixes" based on "last horizon".

In my mind, the NA top-and-bottom case could probably be handled, at least in the rudimentary "clean the data" sense, by filtering to omit horizons with no depth reference at all.

I think NA in top depth would fall under the purview of repairMissingHzDepths -- it is a missing depth -- but actually "fixing" it is more laden with assumptions and potential problems. What do you think?
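A minimal sketch of that "clean the data" filter (assuming a source horizon data.frame h and hztb <- horizonDepths(x) holding the top/bottom column names; note the caveat about "filled" profiles in the follow-up comment below):

# drop horizon records with no depth reference at all
# (both top and bottom depth are NA)
no.depth <- is.na(h[[hztb[1]]]) & is.na(h[[hztb[2]]])
h.clean <- h[!no.depth, ]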

library(aqp, warn.conflicts=FALSE)

# extra bad case -- where two top depths are NA cannot know which is the "bottom most"
data(sp4)
x2 <- sp4
x2$top[3:4] <- NA
x2$bottom[3:4] <- x2$bottom[4:3]

# we get a warning on build -- top depths with NA are extra bad news
depths(x2) <- id ~ top + bottom
#> Warning: Horizon top depths contain NA! Check depth logic with
#> aqp::checkHzDepthLogic()

# this is actually technically possible, tho unlikely as bottom depth is ignored by SPC sort

# invalid depth logic in profile 1 (colusa)
checkHzDepthLogic(x2)
#>                id valid depthLogic sameDepth missingDepth overlapOrGap
#> 1          colusa FALSE       TRUE     FALSE         TRUE        FALSE
#> 2           glenn  TRUE      FALSE     FALSE        FALSE        FALSE
#> 3           kings  TRUE      FALSE     FALSE        FALSE        FALSE
#> 4        mariposa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 5       mendocino  TRUE      FALSE     FALSE        FALSE        FALSE
#> 6            napa  TRUE      FALSE     FALSE        FALSE        FALSE
#> 7      san benito  TRUE      FALSE     FALSE        FALSE        FALSE
#> 8          shasta  TRUE      FALSE     FALSE        FALSE        FALSE
#> 9  shasta-trinity  TRUE      FALSE     FALSE        FALSE        FALSE
#> 10         tehama  TRUE      FALSE     FALSE        FALSE        FALSE

# plotSPC 
#  NB: shows the horizons even though they have no top depth
plotSPC(x2)

# inspect
horizons(x2[1,])
#>       id name top bottom   K   Mg  Ca CEC_7 ex_Ca_to_Mg sand silt clay   CF
#> 1 colusa    A   0      3 0.3 25.7 9.0  23.0        0.35   46   33   21 0.12
#> 2 colusa  ABt   3      8 0.2 23.7 5.6  21.4        0.23   42   31   27 0.27
#> 3 colusa  Bt1  NA     42 0.1 23.2 1.9  23.7        0.08   40   28   32 0.27
#> 4 colusa  Bt2  NA     30 0.1 44.3 0.3  43.0        0.01   27   18   55 0.16
#>   hzID
#> 1    1
#> 2    2
#> 3    3
#> 4    4

# do the repair
z2 <- repairMissingHzDepths(x2)
plot(z2)

# result is unaltered -- is this what we expect? / can we _do_ anything w/ NA top depths?
# in the absence of top depth, order cannot be guaranteed
# the most common sources of NA top depths will also have bottom be NA -- e.g. empty NASIS record
horizons(z2[1,])
#>       id name top bottom   K   Mg  Ca CEC_7 ex_Ca_to_Mg sand silt clay   CF
#> 1 colusa    A   0      3 0.3 25.7 9.0  23.0        0.35   46   33   21 0.12
#> 2 colusa  ABt   3      8 0.2 23.7 5.6  21.4        0.23   42   31   27 0.27
#> 3 colusa  Bt1  NA     42 0.1 23.2 1.9  23.7        0.08   40   28   32 0.27
#> 4 colusa  Bt2  NA     30 0.1 44.3 0.3  43.0        0.01   27   18   55 0.16
#>   hzID .repaired
#> 1    1     FALSE
#> 2    2     FALSE
#> 3    3     FALSE
#> 4    4     FALSE

@brownag
Member

brownag commented Feb 17, 2021

And one point on the all NA case -- those horizons could not be omitted for "filled" profiles that have no other actual horizon data.

Member

@brownag brownag left a comment

I leave it up to you all to decide whether the effects of NA top depths on sorting and repairing missing depths in general are under the purview of this PR.

@dylanbeaudette
Member Author

I leave it up to you all to decide whether the effects of NA top depths on sorting and repairing missing depths in general are under the purview of this PR.

That is something left to another issue. I think that the "missing bottom depth of the deepest horizon" case is a very specific example of missing horizon depths that we can actually do something (reasonable) about. Missing top depths are the realm of corrupt data, and "fixing" them becomes much more involved.

@dylanbeaudette dylanbeaudette merged commit b3b9667 into master Feb 18, 2021
@dylanbeaudette
Member Author

And again--thanks for all of the hard work on this one.

brownag added a commit that referenced this pull request Feb 24, 2021
* Use data.table for all j-index subsetting #157

* Significant optimization for j-index subsetting of SPC

* Add more competitive version of base function for j-index

* spc_j_base_2: 100k benchmark ~0.1s faster than original, nearly 3x slower than new DT

* spc_j_base_3: more competitive for small SPCs

* add demo of SPC non-standard eval "k-index" .LAST keyword #199 #201

* add experimental support for .FIRST, .HZID and multiple special keywords

* better logic for i+j+k

* Convert demo to tests

* Docs
@brownag brownag deleted the na-depth-fixes branch March 23, 2021 04:56