Phase-in of data.table support in SPC objects and functions returning SPC objects #157
I will dump out a bunch of ideas, work that has been done so far, and some of the mysteries I have yet to figure out.

Extending and testing the SoilProfileCollection object

Most S4 methods that are fundamental to the SPC class and its basic operations (including subsetting, slot access, joins, etc.) now respect the three different data.frame-like classes (data.frame, tbl_df, data.table). It is a bit difficult to draw the conceptual line where "basic operations" stop and where "complicated" operations start... but here is an attempt:
To my knowledge there are three that use the constructor:

Methods have the ability to:
There are several methods out there that do not have a method signature ensuring an SPC input. Also, there are several methods with an option to return SPCs that do not currently respect these features (they return data.frame SPCs no matter what the input). The latter methods only have a mechanism to "respect" the aqp_df_class if they receive an SPC as input (i.e. they need the metadata in addition to the data) OR if the class is given as an argument / calculated internally from something else.

data.table and tbl_df awareness

Most of the work of data.table and tbl_df inheritance is done automatically by defining their "old classes" using setOldClass(). There are two special internal aqp methods for working with slots that may contain data.frame-like classes:
- .as.data.frame.aqp -- convert a data.frame-like object to the target class
- .data.frame.j -- subset a data.frame-like object using a vector of column names in the j-index
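As an illustration of the idea behind the first helper (a stand-in only, not the actual aqp internals), a class-targeted conversion could look roughly like this:

```r
library(data.table)
library(tibble)

# illustrative stand-in for the idea behind .as.data.frame.aqp();
# not the actual aqp implementation
convert_df_class <- function(x, as.class = c("data.frame", "tbl_df", "data.table")) {
  as.class <- match.arg(as.class)
  switch(as.class,
         data.frame = as.data.frame(x),
         tbl_df     = tibble::as_tibble(x),
         data.table = data.table::as.data.table(x))
}

h <- data.frame(id = c("A", "A", "B"), top = c(0, 10, 0), bottom = c(10, 30, 25))
class(convert_df_class(h, "data.table"))  # "data.table" "data.frame"
```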
Working with data.table j-index subsetting in an unambiguous way requires rewriting the usual data.frame-style column selection (a sketch of the typical rewrite is below).
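A minimal sketch of that rewrite, assuming a character vector of column names (toy data, not aqp internals):

```r
library(data.table)

d  <- data.frame(id = 1:3, top = c(0, 10, 25), bottom = c(10, 25, 50))
dt <- as.data.table(d)
cols <- c("id", "top")

d[, cols]                  # data.frame: selects the named columns

# in data.table, j is evaluated as an expression, so dt[, cols] returns the
# character vector itself rather than those columns; unambiguous forms are:
dt[, .SD, .SDcols = cols]  # subset of columns, result is a data.table
dt[, ..cols]               # ".." prefix: look up cols in the calling scope
dt[, cols, with = FALSE]   # older equivalent
```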
Anywhere a NUMERIC j-index is used should be refactored to use column names (which are much less ambiguous / prone to error). This is generally good practice even for pure data.frames, IMO, and should be easy (index + names attribute).

Where appropriate we should consider defining functions that take an SPC object as input as S4 methods with an SPC method signature. I also find that methods that return SPCs are really handy. I don't think we should overdo it with wrapper methods, but I also think that the point of S4 is, at least in part, to dispatch methods for our special class and make sure things are handled safely (and ideally easily for the user). So if there is a major benefit to getting a particular aqp function's output into an SPC, let's define that method in aqp rather than leaving the user to figure it out.

Wrappers for SPC object

The gist below shows a basic wrapper: https://gist.github.com/brownag/e57cb66e32ea0842b7e507a18e228906 The wrapper method could be dispatched via S4.

On the fly

However, there is a ton of heavy lifting that goes on inside. It's not clear how much more effort should be put into retaining historic data.frame implementations where data.table is obviously better or dramatically simplifies the code. I have no good suggestions at this point. The examples I have done so far have been relatively simple to retain old data.frame functionality. Even without internal enhancements, there is value in having the wrapper method from the above gist around it to handle merging of the result back into the SPC for the user and adding horizon/site IDs.

Keys

For some operations it is plainly evident that it is worth applying keys -- even on the fly. I think there is nothing wrong with setting a key on the fly for expensive operations. I am cautious about more broad handling of keys within SPC methods. In contrast, I was very surprised to see how badly the data.table code in min/max performed, with an 8x difference between keyed and unkeyed. I wondered whether I messed something up in the DT syntax, making it way more costly. Clearly me "doing it wrong" is the most likely explanation.
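A minimal sketch of the "keys on the fly" idea, using hypothetical columns rather than the actual aqp min/max internals:

```r
library(data.table)

# a large, hypothetical horizon-like table
n <- 1e6
h <- data.table(id  = sample(sprintf("P%05d", 1:5e4), n, replace = TRUE),
                top = runif(n, 0, 200))

# unkeyed grouped aggregation already uses data.table's fast grouping
res1 <- h[, .(min_top = min(top)), by = id]

# setting a key on the fly sorts the table by id (by reference); repeated
# grouped operations and joins on the keyed column can then be cheaper
setkey(h, id)
res2 <- h[, .(min_top = min(top)), by = id]

# res1 and res2 contain the same groups; keying mainly pays off when the
# sorted order is reused across several operations
```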
I still need to experiment a lot more with this -- at this point I was elated to have something that worked, let alone be several fold faster across the board. Figuring out the right "data.table" way to do things will be needed for unlocking the final gains in performance and solving these mysteries. Before I started this I knew next to nothing about data.table.
Forgot to tag this issue in recent commits adding the data.table import. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html
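For reference, the pattern that vignette describes looks roughly like this (an illustrative excerpt, not the exact aqp DESCRIPTION/NAMESPACE):

```
# DESCRIPTION (excerpt)
Imports:
    data.table

# NAMESPACE (excerpt; typically generated from roxygen2 @importFrom tags)
importFrom(data.table, as.data.table)
importFrom(data.table, setkey)
```

If data.table were only suggested rather than imported, the vignette instead describes setting .datatable.aware = TRUE in the package source.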
EDIT: @brownag moved checkboxes to OP so that the progress tracking shows up in the issue
* Use data.table for all j-index subsetting #157
* Significant optimization for j-index subsetting of SPC
* Add more competitive version of base function for j-index
* spc_j_base_2: 100k benchmark ~0.1s faster than original, nearly 3x slower than new DT
* spc_j_base_3: more competitive for small SPCs
* add demo of SPC non-standard eval "k-index" .LAST keyword #199 #201
* add experimental support for .FIRST, .HZID and multiple special keywords
* better logic for i+j+k
* Convert demo to tests
* Docs
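For context, a rough sketch of the SPC i/j subsetting these commits refer to, using the sp4 sample data from aqp; the commented k-index keyword calls are based on the commit descriptions above and the exact syntax/behavior may vary by aqp version:

```r
library(aqp)

# promote a sample data.frame to a SoilProfileCollection
data(sp4)
depths(sp4) <- id ~ top + bottom

# i-index selects profiles, j-index selects horizons within each profile
sp4[1:2, ]  # first two profiles
sp4[, 1]    # first horizon of every profile

# "k-index" keywords referenced above (assumed syntax, check your aqp version):
# sp4[, , .LAST]  # last horizon of every profile
# sp4[, , .HZID]  # horizon row indices rather than an SPC
```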
Keeping track of changes / testing related to data.table replacement of plyr / reshape functionality. Each change should be checked and a new test added if there isn't sufficient coverage. See related issues in soilDB and sharpshootR.

- plyr join → base::merge(..., incomparables = NA, sort = FALSE) or data.table join
  - SoilProfileCollection-coercion
  - SoilProfileCollection-slice-methods (deprecated, in favor of dice)
  - munsell2rgb
- ddply
  - SoilProfileCollection-setters
  - aggregateColor
  - aggregateSoilDepth
  - getmlhz
  - missingDataGrid
- llply
  - profile_compare
- ldply
  - SoilProfileCollection-slice-methods (deprecated, in favor of dice)
  - aggregateColor
  - soilColorSignature
- dlply
  - aggregateColor
  - profile_compare (deprecated, in favor of NCSP)
- reshape cast → data.table::dcast (a sketch of the argument change follows this list)
  - evalGenHz
  - soilColorSignature
  - Note that the value argument becomes value.var.
- melt → data.table::melt
  - plotColorQuantiles
  - evalGenHz
  - slab
  - slab examples will need reshape package alternatives
  - soilColorSignature
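As referenced in the cast item above, a minimal sketch of the cast → dcast argument change, using made-up columns rather than the actual evalGenHz / soilColorSignature code:

```r
library(data.table)

x <- data.table(id       = c("A", "A", "B", "B"),
                variable = c("clay", "sand", "clay", "sand"),
                value    = c(12, 55, 10, 60))

# reshape::cast(x, id ~ variable, value = "value")  # old idiom: value argument
dcast(x, id ~ variable, value.var = "value")        # data.table: value.var argument
```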
If the object is not a data.table then perform an on-the-fly conversion to/from. This works in plotColorQuantiles. However, a similar conversion in evalGenHz produces a message, which is not a problem and can be suppressed for now. I've added a TODO item for converting all affected columns to numeric.
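A minimal sketch of the to/from conversion pattern, with made-up data rather than the actual plotColorQuantiles / evalGenHz internals:

```r
library(data.table)

# hypothetical horizon-level data pulled out of an SPC
h <- data.frame(id    = c("A", "A", "B"),
                genhz = c("A", "Bt", "A"),
                clay  = c(12, 25, 10),
                sand  = c(55, 30, 60))

# convert on the fly, reshape with data.table::melt(), convert back so the
# caller never sees a data.table
h.dt <- as.data.table(h)
m <- data.table::melt(h.dt, id.vars = c("id", "genhz"),
                      measure.vars = c("clay", "sand"))
m <- as.data.frame(m)
```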
Caveats / Notes
- data.table::melt will not preserve factor type / levels unless asked to do so explicitly via the value.factor argument. A reasonable solution is to set it after the reshape operation when appropriate.
- id.vars should not contain duplicated column names.
- Use eval to dereference variables inside of data.table[ , ... ].
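Two of those caveats in sketch form (toy data, not aqp code): keeping factors through melt() via value.factor, and dereferencing a column name stored in a variable inside [ , j ]:

```r
library(data.table)

x <- data.table(id     = c("A", "A", "B"),
                hzname = factor(c("A", "Bt", "A"), levels = c("A", "Bt", "C")),
                clay   = c(12, 25, 10))

# value.factor = TRUE keeps the molten value column as a factor rather than
# coercing it to character
m <- melt(x, id.vars = "id", measure.vars = "hzname", value.factor = TRUE)

# dereferencing a column name held in a variable inside [ , j ]:
v <- "clay"
x[, mean(get(v))]            # get() look-up
x[, mean(eval(as.name(v)))]  # eval() of a symbol, as mentioned above
```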
I'm fine with the current trajectory as long as we don't (yet) guarantee class-in == class-out for all functions in aqp.

A Plan
- data.frame, tbl_df, data.table end-to-end support for functions using / returning SoilProfileCollection objects
- Add data.table usage (results are still data.frame) to remaining functions where it could give a significant performance gain (slice, slab, munsell2rgb, etc.)
- SoilProfileCollection objects will respect input table class

Related issues
- data.table re-write of slice() #115