-
Notifications
You must be signed in to change notification settings - Fork 235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the down_scale function to pecan. #3211
Conversation
##' @author Joshua Ploshay | ||
##' | ||
##' @param data In quotes, file path for .rds containing ensemble data. | ||
##' @param focus_year In quotes, if SDA site run, format is yyyy/mm/dd, if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'data'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have this passed as a Data rather than a string? Also, if you have to provide month and day (which makes sense given potential future applications) then why is the variable called focus_year instead of focus_date or just date? Also, if you are working with annual data do you need to know the exact date in the product or just the year (e.g. 2020-01-01 vs 2020-07-31)
##' @param data In quotes, file path for .rds containing ensemble data. | ||
##' @param focus_year In quotes, if SDA site run, format is yyyy/mm/dd, if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'data'. | ||
##' @param C_pool In quotes, carbon pool of interest. Name must match carbon pool name found within file supplied to 'data'. | ||
##' @param covariates: In quotes, file path of SpatRaster stack, used as predictors in randomForest. Layers within stack should be named. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are the scripts for downloading these layers and processing them into a stack? Those should be part of the repo too (e.g. in the inst folder), and thus part of the PR, and your function documentation should point users to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if you organized all the covariates into a single stack, why not have the user pass in the stack rather than a file path?
##' @return It returns the `downscale_output` list containing lists for the training and testing data sets, models, and predicted maps for each ensemble member. | ||
|
||
|
||
NA_downscale <- function(data, cords, covariates, focus_year, C_pool){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally best to document variables in the order they are used in the function
|
||
# Extract the carbon data for the specified focus year | ||
index <- which(names(input_data) == focus_year) | ||
data <- input_data[[index]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poor choice of variable names. You earlier defined data
as the file path to the SDA object. Now, you're redefining it as a temporal subset of that SDA object. One of these two uses needs to change.
# Rename the training and testing data frames for each ensemble member | ||
for (i in 1:length(ensembles)) { | ||
# names(ensembles) <- paste0("ensemble",seq(1:length(ensembles))) | ||
names(ensembles[[i]]) <- c("training", "testing") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you should have been able to do this during the split rather than after.
output <- list() | ||
for (i in 1:length(ensembles)) { | ||
output[[i]] <- randomForest::randomForest(ensembles[[i]][[1]][["carbon_data"]] ~ land_cover+tavg+prec+srad+vapr+nitrogen+phh2o+soc+sand, | ||
data = ensembles[[i]][[1]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this line and above, what is the second dimension of ensembles and why is it being hard coded to [[1]]? Could you reference this by name instead (e.g. ensembles[[i]][["train"]])? Also, if you are specifying the data
argument, why is the y in the randomForest model ensembles[[i]][[1]][["carbon_data"]]
instead of just carbon_data
importance = T) | ||
} | ||
|
||
# Generate predictions (maps) for each ensemble member using the trained models |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like using the 'test' data to validate the models should come before making spatial predictions
|
||
# Train a random forest model for each ensemble member using the training data | ||
output <- list() | ||
for (i in 1:length(ensembles)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how long this step takes, you might consider doing this step in parallel
|
||
# Generate predictions (maps) for each ensemble member using the trained models | ||
maps <- list(ncol(output)) | ||
for (i in 1:length(output)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how long this step takes, you might consider doing this step in parallel
Build is failing because of the following:
I think you should modify the code to eliminate the readr dependency and then add randomForest and terra under "Suggests" in the assim.sequential description file. You will probably also need to run scripts/generate_dependencies.R and commit the updated docker/depends/pecan.depends.R |
@JoshuaPloshay just wanted to re-ping you on this before your semester gets busy |
Merge branch 'develop' of https://github.com/PecanProject/pecan into develop # Conflicts: # docker/depends/pecan.depends.R
hi @JoshuaPloshay , I was looking into this pr and was wondering if you could share where and how the data and cords variables are being generated or fetched from ? I wanted to run the downscale_function.R but can't understand the source of the variables and parameters used in the function . |
@sambhavnoobcoder as we discussed, the covariates come from the covariates.R file, which is in the open PR #3272 What I'd asked you to do wasn't to post a vague comment here, but to go to the open PR and comment on the specific lines in covariates.R that are unclear. A number of the covariates are already well documented there and it's not fair to ask open ended questions here without first reading that code. We've also discussed from the very start of the project discussion that the initial test data should be the output from the Dokoohaki et al 2022 paper, which are posted on OSF: https://osf.io/efcv5/ |
Description
Motivation and Context
Review Time Estimate
Types of changes
Checklist: