Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the down_scale function to pecan. #3211

Merged
merged 15 commits into from
Mar 7, 2024

Conversation

JoshuaPloshay
Copy link
Collaborator

@JoshuaPloshay JoshuaPloshay commented Aug 10, 2023

Description

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

##' @author Joshua Ploshay
##'
##' @param data In quotes, file path for .rds containing ensemble data.
##' @param focus_year In quotes, if SDA site run, format is yyyy/mm/dd, if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'data'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have this passed as a Data rather than a string? Also, if you have to provide month and day (which makes sense given potential future applications) then why is the variable called focus_year instead of focus_date or just date? Also, if you are working with annual data do you need to know the exact date in the product or just the year (e.g. 2020-01-01 vs 2020-07-31)

##' @param data In quotes, file path for .rds containing ensemble data.
##' @param focus_year In quotes, if SDA site run, format is yyyy/mm/dd, if NEON, yyyy-mm-dd. Restricted to years within file supplied to 'data'.
##' @param C_pool In quotes, carbon pool of interest. Name must match carbon pool name found within file supplied to 'data'.
##' @param covariates: In quotes, file path of SpatRaster stack, used as predictors in randomForest. Layers within stack should be named.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the scripts for downloading these layers and processing them into a stack? Those should be part of the repo too (e.g. in the inst folder), and thus part of the PR, and your function documentation should point users to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you organized all the covariates into a single stack, why not have the user pass in the stack rather than a file path?

modules/assim.sequential/R/downscale_function.R Outdated Show resolved Hide resolved
##' @return It returns the `downscale_output` list containing lists for the training and testing data sets, models, and predicted maps for each ensemble member.


NA_downscale <- function(data, cords, covariates, focus_year, C_pool){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally best to document variables in the order they are used in the function


# Extract the carbon data for the specified focus year
index <- which(names(input_data) == focus_year)
data <- input_data[[index]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poor choice of variable names. You earlier defined data as the file path to the SDA object. Now, you're redefining it as a temporal subset of that SDA object. One of these two uses needs to change.

# Rename the training and testing data frames for each ensemble member
for (i in 1:length(ensembles)) {
# names(ensembles) <- paste0("ensemble",seq(1:length(ensembles)))
names(ensembles[[i]]) <- c("training", "testing")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you should have been able to do this during the split rather than after.

output <- list()
for (i in 1:length(ensembles)) {
output[[i]] <- randomForest::randomForest(ensembles[[i]][[1]][["carbon_data"]] ~ land_cover+tavg+prec+srad+vapr+nitrogen+phh2o+soc+sand,
data = ensembles[[i]][[1]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this line and above, what is the second dimension of ensembles and why is it being hard coded to [[1]]? Could you reference this by name instead (e.g. ensembles[[i]][["train"]])? Also, if you are specifying the data argument, why is the y in the randomForest model ensembles[[i]][[1]][["carbon_data"]] instead of just carbon_data

importance = T)
}

# Generate predictions (maps) for each ensemble member using the trained models
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like using the 'test' data to validate the models should come before making spatial predictions


# Train a random forest model for each ensemble member using the training data
output <- list()
for (i in 1:length(ensembles)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how long this step takes, you might consider doing this step in parallel


# Generate predictions (maps) for each ensemble member using the trained models
maps <- list(ncol(output))
for (i in 1:length(output)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how long this step takes, you might consider doing this step in parallel

@mdietze
Copy link
Member

mdietze commented Aug 11, 2023

Build is failing because of the following:

R check of modules/assim.sequential reports the following new problems. Please fix these and resubmit:
  checking dependencies in R code ... WARNING
  '::' or ':::' imports not declared from:
    ‘randomForest’ ‘readr’ ‘terra’

I think you should modify the code to eliminate the readr dependency and then add randomForest and terra under "Suggests" in the assim.sequential description file. You will probably also need to run scripts/generate_dependencies.R and commit the updated docker/depends/pecan.depends.R

@mdietze
Copy link
Member

mdietze commented Sep 3, 2023

@JoshuaPloshay just wanted to re-ping you on this before your semester gets busy

modules/assim.sequential/R/downscale_function.R Outdated Show resolved Hide resolved
modules/assim.sequential/man/NA_downscale.Rd Outdated Show resolved Hide resolved
modules/assim.sequential/man/NA_downscale.Rd Outdated Show resolved Hide resolved
@mdietze mdietze added this pull request to the merge queue Mar 7, 2024
Merged via the queue into PecanProject:develop with commit 4564337 Mar 7, 2024
12 checks passed
@sambhavnoobcoder
Copy link
Contributor

hi @JoshuaPloshay , I was looking into this pr and was wondering if you could share where and how the data and cords variables are being generated or fetched from ? I wanted to run the downscale_function.R but can't understand the source of the variables and parameters used in the function .

@mdietze
Copy link
Member

mdietze commented May 21, 2024

@sambhavnoobcoder as we discussed, the covariates come from the covariates.R file, which is in the open PR #3272 What I'd asked you to do wasn't to post a vague comment here, but to go to the open PR and comment on the specific lines in covariates.R that are unclear. A number of the covariates are already well documented there and it's not fair to ask open ended questions here without first reading that code. We've also discussed from the very start of the project discussion that the initial test data should be the output from the Dokoohaki et al 2022 paper, which are posted on OSF: https://osf.io/efcv5/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants