Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce path and deprecate hub_con and config_tasks as args to submission_tmpl #196

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

annakrystalli
Copy link
Member

@annakrystalli annakrystalli commented Jan 29, 2025

This PR resolves #165 and in the process also resolves #137

The PR introduces hub_path as main argument to submission_tmpl() and deprecates arguments hub_con and config_tasks (#165 & #137). This way, all that is required by the user to create a submission template is the path to a hub directory.

The change is currently back-compatible as the old arguments still work while issuing a deprecation warning. The arguments will be fully removed in a later version (see #197).

Tests were also updated to work with hub_paths while some tests to ensure current back-compatibility were retained. Note that this change requires a fully configured hub to work, but tests that are based on standalone config files have been accomodated by using local mocking of the read_config() function now used internally to read config files.

@annakrystalli annakrystalli requested a review from zkamvar January 29, 2025 15:27
Copy link

github-actions bot commented Jan 29, 2025

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I think the ergonomics of this function are much better with a single argument. It's a much simpler solution than my suggestion from #137. Your method of using a function to switch out the deprecated arguments like that is a nice touch!

I do have a request for an improvement: allow the first argument to take either a hub or a config file and change the argument name to be path.

My rationale is that it improves testing and modeler ergonomics:

tests that are based on standalone config files have been accommodated by using local mocking of the read_config() function now used internally to read config files.

If we allow for a config file as the first argument as well, then these local mocks would not be necessary.

I've described the modeler ergonomics in my comment below.

R/submission_tmpl.R Outdated Show resolved Hide resolved
@annakrystalli
Copy link
Member Author

Good idea! Not being able to create a template from a config file was bothering me too. I was just trying to keep the interface as simple as possible. This solution is a good way round it.

I've implemented with minor adaptations. Specifically I ensure we check that path exists first before trying to determine whether it's a directory or file. I also use an fs function (which is already an import) as it's name is more self descriptive.

I've also updated the test to remove local bindings, done a bit of reorganising for clarity and added a few more specific tests that were being tested in a more high-level fashion but not clearly indicated as such.

Re:

I often hear from modelers that they do not often work directly with the hub on their machine when creating their model outputs.

This is interesting and makes me wonder, how to they ensure they have access to an up to date version of the config locally?

@annakrystalli annakrystalli requested a review from zkamvar January 31, 2025 15:51
Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates! I have two final suggestions:

  1. update the NEWS item
  2. add a catch so that non-json files throw an informative error

I also added some comments when I was tracing the changes in the snapshot outputs.

To answer the question you asked about modelers:

I often hear from modelers that they do not often work directly with the hub on their machine when creating their model outputs.

This is interesting and makes me wonder, how to they ensure they have access to an up to date version of the config locally?

That's a good question, if they have an internet connection, they could do something like this and get the file since jsonlite::fromJSON() understands URLs

cfg <- "https://raw.githubusercontent.com/reichlab/variant-nowcast-hub/refs/heads/main/hub-config/tasks.json"
submission_tmpl(path = cfg, ...)

NEWS.md Outdated
@@ -1,5 +1,7 @@
# hubValidations (development version)

* Introduced `hub_path` as main argument to `submission_tmpl()` and deprecated arguments `hub_con` and `config_tasks` (#165 & #137). This way, all that is required by the user to create a submission template is the path to a hub directory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Introduced `hub_path` as main argument to `submission_tmpl()` and deprecated arguments `hub_con` and `config_tasks` (#165 & #137). This way, all that is required by the user to create a submission template is the path to a hub directory.
* Introduced `path` as main argument to `submission_tmpl()` and deprecated arguments `hub_con` and `config_tasks` (#165 & #137). This way, all that is required by the user to create a submission template is the path to a hub directory or the config file.

R/submission_tmpl.R Outdated Show resolved Hide resolved
Comment on lines -166 to +183
7 2022-12-26 wk ahead inc~ 2 US sample 1 NA
8 2022-12-26 wk ahead inc~ 2 01 sample 1 NA
9 2022-12-26 wk ahead inc~ 2 02 sample 1 NA
10 2022-12-26 wk ahead inc~ 1 US sample 2 NA
7 2022-12-26 wk ahead inc~ 2 US sample s1 NA
8 2022-12-26 wk ahead inc~ 1 US sample s2 NA
9 2022-12-26 wk ahead inc~ 2 01 sample s3 NA
10 2022-12-26 wk ahead inc~ 1 01 sample s4 NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially confused why this section of the output had changed, but I now realize that it's because this snapshot test was originally in the section above where config_tasks was from a hub with a compound task ID structure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the checks have been re-organised, some removed hence some new snapshots have been made

Comment on lines -204 to +250
8 2022-12-26 wk ahead inc~ 1 US sample 1 NA
9 2022-12-26 wk ahead inc~ 2 01 sample 1 NA
10 2022-12-26 wk ahead inc~ 1 01 sample 1 NA
8 2022-12-26 wk ahead inc~ 2 01 sample 1 NA
9 2022-12-26 wk ahead inc~ 2 02 sample 1 NA
10 2022-12-26 wk ahead inc~ 1 US sample 2 NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as above, I believe this is an artifact of this check belonging to an old value for the config.

Comment on lines +306 to +308
8 2022-12-26 wk ahead inc~ 1 US sample 1 NA
9 2022-12-26 wk ahead inc~ 2 01 sample 1 NA
10 2022-12-26 wk ahead inc~ 1 01 sample 1 NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as above

Comment on lines +327 to +329
8 2022-12-26 wk ahead inc~ 1 US sample 2 NA
9 2022-12-26 wk ahead inc~ 2 01 sample 3 NA
10 2022-12-26 wk ahead inc~ 1 01 sample 4 NA
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the end of the modifications to the expected outputs

@annakrystalli annakrystalli changed the title Introduce hub_path and deprecate hub_con and config_tasks as args to submission_tmpl Introduce path and deprecate hub_con and config_tasks as args to submission_tmpl Feb 4, 2025
@annakrystalli
Copy link
Member Author

That's a good question, if they have an internet connection, they could do something like this and get the file since jsonlite::fromJSON() understands URLs

cfg <- "https://raw.githubusercontent.com/reichlab/variant-nowcast-hub/refs/heads/main/hub-config/tasks.json"
submission_tmpl(path = cfg, ...)

This won't work unfortunately with the current setup:

library(hubValidations)
path <- "https://raw.githubusercontent.com/reichlab/variant-nowcast-hub/refs/heads/main/hub-config/tasks.json"
submission_tmpl(
  path,
  round_id = "2025-01-25"
  )
#> Error in `submission_tmpl()`:
#> ! `path`
#>   'https://raw.githubusercontent.com/reichlab/variant-nowcast-hub/refs/heads/main/hub-config/tasks.json'
#>   does not exist.

Created on 2025-02-04 with reprex v2.1.0

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work unfortunately with the current setup:

Oh yeah, that's a good point. The fs::file_exists() catches that and it doesn't know about URLs :/

@annakrystalli
Copy link
Member Author

Just a heads up @zkamvar

Given the changes proposed in hubverse-org/hubUtils#210 which will solve a couple of the outstanding issues here, I'm going to wait till we merge that, then adapt and re-request review. 🚀

@annakrystalli annakrystalli requested a review from zkamvar February 6, 2025 14:30
@annakrystalli
Copy link
Member Author

Ready for review @zkamvar !

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I have some requested changes for the news and to simplify the control structure for the path argument.

NEWS.md Show resolved Hide resolved
Comment on lines 331 to 353
if (inherits(path, "SubTreeFileSystem")) {
if (hubUtils::is_s3_base_fs(path)) {
return(read_config(path))
}
return(read_config_file(path))
}
# Read config file from URL
if (hubUtils::is_url(path)) {
if (hubUtils::is_github_repo_url(path)) {
return(read_config(path))
}
return(read_config_file(path))
}
# Read local config
if (!fs::file_exists(path)) {
cli::cli_abort("{.arg path} {.file {path}} does not exist.",
call = rlang::caller_env(1)
)
}
if (fs::is_dir(path)) {
return(read_config(path))
}
read_config_file(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since much of the heavy lifting is done inside read_config() and read_config_file(), the main task is to determine if we have a directory or a file. This can be achieved with two indicator variables.

Suggested change
if (inherits(path, "SubTreeFileSystem")) {
if (hubUtils::is_s3_base_fs(path)) {
return(read_config(path))
}
return(read_config_file(path))
}
# Read config file from URL
if (hubUtils::is_url(path)) {
if (hubUtils::is_github_repo_url(path)) {
return(read_config(path))
}
return(read_config_file(path))
}
# Read local config
if (!fs::file_exists(path)) {
cli::cli_abort("{.arg path} {.file {path}} does not exist.",
call = rlang::caller_env(1)
)
}
if (fs::is_dir(path)) {
return(read_config(path))
}
read_config_file(path)
is_s3_dir <- inherits(path, "SubTreeFileSystem") && hubUtils::is_s3_base_fs(path)
is_local_dir <- is.character(path) && fs::path_ext(path) == ""
if (is_s3_dir || is_local_dir) {
read_config(path)
} else {
read_config_file(path)
}

Copy link
Member Author

@annakrystalli annakrystalli Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❗ This will not pass tests until hubverse-org/hubUtils#211 is merged

Nice simplification! There are however exceptions that this wasn't handling well. For example, if someone erroneously supplied "https://github.com/hubverse-org/schemas/tree/main/v5.0.0" as a path, it was identified as a file and read_config_file() was invoked. That's why I added an additional check which catches that and returns a more informative error message.

is.character(path) && fs::path_ext(path) == "" is still a bit hacky as if a URL like "https://raw.githubusercontent.com/hubverse-org/example-simple-forecast-hub/refs/heads/main/README" or "https://raw.githubusercontent.com/hubverse-org/example-simple-forecast-hub/refs/heads/main/LICENSE" is supplied, it would trigger read_config() erroneously. Unfortunately the only method I found for ensuring the correct read function is deployed requires making a request for headers and checking the Content-Type to see if it’s a file. I've refrained from adding such a function but have added a test for that exception. It would be nice to handle better at some point (now it just fails because it interprets the URL as a folder and creates a path to tasks.json that does not exists).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification! There are however exceptions that this wasn't handling well. For example, if someone erroneously supplied "https://github.com/hubverse-org/schemas/tree/main/v5.0.0" as a path, it was identified as a file and read_config_file() was invoked. That's why I added an additional check which catches that and returns a more informative error message.

is.character(path) && fs::path_ext(path) == "" is still a bit hacky as if a URL like "https://raw.githubusercontent.com/hubverse-org/example-simple-forecast-hub/refs/heads/main/README" or "https://raw.githubusercontent.com/hubverse-org/example-simple-forecast-hub/refs/heads/main/LICENSE" is supplied, it would trigger read_config() erroneously.

These are fair points and I definitely agree that the directory check is hacky, but I take solace in the hope that these situations are vanishingly rare and are easily fixed.

Copy link
Member

@zkamvar zkamvar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Reviewed/Ready to Merge
2 participants