Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ship released schemas with hubUtils #194

Merged
merged 31 commits into from
Jan 13, 2025
Merged

Ship released schemas with hubUtils #194

merged 31 commits into from
Jan 13, 2025

Conversation

zkamvar
Copy link
Member

@zkamvar zkamvar commented Nov 20, 2024

This PR enables offline schema validation by shipping the schemas directly with hubUtils.

  1. I've added a script that will download all of the released schemas and update them in the repository in data-raw/schemas.R. Schemas will live in inst/schemas/ and there is a configuration file called inst/schemas/update.json that records the source of the installed version of the schemas (see next section).
  2. I've added a catch to get_schema_valid_versions() and get_schema() that will check if the requested branch is "main" and then return the locally shipped schemas if it is, otherwise they will fall back to checking from GitHub
  3. In the case a user is working from an old version of hubUtils and the requested schema is not available for get_schema() (e.g. they attempt to validate with a newer hub), the relevant schema will be downloaded from GitHub.

This will fix #193

New Development Process

As I mentioned in point 1, data-raw/schemas.R will synchronize the schemas in this repo with the schemas repo. Based on #194 (comment), I've updated the development process so that we will no longer have to go through something like hubverse-org/hubAdmin#71.

Details are found in the CONTRIBUTING document, but in short, the steps for developing against a new schema version are:

One time step (ever)

Add the script as a git pre-push hook:

library(devtools)
use_git_hook("pre-push", proj_path("data-raw/schemas.R"))

With the git push hook installed, it will run every time you push to keep your schemas are up-to-date with the branch defined in inst/schemas/update.json.

This will live locally for this git repository and you can remove it if you do not like it by using unlink(devtools::proj_path(".git/hooks/pre-push"))

Once per new schema version

When you start work on a new dev version of the schema, you set the default branch. You can do this by either modifying inst/schemas/update.json OR you can use the HUBUTILS_SCHEMA_BRANCH environment variable (which will update that file for you). Once the branch is set in inst/schemas/update.json, the script will always check for updates on that branch.

Sys.setenv("HUBUTILS_SCHEMA_BRANCH" = "br-v4.0.1")
source("data-raw/schemas.R")

Note that the script will check against the main branch to ensure that the new branch is newer than the main branch. If this is not true, you will get an error.

After new schema release

After the new schema is released, before you can release this package, you must ensure the default branch is set back to "main", which is done in the same way you would set the
development branch.

Sys.setenv("HUBUTILS_SCHEMA_BRANCH" = "main")
source("data-raw/schemas.R")

Actual changes

Note that the 16k changes are mostly due to the schemas. Here's the actual list of content changes:

$ git diff --stat main..HEAD -- R tests data-raw .github inst/schemas/update.json
 .github/CONTRIBUTING.md                   | 226 +++++++++++++++++++++++++++++++++++++++++++
 R/read_config.R                           |   8 +-
 R/utils-get_hub.R                         |   4 +-
 R/utils-model_out_tbl.R                   |  43 +++++++++
 R/utils-schema.R                          |  41 ++++++++
 R/utils-task_ids.R                        |  17 ++++
 data-raw/schemas.R                        | 228 ++++++++++++++++++++++++++++++++++++++++++++
 inst/schemas/update.json                  |   5 +
 tests/testthat/_snaps/read_config.md      |   4 +-
 tests/testthat/test-read_config.R         |   6 +-
 tests/testthat/test-utils-get_hub.R       |   6 +-
 tests/testthat/test-utils-model_out_tbl.R |  57 +++++++++++
 tests/testthat/test-utils-schema.R        |  29 ++++++
 tests/testthat/test-utils-task_ids.R      |  19 ++++
 14 files changed, 678 insertions(+), 15 deletions(-)

This will fall back to github if versions are not found.
Copy link

github-actions bot commented Nov 20, 2024

@annakrystalli
Copy link
Member

annakrystalli commented Dec 13, 2024

sorry I haven't got round to this! It's been on my to do list and promise to have a look on Monday. I have a suggestion on how this might work but probably won't get time to get to it today.

@zkamvar
Copy link
Member Author

zkamvar commented Dec 13, 2024

sorry I haven't got round to this! It's been on my to do list and promise to have a look on Monday. I have a suggestion on how this might work but probably won't get time to get to it today.

No worries! It's not an urgent thing and it would definitely benefit from careful review.

That being said, I will likely be working on the dashboard project next week, so a quick turnaround is not necessary.

@annakrystalli
Copy link
Member

Before doing a detailed review, I wanted to run some higher level thoughts passed you.

This is a great improvement but I think we could push the utility even further.

One of the biggest difficulties throughout hubverse packages is developing against an under development schema. This would still require internet access to access schema from a branch during development and testing (hence we'd still need to add skips for not online in tests and use set tests against dev branches) and a lot of post release cleanup of tests after a schema release.

Given the GitHub version of packages is considered development, I propose we also allow the next under development schema version to be able to be included in hubUtils. This would make such versions available to the rest of the hubverse and hence development of other packages would become SO much easier and remove the requirement for post release clean up of schema related tests downstream.

To enable this I would like to modify the script to download the schema and turn it into a function that can take a branch argument as an input.

For traceability, I also think inclusion of an additional file (yml?) in the inst folder that records:

  • The branch the latest version was updated from
  • The schemas repo commit hash of the latest version
  • A timestamp of the last update

Down the line we could perhaps use such info and webhooks to automatically open PRs to update the latest version? Not needed to start off with though but good documentation for developers will be required instead.

Whatever we decide I feel we should add a section in the schemas README and also create a more detailed custom CONTRIBUTING.md here that details the workflow we land on.

.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
@zkamvar
Copy link
Member Author

zkamvar commented Dec 19, 2024

To enable this I would like to modify the script to download the schema and turn it into a function that can take a branch argument as an input.

For traceability, I also think inclusion of an additional file (yml?) in the inst folder that records:

  • The branch the latest version was updated from
  • The schemas repo commit hash of the latest version
  • A timestamp of the last update

I think this is a good idea. I did not turn it into a function, however because doing so would require us to put "usethis" as one of the Suggested packages. Since this is code that is only meant for the developers to synchronize data, I kept it in data-raw/ and have added instructions to run it via R, bash, and/or git hook.

I have implemented the additional file that records the branch, sha, and timestamp as a JSON file because we already have the tools to parse and write JSON.

The ability to store the in-development version in this package also means that the branch argument is superseded, but it's still a good fallback if someone really needs an in-development schema with the released version of the package.

@zkamvar zkamvar requested a review from annakrystalli January 6, 2025 16:02
@zkamvar zkamvar added the upkeep maintenance, infrastructure, and similar label Jan 8, 2025
Copy link
Member

@annakrystalli annakrystalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

Overall I've approved but I have made a number of suggestions, the most important one being a change in the name of the environment variable used to specify the branch to install from.

I also added a few suggestions and some requests for clarifications. I don;t feel the suggestions are blockers and could be filed as separate issues to tackle in the near future instead.

I do feel we should add another bullet to the release instruction in the schemas repo to make sure to update hubUtils with the released version of the schema and make a release of hubUtils also.

.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
.github/CONTRIBUTING.md Show resolved Hide resolved
.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
.github/CONTRIBUTING.md Outdated Show resolved Hide resolved
data-raw/schemas.R Outdated Show resolved Hide resolved
data-raw/schemas.R Show resolved Hide resolved
format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

get_latest_commit <- function(branch) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to also always check the main branch and to also flag if there is a later commit in the main branch than the latest one in a branch? This could help ensure versions we don't forget to update hubUtils when a version is released?

This also makes me think we should add another bullet to the release instruction in the schemas repo to make sure to update hubUtils with the released version of the schema and make a release of hubUtils also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to also always check the main branch and to also flag if there is a later commit in the main branch than the latest one in a branch? This could help ensure versions we don't forget to update hubUtils when a version is released?

I think this is reasonable.

This also makes me think we should add another bullet to the release instruction in the schemas repo to make sure to update hubUtils with the released version of the schema and make a release of hubUtils also.

Definitely an issue to open on the schemas repo after this is done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 0e51294

data-raw/schemas.R Outdated Show resolved Hide resolved
data-raw/schemas.R Outdated Show resolved Hide resolved
@zkamvar zkamvar merged commit 8af0884 into main Jan 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upkeep maintenance, infrastructure, and similar
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Deliver schemas with the hubUtils package instead of via GitHub
2 participants