diff --git a/.Rbuildignore b/.Rbuildignore
index b029964..3cd6e2c 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -14,6 +14,7 @@
 ^\.manifest\.json$
 ^data/.*$
 ^\.Renviron$
+^\.Renviron-example$
 ^appveyor\.yml$
 ^CODE_OF_CONDUCT\.md$
 ^onboarding-submission\.md$
diff --git a/CODE_OF_CONDUCT.md b/.github/CODE_OF_CONDUCT.md
similarity index 100%
rename from CODE_OF_CONDUCT.md
rename to .github/CODE_OF_CONDUCT.md
diff --git a/.gitignore b/.gitignore
index 83e9b44..4a03414 100644
--- a/.gitignore
+++ b/.gitignore
@@ -13,3 +13,5 @@ inst/doc
 *.bz2
 data/*
 .DS_Store
+CRAN_SUBMISSION
+CRAN_RELEASE
diff --git a/CRAN-RELEASE b/CRAN-RELEASE
deleted file mode 100644
index 69e6839..0000000
--- a/CRAN-RELEASE
+++ /dev/null
@@ -1,2 +0,0 @@
-This package was submitted to CRAN on 2021-08-05.
-Once it is accepted, delete this file and tag the release (commit 257610d).
diff --git a/CRAN-SUBMISSION b/CRAN-SUBMISSION
deleted file mode 100644
index 08640f9..0000000
--- a/CRAN-SUBMISSION
+++ /dev/null
@@ -1,3 +0,0 @@
-Version: 0.1.5
-Date: 2023-07-10 00:41:17 UTC
-SHA: a94ec4b5c91f4883d03685651555e1d05d5e1a87
diff --git a/DESCRIPTION b/DESCRIPTION
index e40cf5d..da26aa9 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -44,6 +44,8 @@ Imports:
     rlang
 Suggests:
     spelling,
+    duckdbfs,
+    duckdb,
     readr,
     covr,
     testthat,
diff --git a/README.Rmd b/README.Rmd
index 2c77b09..d9dce35 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -29,74 +29,92 @@ knitr::opts_chunk$set(
 [![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
 
+`{piggyback}` provides an R interface for storing files as GitHub release assets,
+which is a convenient way for large/binary data files to _piggyback_ onto public
+and private GitHub repositories. This package includes functions for file downloads,
+uploads, and managing releases, which are implemented as calls to the GitHub API.
-Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger ([up to 2 GB per file](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries)) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.
-
-
-
+No authentication is required to download data from public repositories.
 
 ## Installation
 
-
-Install from CRAN via
-
-``` r
+Install from CRAN via:
+```r
 install.packages("piggyback")
 ```
-
-You can install the development version from [GitHub](https://github.com/ropensci/piggyback) with:
-
-``` r
-# install.packages("devtools")
-devtools::install_github("ropensci/piggyback")
+You can install the development version from [GitHub](https://github.com/ropensci/piggyback)
+with either r-universe or remotes:
+```r
+install.packages("piggyback", repos = c('https://ropensci.r-universe.dev', getOption("repos")))
+# install.packages("remotes")
+remotes::install_github("ropensci/piggyback")
 ```
+## Usage
+See the [getting started vignette](https://docs.ropensci.org/piggyback/articles/intro.html)
+for a more comprehensive introduction.
 
-## Quickstart
-
-See the [piggyback vignette](https://docs.ropensci.org/piggyback/articles/intro.html) for details on authentication and additional package functionality.
-
-Piggyback can download data attached to a release on any repository:
-
-```{r results="hide"}
+Download data attached to a GitHub release:
+```r
 library(piggyback)
-pb_download("iris.tsv.gz", repo = "cboettig/piggyback-tests", dest = tempdir())
+pb_download("iris2.tsv.gz",
+            repo = "cboettig/piggyback-tests",
+            tag = "v0.0.1",
+            dest = tempdir())
+#> ℹ Downloading "iris2.tsv.gz"...
+#>   |======================================================| 100%
+fs::dir_tree(tempdir())
+#> /tmp/RtmpWxJSZj
+#> └── iris2.tsv.gz
 ```
-
-
-Downloading from private repos or uploading to any repo requires authentication, so be sure to set a `GITHUB_TOKEN` (or `GITHUB_PAT`) environmental variable, or include the `.token` argument. Omit the file name to download all attached objects. Omit the repository name to default to the current repository. See [introductory vignette](https://docs.ropensci.org/piggyback/articles/intro.html) or function documentation for details.
-
-We can also upload data to any existing release (defaults to `latest`):
-
-```{r eval=FALSE}
-## We'll need some example data first.
-## Pro tip: compress your tabular data to save space & speed upload/downloads
+Downloading from private repos or uploading to any repo requires authentication,
+specifically a GitHub Personal Access Token (PAT). This can be stored as a
+[gh::gh_token()](https://usethis.r-lib.org/articles/git-credentials.html#get-a-personal-access-token-pat)
+or a GITHUB_PAT environment variable - for more information, see the vignette notes on
+[authentication](https://docs.ropensci.org/piggyback/articles/piggyback.html#authentication).
+
+We can also upload data to a release. Start by creating a release:
+```r
+pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
+#> ✔ Created new release "v0.0.2".
+```
+then upload to it:
+```r
 readr::write_tsv(mtcars, "mtcars.tsv.gz")
-
 pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
+#> ℹ Uploading to latest release: "v0.0.2".
+#> ℹ Uploading mtcars.tsv.gz ...
+#>   |===================================================| 100%
 ```
 
-## Git LFS and other alternatives
-
-`piggyback` acts like a poor soul's [Git LFS](https://git-lfs.com/). Git LFS is not only expensive, it also [breaks GitHub's collaborative model](https://angryfrenchman.org/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91) -- basically if someone wants to submit a PR with a simple edit to your docs, they cannot fork your repository since that would otherwise count against your Git LFS storage. Unlike Git LFS, `piggyback` doesn't take over your standard `git` client, it just perches comfortably on the shoulders of your existing GitHub API. Data can be versioned by `piggyback`, but relative to `git LFS` versioning is less strict: uploads can be set as a new version or allowed to overwrite previously uploaded data.
+For improved performance, we can also use piggyback files with
+[cloud native](https://docs.ropensci.org/piggyback/articles/cloud_native.html)
+workflows to query data without downloading it first.
 
-## But what will GitHub think of this?
+## Motivations
 
-[GitHub documentation](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project:
+A brief video overview presented as part of Tan Ho's [RStudioConf2022 talk](https://www.youtube.com/watch?v=wzcz4xNGeTI&t=655s):
 
-![](man/figures/github-policy.png)
+https://github.com/ropensci/piggyback/assets/38083823/a1dff640-1bba-4c06-bad2-feda34f47387
 
+`piggyback` allows you to store data alongside your repository as release assets,
+which helps you:
 
-Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.
+- store files larger than 50MB
+- bypass the 2GB GitHub repo size limit
+- avoid the [downsides](https://archive.is/3D16r) of Git LFS
+- version data flexibly (by creating/uploading to a new release)
+- work with public and private repositories, **for free**
 
-&nbsp;
+For more about motivations, see this discussion of
+[alternatives](https://docs.ropensci.org/piggyback/articles/alternatives.html).
 
-Also see our [vignette comparing alternatives](https://docs.ropensci.org/piggyback/articles/alternatives.html).
+## Contributing
 
-----
-
-Please note that this project is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).
+Please note that this project is released with a
+[Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).
 By participating in this project you agree to abide by its terms.
 
 ```{r include=FALSE}
@@ -104,5 +122,4 @@ unlink("*.gz")
 codemeta::write_codemeta()
 ```
 
-
 [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
diff --git a/README.md b/README.md
index af9bb79..b7ccec5 100644
--- a/README.md
+++ b/README.md
@@ -17,107 +17,108 @@ Status](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci
 [![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
 
-Because larger (&gt; 50 MB) data files cannot easily be committed to
-git, a different approach is required to manage data associated with an
-analysis in a GitHub repository. This package provides a simple
-work-around by allowing larger ([up to 2 GB per
-file](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries))
-data files to piggyback on a repository as assets attached to individual
-GitHub releases. These files are not handled by git in any way, but
-instead are uploaded, downloaded, or edited directly by calls through
-the GitHub API. These data files can be versioned manually by creating
-different releases. This approach works equally well with public or
-private repositories. Data can be uploaded and downloaded
-programmatically from scripts. No authentication is required to download
-data from public repositories.
+`{piggyback}` provides an R interface for storing files as GitHub
+release assets, which is a convenient way for large/binary data files to
+*piggyback* onto public and private GitHub repositories. This package
+includes functions for file downloads, uploads, and managing releases,
+which are implemented as calls to the GitHub API.
+
+No authentication is required to download data from public repositories.
 
 ## Installation
 
-Install from CRAN via
+Install from CRAN via:
 
 ``` r
 install.packages("piggyback")
 ```
 
 You can install the development version from
-[GitHub](https://github.com/ropensci/piggyback) with:
+[GitHub](https://github.com/ropensci/piggyback) with either r-universe
+or remotes:
 
 ``` r
-# install.packages("devtools")
-devtools::install_github("ropensci/piggyback")
+install.packages("piggyback", repos = c('https://ropensci.r-universe.dev', getOption("repos")))
+# install.packages("remotes")
+remotes::install_github("ropensci/piggyback")
 ```
 
-## Quickstart
+## Usage
 
-See the [piggyback
-vignette](https://docs.ropensci.org/piggyback/articles/intro.html) for
-details on authentication and additional package functionality.
+See the [getting started
+vignette](https://docs.ropensci.org/piggyback/articles/intro.html) for a
+more comprehensive introduction.
 
-Piggyback can download data attached to a release on any repository:
+Download data attached to a GitHub release:
 
 ``` r
 library(piggyback)
-pb_download("iris.tsv.gz", repo = "cboettig/piggyback-tests", dest = tempdir())
-#> Warning in pb_download("iris.tsv.gz", repo = "cboettig/piggyback-tests", :
-#> file(s) iris.tsv.gz not found in repo cboettig/piggyback-tests
+pb_download("iris2.tsv.gz",
+            repo = "cboettig/piggyback-tests",
+            tag = "v0.0.1",
+            dest = tempdir())
+#> ℹ Downloading "iris2.tsv.gz"...
+#>   |======================================================| 100%
+fs::dir_tree(tempdir())
+#> /tmp/RtmpWxJSZj
+#> └── iris2.tsv.gz
 ```
 
 Downloading from private repos or uploading to any repo requires
-authentication, so be sure to set a `GITHUB_TOKEN` (or `GITHUB_PAT`)
-environmental variable, or include the `.token` argument. Omit the file
-name to download all attached objects. Omit the repository name to
-default to the current repository. See [introductory
-vignette](https://docs.ropensci.org/piggyback/articles/intro.html) or
-function documentation for details.
+authentication, specifically a GitHub Personal Access Token (PAT). This
+can be stored as a
+[gh::gh_token()](https://usethis.r-lib.org/articles/git-credentials.html#get-a-personal-access-token-pat)
+or a GITHUB_PAT environment variable - for more information, see the
+vignette notes on
+[authentication](https://docs.ropensci.org/piggyback/articles/piggyback.html#authentication).
 
-We can also upload data to any existing release (defaults to `latest`):
+We can also upload data to a release. Start by creating a release:
 
 ``` r
-## We'll need some example data first.
-## Pro tip: compress your tabular data to save space & speed upload/downloads
-readr::write_tsv(mtcars, "mtcars.tsv.gz")
+pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
+#> ✔ Created new release "v0.0.2".
+```
+
+then upload to it:
 
+``` r
+readr::write_tsv(mtcars, "mtcars.tsv.gz")
 pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
+#> ℹ Uploading to latest release: "v0.0.2".
+#> ℹ Uploading mtcars.tsv.gz ...
+#>   |===================================================| 100%
 ```
 
-## Git LFS and other alternatives
-
-`piggyback` acts like a poor soul’s [Git
-LFS](https://git-lfs.com/). Git LFS is not only expensive, it
-also [breaks GitHub’s collaborative
-model](https://angryfrenchman.org/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91)
-– basically if someone wants to submit a PR with a simple edit to your
-docs, they cannot fork your repository since that would otherwise count
-against your Git LFS storage. Unlike Git LFS, `piggyback` doesn’t take
-over your standard `git` client, it just perches comfortably on the
-shoulders of your existing GitHub API. Data can be versioned by
-`piggyback`, but relative to `git LFS` versioning is less strict:
-uploads can be set as a new version or allowed to overwrite previously
-uploaded data.
+For improved performance, we can also use piggyback files with [cloud
+native](https://docs.ropensci.org/piggyback/articles/cloud_native.html)
+workflows to query data without downloading it first.
 
-## But what will GitHub think of this?
+## Motivations
 
-[GitHub
-documentation](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries)
-at the time of writing endorses the use of attachments to releases as a
-solution for distributing large files as part of your project:
+A brief video overview presented as part of Tan Ho’s [RStudioConf2022
+talk](https://www.youtube.com/watch?v=wzcz4xNGeTI&t=655s):
 
-![](man/figures/github-policy.png)
+<https://github.com/ropensci/piggyback/assets/38083823/a1dff640-1bba-4c06-bad2-feda34f47387>
 
-Of course, it will be up to GitHub to decide if this use of release
-attachments is acceptable in the long term.
+`piggyback` allows you to store data alongside your repository as
+release assets, which helps you:
 
-&nbsp;
+- store files larger than 50MB
+- bypass the 2GB GitHub repo size limit
+- avoid the [downsides](https://archive.is/3D16r) of Git LFS
+- version data flexibly (by creating/uploading to a new release)
+- work with public and private repositories, **for free**
 
-Also see our [vignette comparing
-alternatives](https://docs.ropensci.org/piggyback/articles/alternatives.html).
+For more about motivations, see this discussion of
+[alternatives](https://docs.ropensci.org/piggyback/articles/alternatives.html).
 
-------------------------------------------------------------------------
+## Contributing
 
 Please note that this project is released with a [Contributor Code of
 Conduct](https://ropensci.org/code-of-conduct/). By participating in
 this project you agree to abide by its terms.
 
-[![ropensci\_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org) +[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org) diff --git a/codemeta.json b/codemeta.json index 53d32c2..af1aeb7 100644 --- a/codemeta.json +++ b/codemeta.json @@ -7,13 +7,13 @@ "codeRepository": "https://github.com/ropensci/piggyback", "issueTracker": "https://github.com/ropensci/piggyback/issues", "license": "https://spdx.org/licenses/GPL-3.0", - "version": "0.1.1.9002", + "version": "0.1.5.9003", "programmingLanguage": { "@type": "ComputerLanguage", "name": "R", "url": "https://r-project.org" }, - "runtimePlatform": "R version 4.1.0 (2021-05-18)", + "runtimePlatform": "R version 4.3.2 (2023-10-31)", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", @@ -27,6 +27,12 @@ "familyName": "Boettiger", "email": "cboettig@gmail.com", "@id": "https://orcid.org/0000-0002-1642-628X" + }, + { + "@type": "Person", + "givenName": "Tan", + "familyName": "Ho", + "@id": "https://orcid.org/0000-0001-8388-5155" } ], "contributor": [ @@ -47,12 +53,6 @@ "givenName": "Kevin", "familyName": "Kuo", "@id": "https://orcid.org/0000-0001-7803-7901" - }, - { - "@type": "Person", - "givenName": "Tan", - "familyName": "Ho", - "@id": "https://orcid.org/0000-0001-8388-5155" } ], "copyrightHolder": [ @@ -124,56 +124,51 @@ }, { "@type": "SoftwareApplication", - "identifier": "gert", - "name": "gert", + "identifier": "knitr", + "name": "knitr", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=gert" - }, - { - "@type": "SoftwareApplication", - "identifier": "datasets", - "name": "datasets" + "sameAs": "https://CRAN.R-project.org/package=knitr" }, { "@type": "SoftwareApplication", - "identifier": "knitr", - "name": "knitr", + "identifier": "rmarkdown", + "name": "rmarkdown", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=knitr" + "sameAs": "https://CRAN.R-project.org/package=rmarkdown" }, { "@type": "SoftwareApplication", - "identifier": "rmarkdown", - "name": "rmarkdown", + "identifier": "gert", + "name": "gert", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=rmarkdown" + "sameAs": "https://CRAN.R-project.org/package=gert" }, { "@type": "SoftwareApplication", - "identifier": "usethis", - "name": "usethis", + "identifier": "withr", + "name": "withr", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=usethis" + "sameAs": "https://CRAN.R-project.org/package=withr" }, { "@type": "SoftwareApplication", @@ -263,29 +258,29 @@ }, "7": { "@type": "SoftwareApplication", - "identifier": "lubridate", - "name": "lubridate", + "identifier": "memoise", + "name": "memoise", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=lubridate" + "sameAs": 
"https://CRAN.R-project.org/package=memoise" }, "8": { "@type": "SoftwareApplication", - "identifier": "memoise", - "name": "memoise", + "identifier": "rlang", + "name": "rlang", "provider": { "@id": "https://cran.r-project.org", "@type": "Organization", "name": "Comprehensive R Archive Network (CRAN)", "url": "https://cran.r-project.org" }, - "sameAs": "https://CRAN.R-project.org/package=memoise" + "sameAs": "https://CRAN.R-project.org/package=rlang" }, "SystemRequirements": null }, - "fileSize": "357.526KB" + "fileSize": "380.757KB" } diff --git a/man/pb_download.Rd b/man/pb_download.Rd index 89a10ff..9b0b082 100644 --- a/man/pb_download.Rd +++ b/man/pb_download.Rd @@ -65,7 +65,7 @@ Download data from an existing release list.files(dest) }) \dontshow{ - unlink(list.files(dest, full.names = TRUE)) + try(unlink(list.files(dest, full.names = TRUE))) } } } diff --git a/vignettes/alternatives.Rmd b/vignettes/alternatives.Rmd index 65af3da..c5a2b9b 100644 --- a/vignettes/alternatives.Rmd +++ b/vignettes/alternatives.Rmd @@ -1,7 +1,7 @@ --- title: "Piggyback comparison to alternatives" author: "Carl Boettiger" -date: "`r Sys.Date()`" +date: "2018-09-18" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{alternatives} diff --git a/vignettes/cloud_native.Rmd b/vignettes/cloud_native.Rmd new file mode 100644 index 0000000..238de4e --- /dev/null +++ b/vignettes/cloud_native.Rmd @@ -0,0 +1,283 @@ +--- +title: "Cloud native workflows with piggyback" +output: rmarkdown::html_vignette +author: "Tan Ho, Carl Boettiger" +date: "2023-12-26" +vignette: > + %\VignetteIndexEntry{Cloud native workflows with piggyback} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + eval = FALSE +) +``` +```{r} +library(piggyback) +``` +## Data Too Big To Fit In Memory + +One of the primary advantages of `piggyback` is the ability to store a lot of +fairly large files. This is also potentially the source of some frustrations: +piggyback assets may potentially be quite large (too large to fit in RAM) and +difficult to work with once they have been uploaded to the release. + +There are a substantial and rapidly growing number of packages that are able to +work with data on-disk without reading the whole thing into memory, including +[`terra`](https://rspatial.github.io/terra/), [`stars`](https://r-spatial.github.io/stars/), +and [`sf`](https://r-spatial.github.io/sf/index.html) for large spatial assets, +as well as [`arrow`](https://arrow.apache.org/docs/r/) and +[`duckdb`](https://duckdb.org/docs/api/r.html) for tabular data. + +Going a step further, such libraries now also make it possible to not only skip +the 'read twice' pattern of downloading once to disk and reading to disk, but can +let you skip ever reading the whole data file into R at all - for instance, spatial +packages can use GDAL's [virtual file system](https://gdal.org/user/virtual_file_systems.html). + +`arrow` and `duckdb` can do similar tricks on parquet and csv files, allowing +users to leverage functions like `dplyr::select()` and `dplyr::filter()` directly +on the remote data source to access only the subset of rows/columns they need. +Subsetting data directly from a URL in this manner thus has the performance benefit +of reading directly into memory while also having the added benefit of allowing +more efficient and bigger-than-RAM workflows. This is sometimes referred to as +**cloud-native** workflows. 
+
+## nflverse play by play
+
+This vignette shows some examples of using `duckdb` for querying larger datasets,
+using example data from the [`nflverse`](https://github.com/nflverse) project
+for NFL football analytics. (Consult the nflverse's [nflreadr](https://nflreadr.nflverse.com)
+package if you're looking to work with NFL data beyond this example.)
+
+The [nflverse/nflverse-data](https://github.com/nflverse/nflverse-data/releases)
+data repository is organized into one release per dataframe, with each dataframe
+typically sharded into multiple files (and file formats) by season. Here's a brief
+glimpse at how this looks under the piggyback lens:
+
+```{r}
+pb_releases("nflverse/nflverse-data")
+#> # A data.frame: 20 × 10
+#>    release_name release_id release_body        tag_name draft created_at published_at
+#>
+#>  1 rosters        58152863 "Roster data, acce… rosters  FALSE 2022-01-2… 2022-01-28T…
+#>  2 player_stats   58152881 "Play by play data… player_… FALSE 2022-01-2… 2022-01-28T…
+#>  3 pbp            58152862 "Play by play data… pbp      FALSE 2022-01-2… 2022-01-28T…
+#>  4 pfr_advstats   58152981 "PFR Adv Stats dat… pfr_adv… FALSE 2022-01-2… 2022-01-28T…
+#>  5 depth_charts   58152948 "Depth chart data,… depth_c… FALSE 2022-01-2… 2022-01-28T…
+#> # ℹ 15 more rows
+#> # ℹ 3 more variables: html_url , upload_url , n_assets
+#> # ℹ Use `print(n = ...)` to see more rows
+
+pb_list(repo = "nflverse/nflverse-data", tag = "pbp")
+#> # A data.frame: 148 × 6
+#>    file_name                     size timestamp           tag   owner repo
+#>
+#>  1 play_by_play_2023.rds     12308832 2023-12-26 17:10:52 pbp   nflv… nflv…
+#>  2 play_by_play_2023.parquet 17469950 2023-12-26 17:11:02 pbp   nflv… nflv…
+#>  3 play_by_play_2023.csv     84490319 2023-12-26 17:10:58 pbp   nflv… nflv…
+#>  4 play_by_play_2022.rds     14387514 2023-02-28 09:25:26 pbp   nflv… nflv…
+#>  5 play_by_play_2022.parquet 20003378 2023-02-28 09:25:35 pbp   nflv… nflv…
+#>  6 play_by_play_2022.csv     97205016 2023-02-28 09:25:31 pbp   nflv… nflv…
+#> # ℹ 143 more rows
+#> # ℹ Use `print(n = ...)` to see more rows
+
+pb_download_url(
+  "play_by_play_2023.csv",
+  repo = "nflverse/nflverse-data",
+  tag = "pbp"
+) |>
+  read.csv() |>
+  dplyr::glimpse()
+#> Rows: 42,066
+#> Columns: 372
+#> $ play_id      1, 39, 55, 77, 102, 124, 147…
+#> $ game_id      "2023_01_ARI_WAS", "2023_01_…
+#> $ home_team    "WAS", "WAS", "WAS", "WAS", …
+#> $ away_team    "ARI", "ARI", "ARI", "ARI", …
+#> $ season_type  "REG", "REG", "REG", "REG", …
+#> $ week         1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
+#> $ posteam      "", "WAS", "WAS", "WAS", "WA…
+#> $ posteam_type "", "home", "home", "home", …
+#> $ defteam      "", "ARI", "ARI", "ARI", "AR…
+#> $ yardline_100 NA, 35, 75, 72, 66, 64, 64, …
+#> $ down         NA, NA, 1, 2, 3, 1, 2, 1, 2,…
+#> $ play_type    "", "kickoff", "run", "pass"…
+```
+We'll look at the play by play release data and try to calculate some summary
+statistics, without downloading it or reading it all into RAM...
+
+## DuckDB
+
+Packages used in this section:
+```{r}
+library(piggyback)
+library(DBI)
+library(duckdb)
+library(dplyr)
+library(glue)
+library(tictoc)
+```
+First, initialize duckdb and install/load `httpfs` (short for http file system):
+```{r}
+conn <- DBI::dbConnect(duckdb::duckdb())
+DBI::dbExecute(conn, "INSTALL 'httpfs'; LOAD 'httpfs';")
+```
+Next, we'll need to get all of the relevant play-by-play URLs from the release -
+we can do this with `pb_download_url` - and pass them into duckdb's
+[read_parquet function](https://duckdb.org/docs/data/multiple_files/overview):
+```{r}
+tictoc::tic()
+pbp_urls <- pb_download_url(repo = "nflverse/nflverse-data", tag = "pbp")
+# keep only the ones matching the desired regex pattern, "play_by_play_####.parquet"
+pbp_urls <- pbp_urls[grepl("play_by_play_\\d+.parquet", pbp_urls)]
+
+query <- glue::glue_sql("SELECT COUNT(*) as row_count FROM read_parquet([{pbp_urls *}])", .con = conn)
+
+DBI::dbGetQuery(conn = conn, query)
+#>   row_count
+#> 1   1190783
+tictoc::toc()
+#> 2.845 sec elapsed
+```
+Now, we can construct a SQL query that summarizes the data:
+```{r}
+tictoc::tic()
+query <- glue::glue_sql(
+  "
+   SELECT
+     season,
+     posteam,
+     play_type,
+     COUNT(play_id) AS n_plays,
+     AVG(epa) AS epa_per_play
+   FROM read_parquet([{pbp_urls *}], filename = true)
+   WHERE filename SIMILAR TO '.*(2021|2022|2023).*'
+     AND (pass = 1 OR rush = 1)
+   GROUP BY season, posteam, play_type
+   ORDER BY season DESC, posteam ASC, n_plays DESC
+  ",
+  .con = conn
+)
+
+DBI::dbGetQuery(conn = conn, query)
+#> # A data.frame: 288 × 5
+#>   season posteam play_type n_plays epa_per_play
+#>
+#> 1   2023 ARI     pass          539      -0.231
+#> 2   2023 ARI     run           391       0.0351
+#> 3   2023 ARI     no_play        48       0.191
+#> 4   2023 ATL     pass          499      -0.0738
+#> 5   2023 ATL     run           465      -0.103
+#> # ℹ 283 more rows
+#> # ℹ Use `print(n = ...)` to see more rows
+
+tictoc::toc()
+#> 3.343 sec elapsed
+```
+You can also turn this into a view and query it with dbplyr/dplyr instead:
+```{r}
+query <- glue::glue_sql(
+  "
+  CREATE VIEW pbp AS
+  SELECT
+    *
+  FROM read_parquet([{pbp_urls *}], filename = true)
+  ",
+  .con = conn
+)
+DBI::dbExecute(conn, query)
+pbp <- dplyr::tbl(conn, "pbp")
+tictoc::tic()
+pbp |>
+  dplyr::filter(grepl("2021|2022|2023", filename), pass == 1 | rush == 1) |>
+  dplyr::summarise(
+    n_plays = dplyr::n(),
+    epa_per_play = mean(epa, na.rm = TRUE),
+    .by = c(season, posteam, play_type)
+  ) |>
+  dplyr::arrange(
+    desc(season), posteam, desc(n_plays)
+  ) |>
+  dplyr::collect()
+#> # A tibble: 288 × 5
+#>   season posteam play_type n_plays epa_per_play
+#>
+#> 1   2023 ARI     pass          539      -0.231
+#> 2   2023 ARI     run           391       0.0351
+#> 3   2023 ARI     no_play        48       0.191
+#> 4   2023 ATL     pass          499      -0.0738
+#> 5   2023 ATL     run           465      -0.103
+#> # ℹ 283 more rows
+#> # ℹ Use `print(n = ...)` to see more rows
+tictoc::toc()
+#> 3.491 sec elapsed
+```
+
+Using duckdb certainly adds a little verbosity - in exchange, we've managed to
+query and summarize the 20+ parquet files totaling 1M+ rows without having
+to load them all into memory!
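+
+When you are finished with these queries, it is good practice to release the
+database resources (a small cleanup sketch; `conn` is the DBI connection
+created above):
+```{r}
+# close the connection and shut down the embedded duckdb instance
+DBI::dbDisconnect(conn, shutdown = TRUE)
+```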
+ +### duckdbfs + +[duckdbfs](https://cran.r-project.org/package=duckdbfs) was developed to wrap +this latter workflow into a single function call that accepts a vector of URLs: +```{r} +library(duckdbfs) +pbp <- duckdbfs::open_dataset(pbp_urls, filename = TRUE) +tictoc::tic() +pbp |> + dplyr::filter(grepl("2021|2022|2023", filename), pass == 1 | rush == 1) |> + dplyr::summarise( + n_plays = dplyr::n(), + epa_per_play = mean(epa, na.rm = TRUE), + .by = c(season, posteam, play_type) + ) |> + dplyr::arrange( + desc(season), posteam, desc(n_plays) + ) |> + dplyr::collect() +#> # A tibble: 288 × 5 +#> season posteam play_type n_plays epa_per_play +#> +#> 1 2023 ARI pass 539 -0.231 +#> 2 2023 ARI run 391 0.0351 +#> 3 2023 ARI no_play 48 0.191 +#> 4 2023 ATL pass 499 -0.0738 +#> 5 2023 ATL run 465 -0.103 +#> # ℹ 283 more rows +#> # ℹ Use `print(n = ...)` to see more rows +tictoc::toc() +#> 3.492 sec elapsed +``` + + + + diff --git a/vignettes/intro.Rmd b/vignettes/intro.Rmd deleted file mode 100644 index 9547ee6..0000000 --- a/vignettes/intro.Rmd +++ /dev/null @@ -1,181 +0,0 @@ ---- -title: "Piggyback Data atop your GitHub Repository!" -author: "Carl Boettiger" -date: "`r Sys.Date()`" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{piggyback} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r setup, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>", - results="hide", - eval = Sys.getenv("CBOETTIG_TOKEN", FALSE) -) - -Sys.setenv(piggyback_cache_duration=0) - -``` - - - -# Why `piggyback`? - -`piggyback` grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. As we frequently want to share and run code that depends on >50MB data files on each of our own machines, on continuous integration, and on larger computational servers, data sharing quickly becomes a bottleneck. - -[GitHub allows](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or bandwidth to deliver them. - -## Installation - -Install the latest release from CRAN using: - -``` r -install.packages("piggyback") -``` - -You can install the development version from [GitHub](https://github.com/) with: - -``` r -# install.packages("devtools") -devtools::install_github("ropensci/piggyback") -``` - -## Authentication - -No authentication is required to download data from *public* GitHub repositories using `piggyback`. Nevertheless, `piggyback` recommends setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from *private* repositories, you will need to authenticate first. - -To do so, add your [GitHub Token](https://github.com/settings/tokens/new?scopes=repo,gist&description=R:GITHUB_PAT) to an environmental variable, e.g. in a `.Renviron` file in your home directory or project directory (any private place you won't upload), see `usethis::edit_r_environ()`. For one-off use you can also set your token from the R console using: - -```r -Sys.setenv(GITHUB_PAT="xxxxxx") -``` - -But try to avoid putting `Sys.setenv()` in any R scripts -- remember, the goal here is to avoid writing your private token in any file that might be shared, even privately. 
- -For more information, please see the [usethis guide to GitHub credentials](https://usethis.r-lib.org/articles/git-credentials.html) - -## Downloading data - -Download the latest version or a specific version of the data: - -```r -library(piggyback) -``` - -```r -pb_download("iris2.tsv.gz", - repo = "cboettig/piggyback-tests", - tag = "v0.0.1", - dest = tempdir()) -``` - -**Note**: Whenever you are working from a location inside a git repository corresponding to your GitHub repo, you can simply omit the `repo` argument and it will be detected automatically. Likewise, if you omit the release `tag`, the `pb_download` will simply pull data from most recent release (`latest`). Third, you can omit `tempdir()` if you are using an RStudio Project (`.Rproj` file) in your repository, and then the download location will be relative to Project root. `tempdir()` is used throughout the examples only to meet CRAN policies and is unlikely to be the choice you actually want here. - - -Lastly, simply omit the file name to download all assets connected with a given release. - -```r -pb_download(repo = "cboettig/piggyback-tests", - tag = "v0.0.1", - dest = tempdir()) -``` - -These defaults mean that in most cases, it is sufficient to simply call `pb_download()` without additional arguments to pull in any data associated with a project on a GitHub repo that is too large to commit to git directly. - -`pb_download()` will skip the download of any file that already exists locally if the timestamp on the local copy is more recent than the timestamp on the GitHub copy. `pb_download()` also includes arguments to control the timestamp behavior, progress bar, whether existing files should be overwritten, or if any particular files should not be downloaded. See function documentation for details. - - -Sometimes it is preferable to have a URL from which the data can be read in directly, rather than downloading the data to a local file. For example, such a URL can be embedded directly into another R script, avoiding any dependence on `piggyback` (provided the repository is already public.) To get a list of URLs rather than actually downloading the files, use `pb_download_url()`: - -```r -pb_download_url("data/mtcars.tsv.gz", - repo = "cboettig/piggyback-tests", - tag = "v0.0.1") -``` - -## Uploading data - -If your GitHub repository doesn't have any [releases](https://docs.github.com/en/github/administering-a-repository/managing-releases-in-a-repository) yet, `piggyback` will help you quickly create one. Create new releases to manage multiple versions of a given data file. While you can create releases as often as you like, making a new release is by no means necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there. - -```r -pb_new_release("cboettig/piggyback-tests", "v0.0.2") -``` - -Once we have at least one release available, we are ready to upload. By default, `pb_upload` will attach data to the latest release. - -```r -## We'll need some example data first. -## Pro tip: compress your tabular data to save space & speed upload/downloads -readr::write_tsv(mtcars, "mtcars.tsv.gz") - -pb_upload("mtcars.tsv.gz", - repo = "cboettig/piggyback-tests", - tag = "v0.0.1") -``` - -Like `pb_download()`, `pb_upload()` will overwrite any file of the same name already attached to the release file by default, unless the timestamp the previously uploaded version is more recent. 
You can toggle these settings with `overwrite=FALSE` and `use_timestamps=FALSE`. - - -## Additional convenience functions - -List all files currently piggybacking on a given release. Omit the `tag` to see files on all releases. - - -```r -pb_list(repo = "cboettig/piggyback-tests", - tag = "v0.0.1") -``` - -Delete a file from a release: - -```r -pb_delete(file = "mtcars.tsv.gz", - repo = "cboettig/piggyback-tests", - tag = "v0.0.1") -``` - -Note that this is irreversible unless you have a copy of the data elsewhere. - -## Multiple files - -You can pass in a vector of file paths with something like `list.files()` to the `file` argument of `pb_upload()` in order to upload multiple files. Some common patterns: - -```r -library(magrittr) - -## upload a folder of data -list.files("data") %>% - pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1") - -## upload certain file extensions -list.files(pattern = c("*.tsv.gz", "*.tif", "*.zip")) %>% - pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1") - -``` -Similarly, you can download all current data assets of the latest or specified release by using `pb_download()` with no arguments. - -## Caching - -To reduce API calls to GitHub, piggyback caches most calls with a timeout of 1 second by default. This avoids repeating identical requests to update it's internal record of the repository data (releases, assets, timestamps, etc) during programmatic use. You can increase or decrease this delay by setting the environmental variable in seconds, e.g. `Sys.setenv("piggyback_cache_duration"=10)` for a longer delay or `Sys.setenv("piggyback_cache_duration"=0)` to disable caching, and then restarting R. - -## Valid file names - -GitHub assets attached to a release do not support file paths, and will convert most special characters (`#`, `%`, etc) to `.` or throw an error (e.g. for file names containing `$`, `@`, `/`). piggyback will default to using the base name of the file only (i.e. will only use `"mtcars.csv"` if provided a file path like `"data/mtcars.csv"`) - -## A Note on GitHub Releases vs Data Archiving - -`piggyback` is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple "versions" in releases, as far as data assets uploaded by `piggyback` are concerned. The data files `piggyback` attaches to a Release can be deleted or modified at any time -- creating a new release to store data assets is the functional equivalent of just creating new directories `v0.1`, `v0.2` to store your data. (GitHub Releases are always pinned to a particular `git` tag, so the code/git-managed contents associated with repo are more immutable, but remember our data assets just piggyback on top of the repo). - -Permanent, published data should always be archived in a proper data repository with a DOI, such as [zenodo.org](https://zenodo.org). Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data, new releases are given new DOIs). `piggyback` is meant only to lower the friction of working with data during the research process. (e.g. provide data accessible to collaborators or continuous integration systems during research process, including for private repositories.) - -## What will GitHub think of this? 
-
-[GitHub documentation](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project:
-
-![](https://github.com/ropensci/piggyback/raw/83776863b34bb1c9962154608a5af41867a0622f/man/figures/github-policy.png)
-
-Of course, it will be up to GitHub to decide if this use of release attachments is acceptable in the long term.
diff --git a/vignettes/piggyback.Rmd b/vignettes/piggyback.Rmd
new file mode 100644
index 0000000..1e8e05a
--- /dev/null
+++ b/vignettes/piggyback.Rmd
@@ -0,0 +1,283 @@
+---
+title: "Piggyback Data atop your GitHub Repository!"
+author: "Carl Boettiger & Tan Ho"
+date: "2023-12-26"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{piggyback}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+```{r setup, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  results="hide",
+  eval = Sys.getenv("TAN_GH_TOKEN", FALSE)
+)
+
+Sys.setenv(piggyback_cache_duration=0)
+```
+
+## Why `piggyback`?
+
+`piggyback` grew out of the needs of students both in my classroom and in my research
+group, who frequently need to work with data files somewhat larger than one can
+conveniently manage by committing directly to GitHub. As we frequently want to
+share and run code that depends on >50MB data files on each of our own machines,
+on continuous integration, and on larger computational servers, data sharing
+quickly becomes a bottleneck.
+
+[GitHub allows](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries)
+repositories to attach files of up to 2 GB each to releases as a way to distribute
+large files associated with the project source code. There is no limit on the
+number of files or bandwidth to deliver them.
+
+## Authentication
+
+No authentication is required to download data from *public* GitHub repositories
+using `piggyback`. Nevertheless, `piggyback` recommends setting a token when
+possible to avoid rate limits. To upload data to any repository, or to download
+data from *private* repositories, you will need to authenticate first.
+
+`piggyback` uses the same GitHub Personal Access Token (PAT) that devtools, usethis, and
+friends use (`gh::gh_token()`). The current best practice for managing your GitHub
+credentials is detailed in this [usethis vignette](https://usethis.r-lib.org/articles/git-credentials.html).
+
+You can also add the token as an environment variable, which may be useful in
+situations where you use piggyback non-interactively (i.e. scheduled/automated scripts).
+Here are the relevant steps:
+
+- Create a [GitHub Token](https://github.com/settings/tokens/new?scopes=repo,gist&description=PIGGYBACK_PAT)
+- Add the environment variable. You can do this:
+  - via project-specific Renviron: `usethis::edit_r_environ("project")`. You should
+  then add the Renviron to your gitignore via `usethis::use_git_ignore(".Renviron")`.
+  **Avoid committing your GITHUB_PAT to the repository for security reasons!**
+  - via `Sys.setenv(GITHUB_PAT = "{your token}")` in your console for one-off usage.
+  Avoid adding this line to your R scripts -- remember, the goal here is to avoid
+  writing your private token in any file that might be shared, even privately.
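+
+To check which credentials piggyback will pick up, you can query the gh package
+directly (a quick sanity check; assumes the gh package is installed):
+```r
+# reports the GitHub user, scopes, and token that gh-based calls will use
+gh::gh_whoami()
+```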
+
+## Downloading data
+
+Download a file from a release:
+```r
+library(piggyback)
+pb_download("iris2.tsv.gz",
+            repo = "cboettig/piggyback-tests",
+            tag = "v0.0.1",
+            dest = tempdir())
+```
+```
+ℹ Downloading "iris2.tsv.gz"...
+  |======================================================| 100%
+```
+```r
+fs::dir_tree(tempdir())
+```
+```
+/tmp/RtmpWxJSZj
+└── iris2.tsv.gz
+```
+
+**Tips:**
+
+1. Whenever you are working from a location inside a git repository corresponding
+to your GitHub repo, you can simply omit the `repo` argument and it will be detected
+automatically.
+2. Likewise, if you omit the release `tag`, `pb_download` will simply pull data
+from the most recent release (`latest`).
+3. You can omit `tempdir()` if you are using an RStudio Project (`.Rproj` file)
+in your repository: download locations will be relative to Project root.
+`tempdir()` is used throughout the examples only to meet CRAN policies and is
+unlikely to be the choice you actually want here.
+4. Omit the file name to download all assets connected with a given release.
+```r
+pb_download(repo = "cboettig/piggyback-tests",
+            tag = "v0.0.1",
+            dest = tempdir())
+```
+```
+ℹ Downloading "diamonds.tsv.gz"...
+  |======================================================| 100%
+ℹ Downloading "iris.tsv.gz"...
+  |======================================================| 100%
+ℹ Downloading "iris.tsv.xz"...
+  |======================================================| 100%
+```
+```r
+fs::dir_tree(tempdir())
+```
+```
+/tmp/RtmpWxJSZj
+├── diamonds.tsv.gz
+├── iris.tsv.gz
+├── iris.tsv.xz
+└── iris2.tsv.gz
+```
+
+These defaults mean that in most cases, it is sufficient to simply call `pb_download()`
+without additional arguments to pull in any data associated with a project on a
+GitHub repo that is too large to commit to git directly.
+
+Notice that above, `iris2.tsv.gz` was not downloaded. `pb_download()` will skip
+downloading any file that already exists locally, if the timestamp on the local
+copy is more recent than the timestamp on the GitHub copy. Use the `overwrite`
+parameter to control this behaviour.
+
+`pb_download()` also includes arguments to control the progress bar or to skip
+particular files.
+
+### Download URLs
+
+Sometimes it is preferable to have a URL from which the data can be read in directly,
+rather than downloading the data to a local file. For example, such a URL can be
+embedded directly into another R script, avoiding any dependence on `piggyback`
+(provided the repository is already public). To get a list of URLs rather than
+actually downloading the files, use `pb_download_url()`:
+
+```r
+pb_download_url(repo = "cboettig/piggyback-tests",
+                tag = "v0.0.1")
+```
+```
+[1] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/diamonds.tsv.gz"
+[2] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.gz"
+[3] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.xz"
+[4] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris2.tsv.gz"
+```
+
+## Uploading data
+
+If your GitHub repository doesn't have any
+[releases](https://docs.github.com/en/github/administering-a-repository/managing-releases-in-a-repository)
+yet, `piggyback` will help you quickly create one. Create new releases to manage
+multiple versions of a given data file, or to organize sets of files.
+
+While you can create releases as often as you like, making a new release is not
+necessary each time you upload a file. If maintaining old versions of the data
+is not useful, you can stick with a single release and upload all of your data
+there.
+
+```r
+pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
+```
+```
+✔ Created new release "v0.0.2".
+```
+
+Once we have at least one release available, we are ready to upload. By default,
+`pb_upload` will attach data to the latest release.
+
+```r
+## We'll need some example data first.
+## Pro tip: compress your tabular data to save space & speed upload/downloads
+readr::write_tsv(mtcars, "mtcars.tsv.gz")
+
+pb_upload("mtcars.tsv.gz",
+          repo = "cboettig/piggyback-tests")
+```
+```
+ℹ Uploading to latest release: "v0.0.2".
+ℹ Uploading mtcars.tsv.gz ...
+  |===================================================| 100%
+```
+
+Like `pb_download()`, `pb_upload()` will overwrite any file of the same name already
+attached to the release by default, unless the timestamp of the previously
+uploaded version is more recent. You can toggle these settings with the `overwrite`
+parameter.
+
+### Multiple files
+
+You can pass in a vector of file paths with something like `list.files()` to the `file` argument of `pb_upload()` in order to upload multiple files. Some common patterns:
+
+```r
+library(magrittr)
+
+## upload a folder of data
+list.files("data") %>%
+  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
+
+## upload certain file extensions
+list.files(pattern = c("*.tsv.gz", "*.tif", "*.zip")) %>%
+  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
+
+```
+Similarly, you can download all current data assets of the latest or specified
+release by using `pb_download()` with no arguments.
+
+## Deleting Files
+
+Delete a file from a release:
+
+```r
+pb_delete(file = "mtcars.tsv.gz",
+          repo = "cboettig/piggyback-tests",
+          tag = "v0.0.1")
+```
+```
+ℹ Deleted "mtcars.tsv.gz" from "v0.0.1" release on "cboettig/piggyback-tests"
+```
+Note that this is irreversible unless you have a copy of the data elsewhere.
+
+## Listing Files
+
+List all files currently piggybacking on a given release. Omit `tag` to see
+files on all releases.
+
+```r
+pb_list(repo = "cboettig/piggyback-tests",
+        tag = "v0.0.1")
+```
+```
+        file_name   size           timestamp    tag    owner            repo
+1 diamonds.tsv.gz 571664 2021-09-07 23:38:31 v0.0.1 cboettig piggyback-tests
+2     iris.tsv.gz    846 2021-08-05 20:00:09 v0.0.1 cboettig piggyback-tests
+3     iris.tsv.xz    848 2020-03-07 06:18:32 v0.0.1 cboettig piggyback-tests
+4    iris2.tsv.gz    846 2018-10-05 17:04:33 v0.0.1 cboettig piggyback-tests
+```
+
+## Caching
+
+To reduce GitHub API calls, piggyback caches `pb_releases` and `pb_list` with a
+timeout of 10 minutes by default. This avoids repeating identical requests to
+update its internal record of the repository data (releases, assets, timestamps, etc)
+during programmatic use. You can increase or decrease this delay by setting the
+environment variable in seconds, e.g. `Sys.setenv("piggyback_cache_duration" = 10)`
+for a longer delay or `Sys.setenv("piggyback_cache_duration" = 0)` to disable caching,
+and then restarting R.
+
+## Valid file names
+
+GitHub assets attached to a release do not support file paths, and will convert
+most special characters (`#`, `%`, etc) to `.` or throw an error (e.g. for file
+names containing `$`, `@`, `/`). `piggyback` will default to using the base name of
+the file only (i.e. it will only use `"mtcars.csv"` if provided a file path like
+`"data/mtcars.csv"`).
+
+## A Note on GitHub Releases vs Data Archiving
+
+`piggyback` is not intended as a data archiving solution. Importantly, bear in
+mind that there is nothing special about multiple "versions" in releases, as far
+as data assets uploaded by `piggyback` are concerned. The data files `piggyback`
+attaches to a Release can be deleted or modified at any time -- creating a new
+release to store data assets is the functional equivalent of just creating new
+directories `v0.1`, `v0.2` to store your data. (GitHub Releases are always pinned
+to a particular `git` tag, so the code/git-managed contents associated with the repo
+are more immutable, but remember our data assets just piggyback on top of the repo.)
+
+Permanent, published data should always be archived in a proper data repository
+with a DOI, such as [zenodo.org](https://zenodo.org). Zenodo can freely archive
+public research data files up to 50 GB in size, and data is strictly versioned
+(once released, a DOI always refers to the same version of the data; new releases
+are given new DOIs). `piggyback` is meant only to lower the friction of working
+with data during the research process (e.g. providing data to collaborators or
+continuous integration systems during the research process, including for private
+repositories).
+
+## What will GitHub think of this?
+
+[GitHub documentation](https://docs.github.com/en/github/managing-large-files/distributing-large-binaries) at the time of writing endorses the use of attachments to releases as a
+solution for distributing large files as part of your project:
+
+![screenshot of GitHub docs linked above](https://github.com/ropensci/piggyback/raw/83776863b34bb1c9962154608a5af41867a0622f/man/figures/github-policy.png)