Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text-based serializations / data exchange formats for fables? #353

Closed
cboettig opened this issue Nov 5, 2021 · 9 comments
Closed

text-based serializations / data exchange formats for fables? #353

cboettig opened this issue Nov 5, 2021 · 9 comments

Comments

@cboettig
Copy link

cboettig commented Nov 5, 2021

Hi @robjhyndman and fable team,

I have been using and teaching your Tidyverts approach to forecasting and wanted to say thanks for providing this amazingly well thought-out and implemented resource. In particular, I very much appreciate the support for distributional and grouped forecasts; both very important features for us that are rarely handled well in alternative approaches.

fable makes very clever use of list-columns to provide a powerful abstraction of distribution. As long-time tidyverse user I completely appreciate this approach, but I did want to get your opinion on how a fable ought to be represented in a generic file-based format that would be most compatible with other tools and languages? Providing file-based serializations may make it easier to share and distribute forecasts, including generic forecasting competitions. (I am currently part of a team of folks hosting an "ecological forecasting challenge" described in https://projects.ecoforecast.org/neon4cast-docs/).

The standard strategy of text-based files like csv (and some related standard serializations like parquet, netcdf, hdf5) seem a natural choice (happy to leave out grib, even if it is the standard way NOAA distributes forecast ensembles), but does not play nicely with list-columns. In our current iteration of the challenge, we have relied on participants submitting text-based serializations of distributional forecasts in which uncertainty is described either as mean + standard-deviation columns or as a set of ensemble draws. This is obviously not ideal -- one would like to express non-normal distributions without resorting to drawing an ensemble of samples from the distribution. Alternately one could define a convention such as fable distribution print method to summarize a distribution as parable text, but that may be too fragile or unreliable.

Given your experience with fable, I just wanted to see if you had any recommendations for file-based serializations of fables. In my ideal word, we would have methods like read_fable() / write_fable() to serialize and parse files into and out of the fable format (which would also allow users to more easily leverage fable for comparing across forecasts, etc). Do you see a good way to go about this? Would you consider such functionality as being in scope for fable?

Thanks for considering!

  • Carl
@mitchelloharawild
Copy link
Member

I have been using and teaching your Tidyverts approach to forecasting and wanted to say thanks for providing this amazingly well thought-out and implemented resource. In particular, I very much appreciate the support for distributional and grouped forecasts; both very important features for us that are rarely handled well in alternative approaches.

😍 Glad the interface is working well for you!

fable makes very clever use of list-columns to provide a powerful abstraction of distribution. As long-time tidyverse user I completely appreciate this approach, but I did want to get your opinion on how a fable ought to be represented in a generic file-based format that would be most compatible with other tools and languages? Providing file-based serializations may make it easier to share and distribute forecasts, including generic forecasting competitions. (I am currently part of a team of folks hosting an "ecological forecasting challenge" described in https://projects.ecoforecast.org/neon4cast-docs/).

The standard strategy of text-based files like csv (and some related standard serializations like parquet, netcdf, hdf5) seem a natural choice (happy to leave out grib, even if it is the standard way NOAA distributes forecast ensembles), but does not play nicely with list-columns. In our current iteration of the challenge, we have relied on participants submitting text-based serializations of distributional forecasts in which uncertainty is described either as mean + standard-deviation columns or as a set of ensemble draws. This is obviously not ideal -- one would like to express non-normal distributions without resorting to drawing an ensemble of samples from the distribution. Alternately one could define a convention such as fable distribution print method to summarize a distribution as parable text, but that may be too fragile or unreliable.

This is something really important that I haven't added to {distributional} yet - reversible cast methods between distributions and characters. In the short term it is possible to extract the parameters of a distribution using the parameters() function, and then use these parameters to reconstruct the distribution.

library(fable)
#> Loading required package: fabletools
library(distributional)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
x <- fable(
  date = Sys.Date() + 1:10,
  dist = dist_normal(1:10),
  index = date,
  response = "test",
  distribution = dist
) %>% 
  mutate(parameters(dist))
#> Warning: The dimnames of the fable's distribution are missing and have been set
#> to match the response variables.

x
#> # A fable: 10 x 4 [1D]
#>    date           dist    mu sigma
#>    <date>       <dist> <dbl> <dbl>
#>  1 2021-11-07  N(1, 1)     1     1
#>  2 2021-11-08  N(2, 1)     2     1
#>  3 2021-11-09  N(3, 1)     3     1
#>  4 2021-11-10  N(4, 1)     4     1
#>  5 2021-11-11  N(5, 1)     5     1
#>  6 2021-11-12  N(6, 1)     6     1
#>  7 2021-11-13  N(7, 1)     7     1
#>  8 2021-11-14  N(8, 1)     8     1
#>  9 2021-11-15  N(9, 1)     9     1
#> 10 2021-11-16 N(10, 1)    10     1
  
library(readr)
file <- tempfile(fileext = "csv")
write_csv(x, file)
read_csv(file) %>% 
  mutate(dist = dist_normal(mu, sigma)) %>% 
  as_fable(index = date, response = "test", distribution = dist)
#> Rows: 10 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): dist
#> dbl  (2): mu, sigma
#> date (1): date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Warning: The dimnames of the fable's distribution are missing and have been set
#> to match the response variables.
#> # A fable: 10 x 4 [1D]
#>    date           dist    mu sigma
#>    <date>       <dist> <dbl> <dbl>
#>  1 2021-11-07  N(1, 1)     1     1
#>  2 2021-11-08  N(2, 1)     2     1
#>  3 2021-11-09  N(3, 1)     3     1
#>  4 2021-11-10  N(4, 1)     4     1
#>  5 2021-11-11  N(5, 1)     5     1
#>  6 2021-11-12  N(6, 1)     6     1
#>  7 2021-11-13  N(7, 1)     7     1
#>  8 2021-11-14  N(8, 1)     8     1
#>  9 2021-11-15  N(9, 1)     9     1
#> 10 2021-11-16 N(10, 1)    10     1

Created on 2021-11-06 by the reprex package (v2.0.0)

Having as_distribution(<character>) to directly convert N(a,b) to dist_normal(a, sqrt(b)) would be neat, but difficult to implement. I think that the as.character(<distribution>) method needs to be updated to be more computer parsable, and give something like <dist_type>(!!!<dist_pars>) (for example, dist_normal(mu=0,sd=1)). Any thoughts and suggestions here are appreciated!

Given your experience with fable, I just wanted to see if you had any recommendations for file-based serializations of fables. In my ideal word, we would have methods like read_fable() / write_fable() to serialize and parse files into and out of the fable format (which would also allow users to more easily leverage fable for comparing across forecasts, etc). Do you see a good way to go about this? Would you consider such functionality as being in scope for fable?

I don't think we need specific read/write functionality for fable, but rather better support for reading/writing distributions. If you want to read/write a fable object specifically, I would use suggest using rds.

@mitchelloharawild
Copy link
Member

That said, writing some distributions to character can be extra tricky when they may depend on user environments.
For example, dist_transformed() uses transformation functions which may not exist in the user's environment, and dist_wrap() searches for p/d/q/r style functions in a specified environment.
There are also distribution modifiers which take one or more distribution and tweaks them to produce a new one (for inflation, truncation, mixtures, etc.)

@cboettig
Copy link
Author

cboettig commented Nov 5, 2021

@mitchelloharawild thanks for this! Yes, definitely agree these issues are specific to distribution and not fable more generally; happy to move the discussion over there if it helps.

The parameters() method is really nice (seems to be new?) thanks for showing me that. As a first pass at any rate, I would imagine similar methods to extract the other components of the distribution; i.e. dist and transform, as you mention. Not sure if there are already methods to do that, I found it quite cumbersome to do class(vctrs::vec_data(y)[[1]])[[1]] on a distribution object, e.g.:

library(fable)
#> Loading required package: fabletools
library(distributional)
library(dplyr, quietly = TRUE)

x <- fable(
  date = Sys.Date() + 1:10,
  dist = dist_normal(1:10),
  index = date,
  response = "test",
  distribution = dist
) %>% 
  mutate(parameters(dist))
#> Warning: The dimnames of the fable's distribution are missing and have been set
#> to match the response variables.

## probably a better way to do this??
dist_class <- function(y) class(vctrs::vec_data(y)[[1]])[[1]]

x %>% mutate(dist_class = dist_class(dist))
#> # A fable: 10 x 5 [1D]
#>    date           dist    mu sigma dist_class 
#>    <date>       <dist> <dbl> <dbl> <chr>      
#>  1 2021-11-06  N(1, 1)     1     1 dist_normal
#>  2 2021-11-07  N(2, 1)     2     1 dist_normal
#>  3 2021-11-08  N(3, 1)     3     1 dist_normal
#>  4 2021-11-09  N(4, 1)     4     1 dist_normal
#>  5 2021-11-10  N(5, 1)     5     1 dist_normal
#>  6 2021-11-11  N(6, 1)     6     1 dist_normal
#>  7 2021-11-12  N(7, 1)     7     1 dist_normal
#>  8 2021-11-13  N(8, 1)     8     1 dist_normal
#>  9 2021-11-14  N(9, 1)     9     1 dist_normal
#> 10 2021-11-15 N(10, 1)    10     1 dist_normal

Created on 2021-11-05 by the reprex package (v2.0.1)

Like you say, we'd want to do something similar for transformed. Also it's still not clear to me if that would be sufficient, sounds like due to dist_wrap, knowing dist_normal and the transformation may not be sufficient?

Even so, merely having parameters and distribution names is helpful to me in terms of serializing files that are more compatible with other formats, even if we cannot fully automate going back to the distribution object.

Just a note that rds is fine for other fable users, but is not a data exchange format, and not really an archival quality format either. Another research would have good odds of interpreting the data frame above correctly with the addition of the mu, sigma & dist_class columns even if they are working in a language like python; it's obviously much more fragile to assume the distribution s3 definitions are always available and unchanged.

@mitchelloharawild
Copy link
Member

The parameters() method is really nice (seems to be new?) thanks for showing me that.

Yes, <1 month new. Due for the next release soon.

As a first pass at any rate, I would imagine similar methods to extract the other components of the distribution; i.e. dist and transform, as you mention.

Yes for distribution names. Probably no for transformations - I would consider the transformation functions as parameters of a transformed distribution. That said, data exchange of R functions is problematic - so the parameters(<dist_transformed>) method may need some way of inverting the function's value to get its name. I don't know a good solution for this.

Like you say, we'd want to do something similar for transformed. Also it's still not clear to me if that would be sufficient, sounds like due to dist_wrap, knowing dist_normal and the transformation may not be sufficient?

The distribution suffix ('norm' in '(p/d/q/r)norm) is a parameter of the 'wrapped distribution'. So identifying the specific distribution here is actually easier.

library(distributional)
dist <- dist_wrap("norm", mean = 1:10, sd = 1)
dist
#> <distribution[10]>
#>  [1] norm(1, 1)  norm(2, 1)  norm(3, 1)  norm(4, 1)  norm(5, 1)  norm(6, 1) 
#>  [7] norm(7, 1)  norm(8, 1)  norm(9, 1)  norm(10, 1)
parameters(dist)
#>    dist mean sd
#> 1  norm    1  1
#> 2  norm    2  1
#> 3  norm    3  1
#> 4  norm    4  1
#> 5  norm    5  1
#> 6  norm    6  1
#> 7  norm    7  1
#> 8  norm    8  1
#> 9  norm    9  1
#> 10 norm   10  1

Created on 2021-11-06 by the reprex package (v2.0.0)

(requires latest dev of distributional, I added a custom method to remove the dist_wrap env to make it work)

Even so, merely having parameters and distribution names is helpful to me in terms of serializing files that are more compatible with other formats, even if we cannot fully automate going back to the distribution object.

Yes, also useful for tidyverts/fabletools#333 where you may want to run different code based on the distribution name.

Just a note that rds is fine for other fable users, but is not a data exchange format, and not really an archival quality format either. Another research would have good odds of interpreting the data frame above correctly with the addition of the mu, sigma & dist_class columns even if they are working in a language like python; it's obviously much more fragile to assume the distribution s3 definitions are always available and unchanged.

Completely agree. I suggest rds in lieu of read_fable()/write_fable() as these outputs seemed specific to fable. If wanting to work with the outputs in a different language (say Python), I think it is more appropriate to use write_csv() and improve the distribution character casting method to not be lossy.

@mitchelloharawild
Copy link
Member

Any suggested function names for obtaining the name of a distribution? Comparable to parameters(), but obviously can't be name(). I also don't like dist_name() as the dist_*() functions are used to create distributions.

@cboettig
Copy link
Author

cboettig commented Nov 6, 2021

@mitchelloharawild good question, I agree dist_name isn't right. I believe our named parametric shapes are often referred to as "family" of distribution? (a la column 2 of table 1 in https://arxiv.org/pdf/1709.04743.pdf)

mitchelloharawild added a commit to mitchelloharawild/distributional that referenced this issue Nov 6, 2021
@mitchelloharawild
Copy link
Member

I had also considered family() but was concerned about confusion with related topics like the 'exponential family'. However stats::family(object, ...) is used to identify specific distribution names, so it seems like an appropriate generic to add methods for distributions.

Here's a quick go at adding a family(<distribution>) method. The default method drops the dist_ prefix from the primary class. Do you think this is a reasonable return value to encompass a distribution's family?

library(distributional)
dist <- c(
  dist_normal(1:2),
  dist_poisson(3),
  dist_multinomial(size = c(4, 3),
                   prob = list(c(0.3, 0.5, 0.2), c(0.1, 0.5, 0.4)))
)
family(dist)
#> [1] "normal"      "normal"      "poisson"     "multinomial" "multinomial"

Created on 2021-11-06 by the reprex package (v2.0.0)

@cboettig
Copy link
Author

cboettig commented Nov 6, 2021

I like this! I think it would cover most cases adequately. I wonder if parameters() will actually be the more difficult part, if conventions around parameter names for some distributions aren't quite as universal as the names of the distribution families...

@mitchelloharawild
Copy link
Member

Closing as serialisation of fables is mostly a question of the vctrs classes they contain.
The format() method of time vectors from {tsibble} should be sufficient to reconstruct the time class from text, distributions from {distributional} now provide additional methods to access the distribution type and parameters for export/import to/from csv.
Further improvements and suggestions for this would be better suited to the packages providing the vctrs objects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants