
holdout function to specify length using common language #180

Open
davidtedfordholt opened this issue Mar 19, 2020 · 16 comments

Comments

@davidtedfordholt (Contributor)

It would be nice to specify a holdout period without having to use filter(). It gets tricky when different series in a tsibble have different dates for their final observation. I suggest a function that takes either a length of holdout (e.g. "14 days") or a number of observations, as well as an optional start_date (or a number representing how many observations from the end would be the first observation in the holdout, to deal with the problem of differing end dates).

aus_retail %>%
    holdout(length = "6 months") %>%
    model(snaive = SNAIVE(Turnover)) %>%
    forecast() %>%
    autoplot(aus_retail)

This would be a step in the direction of cross-validation, especially if the length of the holdout were passed into the models so that forecast() could automatically forecast over the holdout period. The above would work relatively simply, and CV would, in a sense, just map that over a list of cutoff_dates (or inverse-ordered observation numbers) with the same holdout period, leading to a CV fable that adds a column for the cutoff date.

The cross-validation would be relatively simple, as:

aus_retail %>%
    CV(cv_length = "2 years", cv_horizon = "6 months") %>%
    model(snaive = SNAIVE(Turnover))

can generate a nested table or the like (converting cv_length and cv_horizon to numbers of observations in the frequency of the tsibble), with rows corresponding to the keys and the cutoff_dates per key, generated as simply as:

library(dplyr)

# (cv_length and cv_horizon here are counts of observations)
aus_retail %>%
    as_tibble() %>%
    group_by(State, Industry) %>%
    summarise(
        cutoff_dates = list(
            Month[which(
                Month >= nth(Month, -(cv_length + cv_horizon)) &
                Month < nth(Month, -cv_horizon)
            )])
    )

From there, for a given key, you are just slicing the tsibble off at each cutoff_date in the list, modeling and forecasting, then binding the fable results together with a column for the cutoff as an additional index.
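
A minimal sketch of that loop for a single key, assuming cutoff_dates is the vector of cutoffs computed as in the previous chunk (the as_tibble() call is just to let rows from different cutoffs bind without tripping tsibble's unique key/index check):

library(fable)
library(dplyr)
library(purrr)

# Refit and forecast at each cutoff, tagging each set of forecasts with
# the cutoff that produced it before binding everything together.
cv_fc <- map_dfr(cutoff_dates, function(cutoff) {
    aus_retail %>%
        filter(Month <= cutoff) %>%
        model(snaive = SNAIVE(Turnover)) %>%
        forecast(h = "6 months") %>%
        as_tibble() %>%
        mutate(cutoff_date = cutoff)
})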

@davidtedfordholt (Contributor, Author)

I'm sorry to kind of step on the toes of #177. I wasn't sure if I should reply there, since they are asking about a current method and I'm talking about a new function that might eventually head in that direction.

@mitchelloharawild (Member)

I like this suggested interface; it is definitely simpler to understand.
Do you have any opinions on the data structure returned by CV()? Would this be the same structure as stretch_tsibble() or something else?

@davidtedfordholt (Contributor, Author) commented Mar 20, 2020

My fear with stretch_tsibble() is that it produces a painfully massive object if the initial tsibble is sizeable. Maybe the primary object is a list holding a tsibble of the keys from the original object together with the cutoff dates to be mapped over, alongside the original tsibble itself. That way you avoid both the very long object (from stretch_tsibble()) and a nested tibble full of unnecessary truncated copies of the data.

I'm not sure there's any reason you would use CV() by itself, without it encapsulating the model() and forecast() calls. You've already handed it what forecast() needs, so all that's left are the model specifications. Maybe CV() is specified with a length and maximum horizon, as well as either an mbl or model specifications, the same as in model().

aus_retail %>%
    CV(
        length = "2 years", 
        max_horizon = "6 months",
        snaive = SNAIVE(Turnover)
    )

or

aus_retail %>%
    model(snaive = SNAIVE(Turnover)) %>%
    CV(length = "2 years", max_horizon = "6 months")

I think the second is a pretty decent way to go about it, but I realize that the models generated in the model() call are useless for this, as they are fit on all of the data. CV() would take each of the models and simply refit them repeatedly on different slices of the tsibble, collecting the fbl objects to bind together at the end.

@davidtedfordholt (Contributor, Author)

In practice, if the models fit during the model() call in the latter interface take a while, CV() is still going to take far longer, since it refits those models repeatedly, so I don't know that it's really much of a penalty to "waste" the time of a first fitting.

@mitchelloharawild (Member)

Not expanding the tsibble structure is something @earowang and I have discussed before for stretch_tsibble() (and there are implementations of this). The primary issue with this approach is that it doesn't neatly preserve mappings from tsibble -> mable. That is to say, currently the mable (being rectangular) preserves the same key combinations/structure as the tsibble. While the folds of a tsibble can be easily computed from some CV metadata, the models (and subsequent methods like forecast and residuals) for each fold cannot.

Thinking about this again now, a possible solution for the mable issue is to create a 'folded model', which contains a model for each fold in the input data to match the input tsibble's key structure. The issue remains when you use model methods that return a tsibble/fable, as the forecasts/residuals/etc are now unique for each fold.

Any thoughts on this would be very useful, I don't think CV is in a great spot right now and improvements would be welcomed.

@davidtedfordholt (Contributor, Author)

The goal of cross-validation is to understand the pattern of residuals across serial holdouts, yes? So we don't really care about the fitted models, just the forecasts and maybe the fitted values for each model; those are unique by fold and by model. A fable augmented with a column denoting the fold (by last observation index, etc.) would contain all of the information we need. This could also be managed with a mable-like object that holds fables for each model in corresponding columns, with rows for keys.

Both would allow us to perform the tasks needed:

  • evaluate accuracy (method for accuracy())
  • graph forecasts
  • graph rolling accuracy

I feel like the extended fable is more parsimonious than the mable-of-fables concept, personally.

@davidtedfordholt (Contributor, Author)

For a holdout() function, you could simply add a logical column .holdout that essentially masks those values from model() and forecast(). The values would pass through, meaning they would be available to accuracy(). The output of model() would be the same, with an extra column inside each model's data tsibble. The output of forecast() would be a fable with an additional column for the actual value.
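
As a rough illustration, the flagging step of such a hypothetical holdout() could be little more than a per-key mutate(), which also sidesteps the differing-end-dates problem from the top of the thread (the function name, n_holdout argument, and .holdout column are all part of the proposal, not an existing API; translating "6 months" into a count of observations is omitted here):

library(dplyr)
library(tsibble)

# Hypothetical: flag the last n_holdout observations of each series.
holdout <- function(.data, n_holdout) {
    .data %>%
        group_by_key() %>%
        mutate(.holdout = row_number() > n() - n_holdout) %>%
        ungroup()
}

aus_retail %>% holdout(n_holdout = 6)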

For a cv() function, you could create logical columns for each fold (e.g. .fold_12) that do the same thing, passing them through to model() %>% forecast() with each fold column used as the .holdout column for that pass. That skips the need to give users a way to interact with the multiple mables and fables that exist in the middle. I would almost prefer a function like holdout() that sets up the cross-validation, and then cv(), which runs the model() %>% forecast() %>% bind_cv_fables() portion:

aus_retail %>%
    create_folds(length = "2 years", max_horizon = "6 months") %>%
    cross_validate(snaive = SNAIVE(Turnover))

@davidtedfordholt (Contributor, Author)

One slight issue with storing the fable objects for each fold, once things scale up, is that the distributions get pretty sizeable. One way to deal with that would be a percentile.fc_dist function: holding the percentile of the actual value within the forecast distribution allows coverage statistics to be calculated for a given horizon, without requiring the whole distribution to be passed forward.
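
Something along these lines, sketched with the distributional package's cdf() generic rather than the proposed percentile.fc_dist (which doesn't exist); the fc$Turnover and fc$.actual columns are assumed to hold the forecast distributions and realised values:

library(distributional)
library(purrr)

# Probability integral transform: the percentile of each actual value
# under its corresponding forecast distribution.
pit <- map2_dbl(fc$Turnover, fc$.actual, ~ cdf(.x, q = .y))

# Empirical coverage of a nominal 90% interval (by horizon, if grouped):
mean(pit >= 0.05 & pit <= 0.95)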

@mitchelloharawild (Member)

Distribution object sizes should be fixed by https://github.com/mitchelloharawild/distributional, which I am currently converting the fabletools code to use.

@davidtedfordholt (Contributor, Author)

You're a king, @mitchelloharawild.

@davidtedfordholt (Contributor, Author)

This is another spot where the iteration operator would be helpful, though here it would operate over model specifications and the dates from which forecasts are made, with a function that first divides the dataset and then forecasts: essentially a forecast_as_of(date, data, model_formula). Instead of nesting calls to future.apply, it would be far more stable to consider an implementation where everything goes through pivot_longer() and then future_mapply(). It just might be a very long table.
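
A bare-bones sketch of that hypothetical forecast_as_of(), hard-coding a monthly index named Month as in aus_retail (the name and signature come from the suggestion above, not from an existing function):

library(fable)
library(tsibble)
library(dplyr)

# Hypothetical: fit on data up to `date`, then forecast the horizon beyond it.
forecast_as_of <- function(date, data, model_formula, h = "6 months") {
    data %>%
        filter(Month <= date) %>%
        model(fit = model_formula) %>%
        forecast(h = h)
}

forecast_as_of(yearmonth("2017 Dec"), aus_retail, SNAIVE(Turnover))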

@mitchelloharawild (Member)

By this do you mean an alternative to stretch_tsibble() (and the like) which computes the cross-validation folds as they are required? This is something we thought about a lot last year, and ultimately we decided that a data-modifier class (which handles aggregation and time slicing) would work well. The parallelisation would still come from model(), but the input data would be stored in a more efficient way.

There are a couple of reasons why this hasn't been done yet (but hopefully will be later):

  1. If the data consumes too much space, then the models will consume more (the model objects store fitted values, residuals and other values that are at least the length of the data; there needs to be an option to trim these to make for tiny model objects).
  2. Having the data available upfront is intuitive for users, as it allows you to filter out aggregations and cross-validation folds of interest. Having another data class that stores the operations until they're needed would be better computationally, but more complex for end-users (and so both options should exist alongside each other, for use where each is needed most).

@davidtedfordholt (Contributor, Author)

To address 1., I don't think it's actually useful to hold onto the model object in CV. If the purpose of CV is to determine accuracy on successive holdouts, you can keep only the forecast distribution (or even just the mean, depending on which accuracy metrics you're asking for), eventually comparing those to the actual values.

I'm definitely on board with the idea of not having the data actually replicated. In the above, the data argument could be the first argument, and it could just be the tsibble of input data, which the function then splits up at the moment of modeling. The thing you're really iterating over is the set of dates, which have to be specified somehow (we already seem to have a grasp on that part).

The reason I worry about letting model() and forecast() deal with all the parallelization is the (perhaps wrong) intuition that most people will be running CV across a limited number of models but a larger number of dates. In those cases it seems like they're parallelizing across the variable that will leave the most cores on the table. (I must also admit that I have only inadvertently used a future.apply function inside another future.apply function, and it caused a bit of an issue, but I didn't fully investigate whether that was just how I'd implemented it rather than it being unstable to calculate futures using futures.)

@mitchelloharawild (Member)

> To address 1., I don't think it's actually useful to hold onto the model object in CV. If the purpose of CV is to determine accuracy on successive holdouts, you can keep only the forecast distribution (or even just the mean, depending on which accuracy metrics you're asking for), eventually comparing those to the actual values.

I agree that the most common task for CV-style folding over time is evaluating forecast accuracy, and that there should be a more efficient and simple workflow for that task. However, I also think that more value can be extracted from the model objects; for example, you could look at how the parameters vary over time to evaluate how often the model should be re-estimated.

> The reason I worry about letting model() and forecast() deal with all the parallelization is the (perhaps wrong) intuition that most people will be running CV across a limited number of models but a larger number of dates. In those cases it seems like they're parallelizing across the variable that will leave the most cores on the table. (I must also admit that I have only inadvertently used a future.apply function inside another future.apply function, and it caused a bit of an issue, but I didn't fully investigate whether that was just how I'd implemented it rather than it being unstable to calculate futures using futures.)

model() should be parallelising across both series (including CV folds) and models. So the parallel performance for n series and m models should be the same as for m series and n models (assuming data length and model complexity are comparable).

I'm interested in your experience of nesting future.apply() within another parallel process. Do you know if this causes issues or does it schedule tasks in a queue for parallel evaluation?

@davidtedfordholt (Contributor, Author)

I'll run an experiment tomorrow on the nested future.apply bit and let you know.

It seems like the key to reducing the overall memory footprint of a CV process might be to determine beforehand what you are asking for in the output. That can be frustrating, since I've definitely rerun CVs multiple times because I remembered something extra I wanted out of them. Still, I think defaulting to a fable with an added .fold_date key and a .resid column probably fulfills most use cases. Returning a more complex object, whether a full mable or just an extract of one, is straightforward enough as a special case. I don't think most people are going to want to interrogate that mable directly, though, so it may be that, in providing methods to store and retrieve useful information, we can also determine the most lightweight way to do so for each type of information.

One option is to have the output be a cvable, pivoted longer with the .fold_date, .model and other keys, and then predefine whether the output keeps a .mbl column, a .fc column, or other columns that are subsets thereof. If you wind up with an impossible object, it's because you asked for too much, not because it was thrust on you.

@jeffzi commented Sep 23, 2020

> Do you know if this causes issues or does it schedule tasks in a queue for parallel evaluation?

By default, nested futures are evaluated sequentially to protect the user from overwhelming the machine. The documentation details the reasons in the section Nested Futures and Evaluation Topologies.
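
For reference, future's plan() accepts a list describing the evaluation topology, so the outer level can be parallel while the inner level stays sequential, or both levels can be given an explicit worker budget:

library(future)

# Outer level: 4 parallel workers; nested futures fall back to sequential.
plan(list(tweak(multisession, workers = 4), sequential))

# Or split the budget across both levels (2 x 2 = 4 processes total):
# plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 2)))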

As @davidtedfordholt said, it's more efficient to parallelize on the highest dimension (model or series). In the context of CV, especially with a stretching window, you very often end up with many more folds than models. Parallelizing each model is inefficient if you have long series because too much time is spent on transferring the exact same data for each model.

> One option is to have the output be a cvable, pivoted longer with the .fold_date, .model and other keys, and then predefine whether the output keeps a .mbl column, a .fc column, or other columns that are subsets thereof.

In the same vein, a solution could be separate functions, or an argument, to return fitted models along with the forecasts. Keeping the models can indeed be useful, but they dramatically increase the memory footprint and also slow down the CV, because more data has to be returned to the main thread. It would be nice if users could choose to pay that price depending on what they intend to do.
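
For instance, building on the create_folds()/cross_validate() sketch earlier in the thread, a hypothetical opt-in flag (keep_models is illustrative, not an existing argument):

# Hypothetical: return forecasts only by default; pay for models explicitly.
aus_retail %>%
    create_folds(length = "2 years", max_horizon = "6 months") %>%
    cross_validate(snaive = SNAIVE(Turnover), keep_models = FALSE)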
