Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: check_enough_train_data #283

Merged
merged 18 commits into from
Jan 22, 2024
Merged

feat: check_enough_train_data #283

merged 18 commits into from
Jan 22, 2024

Conversation

dshemetov
Copy link
Contributor

Attempt at fixing #106.

It seems to work well (see the tests) for both geo pooled and non geo pooled (more generally key col pooled) cases, but I still don't know how to keep this from running at test time, where the number of data points available will be way different and may cause problems.

@dajmcdon any suggestions?

@dshemetov dshemetov requested a review from dajmcdon as a code owner January 18, 2024 22:37
@dsweber2 dsweber2 linked an issue Jan 19, 2024 that may be closed by this pull request
@dshemetov dshemetov changed the title wip: check_enough_train_data feat: check_enough_train_data Jan 19, 2024
Copy link
Contributor

@dsweber2 dsweber2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm, couple of inlines and a couple of commits you can take or leave (slightly too much to be easy to edit in github suggestions)

#' when [prep()] is run, some operations may not be able to be
#' conducted on new data (e.g. processing the outcome variable(s)).
#' Care should be taken when using `skip = TRUE` as it may affect
#' the computations for subsequent operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if skip=TRUE by default would solve the issue about running during fit vs predict? looks like you have a test demonstrating it does do that! So we definitely have a functional check for training data, if not test data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup! I'd like to handle test data checking next, unclear if that will be possible.

Copy link
Contributor

@dajmcdon dajmcdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is failing checks because purrr isn't in the namespace (though map() is available internally).

Also a few comments below. The simplification for counting is perhaps important (at least easier).

I'm requesting changes only because of the possible bug and the missing NAMESPACE.

@dshemetov dshemetov requested a review from dajmcdon January 20, 2024 04:11
@dajmcdon
Copy link
Contributor

@dshemetov One more thing before merging: can you add the check to the arx_classifier() and the smooth forecaster?

@dshemetov
Copy link
Contributor Author

Sure, let me take a look at how much work that will be.

@dshemetov dshemetov merged commit 6ddffb2 into main Jan 22, 2024
2 checks passed
@dshemetov dshemetov deleted the ds/check_enough_train_data branch January 22, 2024 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handling insufficient training data.
3 participants