improve genericity of statistical features #6

Open
nicolas-vtg opened this issue Jan 26, 2022 · 1 comment

Comments

@nicolas-vtg

In app/preprocess.py, the function used to compute statistical features (average number of individuals) needs to be improved (see the TODO at line 200).
To keep those calculations consistent, some data is filtered out; however, the filter is a hardcoded list of school years.
This filter prevents the statistical features from being computed with the target values when the repository is run on past data.
It therefore needs to be generalized.

One option to investigate is to derive this filter from the period that is currently targeted.
For instance, if the user is predicting for school year 2021-2022, then data from '2021-2022' should be removed, but prior data ('2018-2019', '2019-2020', '2020-2021') can be kept since it represents the past.

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2021-2022")]
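As a rough sketch of how such a list could be derived from the targeted year: the helper name `periods_to_exclude` and its arguments are hypothetical, not part of the repository. It also assumes that years later than the target should be dropped, and that labels follow the "YYYY-YYYY" pattern so that string comparison matches chronological order.

```python
def periods_to_exclude(known_years, target_year):
    """Return the target school year and any later one, in sorted order.

    Hypothetical helper. Assumes labels follow the "YYYY-YYYY" pattern,
    so that comparing them as strings matches chronological order.
    """
    return sorted(year for year in set(known_years) if year >= target_year)


known_years = ["2018-2019", "2019-2020", "2020-2021", "2021-2022"]

# Predicting for 2021-2022: only that year is excluded.
print(periods_to_exclude(known_years, "2021-2022"))

# Replaying the model on 2019-2020: that year and everything after it go.
print(periods_to_exclude(known_years, "2019-2020"))
```

The returned list could then be passed to `.isin(...)` on the "annee_scolaire" column instead of comparing against a single hardcoded year.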

A fix would be to define the list of periods to exclude from the parameters used to call the app, so that it stays consistent for both use cases:

  • using the model on past values to analyse its performance
  • using the model with future data

    def add_statistical_features(all_data, list_of_period_to_exclude):
        """
        compute statistical features using ratio, means etc
        """
        # TODO improve filtering here and remove NANs
        # keep only rows whose school year is NOT in the exclusion list
        remove_real_lines = all_data[~all_data["annee_scolaire"].isin(list_of_period_to_exclude)]

where list_of_period_to_exclude is a list of school years that has been computed before this function is called. (A range of dates could be a nice evolution too, since it would allow recent data to be used to update those features.)
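To make the idea concrete, here is a minimal, self-contained sketch of a parameterised version. It is not the repository's implementation: the columns "site" and "reel", the feature name "reel_mean" and the sample values are assumptions made up for illustration; only "annee_scolaire" comes from the code above.

```python
import pandas as pd


def add_statistical_features(all_data, list_of_period_to_exclude):
    """Sketch: per-site average computed only from non-excluded school years."""
    # Rows from excluded periods must not contribute to the statistics.
    history = all_data[~all_data["annee_scolaire"].isin(list_of_period_to_exclude)]
    # "site" and "reel" are hypothetical column names.
    means = history.groupby("site")["reel"].mean().rename("reel_mean")
    # Left join on "site": every row is kept, including the excluded periods.
    return all_data.join(means, on="site")


# Made-up data, for illustration only.
all_data = pd.DataFrame(
    {
        "annee_scolaire": ["2019-2020", "2020-2021", "2021-2022"] * 2,
        "site": ["A", "A", "A", "B", "B", "B"],
        "reel": [100, 120, 50, 200, 220, 90],
    }
)

result = add_statistical_features(all_data, ["2021-2022"])
print(result)
```

With this shape, the rows for the targeted year still receive a feature value, but that value is built only from earlier years, which is the property the filter is meant to guarantee.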

@fBedecarrats
Contributor

The model produces reliable forecasts until the end of school year 2020-2021 (~16 000 meals/day), but it starts generating very low values (~7000 meals/day) from Sept. 2021 (see detailed outputs for Sept.-Dec. 2021).
The initial function had the following at line 201:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2019-2020") & (all_data["annee_scolaire"] != "2018-2019")]

I replaced this line with:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2020-2021")]

and re-launched the model for the same Sept.-Dec. 2021 period. The model outputs changed only slightly (see detailed results after code modification).
I'm afraid this is not the source of the prediction errors since Sept. 2021.
