improve genericity of statistical features #6

nicolas-vtg · 2022-01-26T17:09:38Z

In app/preprocess.py, the function that computes statistical features (average number of individuals) needs to be improved (see the TODO at line 200).
To keep these calculations consistent, some data is filtered out, but the filter is a hardcoded list of school years.
The filter prevents the statistical features from being computed with the target values when the repository is used on past data.
It needs to be generalized.

To investigate, one could update this filter based on the data that is currently targeted.
For instance, if the user is predicting for school year 2021-2022, then data from '2021-2022' should be removed, but prior data can be kept ('2018-2019', '2019-2020', '2020-2021') since they represent the past.

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2021-2022")]
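As a rough sketch of how that list could be derived instead of hardcoded: the helpers `school_year_of` and `periods_to_exclude` below are hypothetical names, not part of the repository, and the sketch assumes that a school year runs from September to August and that `annee_scolaire` always holds 'YYYY-YYYY' labels.

```python
import pandas as pd


def school_year_of(day):
    """Return the school year label ('YYYY-YYYY') that contains `day`.

    Assumption: a school year runs from September to August.
    """
    day = pd.Timestamp(day)
    start = day.year if day.month >= 9 else day.year - 1
    return f"{start}-{start + 1}"


def periods_to_exclude(all_data, first_predicted_day):
    """List the school years that must not feed the statistical features.

    The targeted school year and any later one are excluded; earlier
    years are kept because they are already in the past.
    """
    target = school_year_of(first_predicted_day)
    years = all_data["annee_scolaire"].dropna().unique()
    # 'YYYY-YYYY' labels sort chronologically as plain strings.
    return sorted(year for year in years if year >= target)


all_data = pd.DataFrame(
    {"annee_scolaire": ["2018-2019", "2019-2020", "2020-2021", "2021-2022"]}
)
excluded = periods_to_exclude(all_data, "2021-09-02")
remove_real_lines = all_data[~all_data["annee_scolaire"].isin(excluded)]
```

Excluding later years as well as the targeted one matters when the model is replayed on past data: predicting for autumn 2019 should not use 2020-2021 values either.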

A fix would be to define the list of data to exclude from the parameters used to call the app, so that it stays consistent for both use cases:

- using the model on past values to analyse its performance
- using the model on future data

    def add_statistical_features(all_data, list_of_period_to_exclude):
        """
        compute statistical features using ratio, means etc
        """
        # TODO improve filtering here and remove NANs
        remove_real_lines = all_data[~all_data["annee_scolaire"].isin(list_of_period_to_exclude)]

where list_of_period_to_exclude is a list of school years computed before this function is called. A range of dates could be a nice evolution too, since it would allow recent data to be used to update those features.
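To make the idea concrete, here is a minimal sketch of what the rest of such a function might look like. Only `annee_scolaire` comes from the code above; the `site` and `frequentation` columns and the `frequentation_moyenne` feature are hypothetical stand-ins for the real schema, and the actual features in app/preprocess.py may be computed differently.

```python
import pandas as pd


def add_statistical_features(all_data, list_of_period_to_exclude):
    """Add the mean attendance per site, computed on allowed years only.

    "site" and "frequentation" are hypothetical column names.
    """
    # Keep only rows whose school year is not excluded, and drop rows
    # with a missing attendance so NaNs do not distort the means.
    history = all_data[
        ~all_data["annee_scolaire"].isin(list_of_period_to_exclude)
    ].dropna(subset=["frequentation"])

    # One mean per site, computed from the remaining (past) data.
    means = (
        history.groupby("site")["frequentation"]
        .mean()
        .rename("frequentation_moyenne")
        .reset_index()
    )
    # Every row, including those of excluded years, receives the feature.
    return all_data.merge(means, on="site", how="left")


all_data = pd.DataFrame(
    {
        "annee_scolaire": ["2019-2020", "2020-2021", "2021-2022", "2021-2022"],
        "site": ["A", "A", "A", "A"],
        "frequentation": [100.0, 120.0, 50.0, None],
    }
)
features = add_statistical_features(all_data, ["2021-2022"])
```

In this toy data the 2021-2022 rows get a mean built from 2019-2020 and 2020-2021 only, so the target values never leak into the feature.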


fBedecarrats · 2022-01-27T22:01:22Z

The model produces reliable forecasts until the end of school year 2020-2021 (~16 000 meals/day), but it starts generating very low values (~7 000 meals/day) from Sept. 2021 (see detailed outputs for Sept.-Dec. 2021).
The initial function had the following at line 201:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2019-2020") & (all_data["annee_scolaire"] != "2018-2019")]

I replaced this line with:

    remove_real_lines = all_data[(all_data["annee_scolaire"] != "2020-2021")]

and re-launched the model for the same Sept.-Dec. 2021 period. The model outputs changed only slightly (see detailed results after code modification).
So I'm afraid this filter is not the source of the prediction errors since Sept. 2021.
