The main purpose of this package is to provide a Python AutoML class named AML that covers the complete pipeline of a binary classification project from a raw dataset to a deployable model. It can be used as a functions/classes catalogue to ease and speed-up Data Scientists repetitive dev tasks aswell.
You can find the whole code documentation on MLBG59 Readthedoc documentation
- Python 3.7
- pandas 1.0.1
- torch
- xgboost
Since this package is uploaded to PyPI, it can be installed with pip using the terminal :
$ pip install AutoMxL
AML is built as a class inherited from pandas DataFrame. Each Machine Learning step corresponds to a method that can be called with default or filled parameters.
Note : For each method, verbose parameter allows you to get logging informations.
If needed, you can find in Start sub-package functions that facilitate data loading and target encoding.
# import package
from AutoMxL import *
# import data into DataFrame with delimiter auto-detection for csv and txt files
df_raw = import_data('data/bank-additional-full.csv', verbose=False)
# set "yes" category from variable "y" as the classification target.
# => get modified dataset and new target name
df, target = category_to_target(df_raw, var='y' , cat='yes')
# instantiate AML object with dataset and target name
auto_df = AML(df, target=target)
explore method gives you global information about the dataset and automatically identify features types (booleans, dates, verbatims, categoricals, numericals). This information is stored in "d_features" attribute.
auto_df.explore(verbose=False)
print(auto_df.d_features.keys())
> output : dict_keys(['date', 'identifier', 'verbatim', 'boolean', 'categorical', 'numerical', 'NA', 'low_variance'])
preprocess method prepares the data before feeding it to the model :
- removes features with low variance and features identified as verbatims and identifiers
- transforms date features to numeric data (timedelta, ...)
- fills missing values
- processes categorical data (using one hot encoding or Pytorch §NN embedding encoder)
- processes outliers (optional)
auto_df.preprocess(process_outliers=False, cat_method='encoder', verbose=False)
select_features method reduces the features dimension to speed up the modelisation execution time (may increase model performance aswell).
auto_df.select_features(verbose=False)
model_train_test method trains and test models with random search.
- creates models with random hyper-parameters combinations from HP grid
- splits (random 80/20) train/test sets to fit/apply models
- identifies valid models |(auc(train)-auc(test)|<0.03
- gets the best model in respect of a selected metric among valid model
Available classifiers : Random Forest, XGBOOST (and bagging).
d_fitted_models, l_valid_models, best_model_idx, df_model_res = auto_df.model_train_test(verbose=False)
output :
- d_fitted_models: dict containing models and information on test set
- l_valid_models: valid model indexes
- best_model_idx: best model index
- df_model_res: models information and metrics stored in DataFrame
Note : if you prefer to train and test your model separately, you can also use the following modelisation methods:
auto_df.model_train(verbose=False)
d_fitted_models, l_valid_models, best_model_idx, df_model_res = auto_df.model_apply(df_sel, verbose=False)
Once you have applied preprocess and select_features, you can apply the same transformations to any iso-structure dataset using following methods:
df_prep = auto_df.preprocess_apply(df, verbose=False)
df_sel = auto_df.select_features_apply(df_prep, verbose=False)
Since AML is pandas DataFrame inherited class, you can apply any DataFrame methods on it.
Note : copy() method applied on AML object will return a DataFrame. If you need to make a copy of AML object, use duplicate() method instead.
- 1.0.0 : First proper release
- Regression and multi-class classification
Distributed under the MIT license. See License.txt for more information
Maxence Labesse - [email protected]