
Data validation error when using Buckingham's Pi Theorem on Classification task #19

aclemente-bigml opened this issue Jan 29, 2021 · 1 comment

@aclemente-bigml

Hi!
While trying to use AutoFeatClassifier with units, I stumbled upon a validation error caused by an infinite value.
Presumably one of the generated features (I assume one of those coming from the Pi Theorem) has an infinite value, which breaks the StandardScaler used when filtering out correlated features.
This is how I am calling the classifier, fitting it on training data that comes as a numpy ndarray:

from autofeat import AutoFeatClassifier

auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)

These are the features logged for the Pi Theorem; all of them involve divisions, which could lead to a division-by-zero issue (a minimal illustration follows the excerpt):

...
[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x010 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x013 / x001
[AutoFeat] Pi Theorem 6:  x014 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x012 / x015
[AutoFeat] Pi Theorem 8:  x016 / x000
[AutoFeat] Pi Theorem 9:  x017 / x012
...
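
For illustration, here is a minimal sketch of how such a ratio feature ends up infinite (the column names and values below are made up):

import numpy as np
import pandas as pd

# hypothetical stand-ins for x002 and x001; x001 contains a zero
df = pd.DataFrame({"x001": [2.0, 0.0, 4.0], "x002": [1.0, 3.0, 5.0]})

# a Pi-Theorem-style ratio feature: the division by zero yields inf
ratio = df["x002"] / df["x001"]
print(ratio.values)           # [0.5  inf  1.25]
print(np.isinf(ratio).any())  # True -> later trips sklearn's input validation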

The full log output from a failing run is the following:

[AutoFeat] Applying the Pi Theorem
[AutoFeat] Pi Theorem 1:  x002 / x001
[AutoFeat] Pi Theorem 2:  x006 / x000
[AutoFeat] Pi Theorem 3:  x007 / x005
[AutoFeat] Pi Theorem 4:  x003 / x001
[AutoFeat] Pi Theorem 5:  x009 / x001
[AutoFeat] Pi Theorem 6:  x010 / x005
[AutoFeat] Pi Theorem 7:  x000 * x005 * x008 / x011
[AutoFeat] Pi Theorem 8:  x012 / x000
[AutoFeat] Pi Theorem 9:  x013 / x008
[AutoFeat] The 3 step feature engineering process could generate up to 118923 features.
[AutoFeat] With 121 data points this new feature matrix would use about 0.06 gb of space.
[feateng] Step 1: transformation of original features
[feateng] Generated 40 transformed features from 14 original features - done.
[feateng] Step 2: first combination of features
[feateng] Generated 1524 feature combinations from 1431 original feature tuples - done.
[feateng] Step 3: transformation of new features
[feateng] Generated 4564 transformed features from 1524 original features - done.
[feateng] Generated altogether 6233 new features in 3 steps
[feateng] Removing correlated features, as well as additions at the highest level

And after that, the error is reported with the following stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-323-53dcdfc1b68e> in <module>
     32     # categorical_cols = []
     33     auto = AutoFeatClassifier(categorical_cols=categorical_cols, units=units, verbose=1, feateng_steps=3, featsel_runs=5, n_jobs=5, apply_pi_theorem=True)
---> 34     X_train_new = auto.fit_transform(X_train_sampled, y_train_sampled)
     35     X_test_new = auto.transform(X_test.to_numpy())
     36     pretty_names = feature_names(auto, USEFUL_ACTUALS)

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/autofeat.py in fit_transform(self, X, y)
    299         # generate features
    300         df_subs, self.feature_formulas_ = engineer_features(df_subs, self.feateng_cols_, _parse_units(self.units, verbose=self.verbose),
--> 301                                                             self.feateng_steps, self.transformations, self.verbose)
    302         # select predictive features
    303         if self.featsel_runs <= 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/autofeat/feateng.py in engineer_features(df_org, start_features, units, max_steps, transformations, verbose)
    354     if cols:
    355         # check for correlated features again; this time with the start features
--> 356         corrs = dict(zip(cols, np.max(np.abs(np.dot(StandardScaler().fit_transform(df[cols]).T, StandardScaler().fit_transform(df_org))/df_org.shape[0]), axis=1)))
    357         cols = [c for c in cols if corrs[c] < 0.9]
    358     cols = list(df_org.columns) + cols

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    688         if y is None:
    689             # fit method of arity 1 (unsupervised transformation)
--> 690             return self.fit(X, **fit_params).transform(X)
    691         else:
    692             # fit method of arity 2 (supervised transformation)

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
    665         # Reset internal state before fitting
    666         self._reset()
--> 667         return self.partial_fit(X, y)
    668 
    669     def partial_fit(self, X, y=None):

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
    696         X = self._validate_data(X, accept_sparse=('csr', 'csc'),
    697                                 estimator=self, dtype=FLOAT_DTYPES,
--> 698                                 force_all_finite='allow-nan')
    699 
    700         # Even in the case of `with_mean=False`, we update the mean anyway

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    418                     f"requires y to be passed, but the target y is None."
    419                 )
--> 420             X = check_array(X, **check_params)
    421             out = X
    422         else:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    643         if force_all_finite:
    644             _assert_all_finite(array,
--> 645                                allow_nan=force_all_finite == 'allow-nan')
    646 
    647     if ensure_min_samples > 0:

~/.pyenv/versions/features/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     97                     msg_err.format
     98                     (type_err,
---> 99                      msg_dtype if msg_dtype is not None else X.dtype)
    100             )
    101     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains infinity or a value too large for dtype('float64').
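
In case it helps with debugging, a quick check to list which columns contain infinities (assuming the engineered feature matrix is available as a pandas DataFrame, here called df; the name is hypothetical):

import numpy as np

# columns of df that contain +/-inf anywhere
inf_cols = df.columns[np.isinf(df).any(axis=0)]
print(list(inf_cols))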

I tried removing all the constant features from the original dataset, so that every original feature has std() > 0.
It looks like one of the generated features involves a division by zero somewhere, which leads to an infinite value deep in the generated feature matrix.
Maybe there should be some handling there, either dropping the offending feature or replacing the infinities with NaN, which the scalers know to ignore?
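
Something along these lines, as a minimal sketch of the suggested handling (df again stands for the engineered feature matrix; the name is hypothetical):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical feature matrix with an inf produced by a division by zero
df = pd.DataFrame({"f1": [0.5, np.inf, 1.25], "f2": [1.0, 2.0, 3.0]})

# replace +/-inf with NaN; StandardScaler ignores NaNs when fitting
# (it validates with force_all_finite='allow-nan', as in the traceback above)
df_clean = df.replace([np.inf, -np.inf], np.nan)
scaled = StandardScaler().fit_transform(df_clean)  # no ValueError now
print(scaled)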

@cod3licious (Owner)

Thanks for flagging this - I'll have a look & add some extra checks!
