Handle Project producing zero columns #912

hirzel · 2021-12-10T18:18:34Z

It would be nice if users could provide a pipeline with more preprocessing subpipelines than the data needs. For example, if a pipeline contains a branch with one-hot encoding for string columns but the data has only numeric columns, it would be convenient if it worked anyway. Unfortunately, some sklearn operators raise an exception when their input data has zero columns. This issue proposes preventing that exception during fit, and possibly even pruning the affected operators from the pipeline returned by fit.

Example:

import sklearn.datasets
X, y = sklearn.datasets.load_digits(return_X_y=True)

from lale.lib.lale import Project, ConcatFeatures
from lale.lib.sklearn import LogisticRegression, OneHotEncoder

proj_nums = Project(columns={"type": "number"})
proj_cats = Project(columns={"type": "string"})
one_hot = OneHotEncoder(handle_unknown="ignore")
prep = (proj_nums & (proj_cats >> one_hot)) >> ConcatFeatures
trainable = prep >> LogisticRegression()

print(f"shapes: X {X.shape}, y {y.shape}, "
      f"nums {proj_nums.fit(X).transform(X).shape}, "
      f"cats {proj_cats.fit(X).transform(X).shape}")

trained = trainable.fit(X, y)

This prints:

shapes: X (1797, 64), y (1797,), nums (1797, 64), cats (1797, 0)
Traceback (most recent call last):
  File "~/tmp.py", line 17, in <module>
    trained = trainable.fit(X, y)
  File "~/git/user/lale/lale/operators.py", line 3981, in fit
    trained = trainable.fit(X=inputs)
  File "~/git/user/lale/lale/operators.py", line 2526, in fit
    trained_impl = trainable_impl.fit(X, y, **filtered_fit_params)
  File "~/git/user/lale/lale/lib/sklearn/one_hot_encoder.py", line 145, in fit
    self._wrapped_model.fit(X, y)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
    X_list, n_samples, n_features = self._check_X(X)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
    X_temp = check_array(X, dtype=None)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 661, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(1797, 0)) while a minimum of 1 is required.
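To make the proposal concrete, here is a minimal sketch of the "skip on zero columns" idea. It is not Lale's API: `ZeroColumnGuard`, `StrictEncoder`, and `n_columns` are hypothetical names invented for illustration, and `StrictEncoder` is only a stand-in that mimics how sklearn's OneHotEncoder rejects empty input. It uses plain lists so it runs without sklearn.

```python
from typing import Any, Sequence


def n_columns(X: Sequence[Any]) -> int:
    """Column count of a row-major table; uses .shape if present (e.g. NumPy arrays)."""
    shape = getattr(X, "shape", None)
    if shape is not None and len(shape) == 2:
        return shape[1]
    return len(X[0]) if len(X) > 0 else 0


class StrictEncoder:
    """Stand-in (hypothetical) for a transformer that rejects zero-column input."""

    def fit(self, X, y=None):
        if n_columns(X) == 0:
            raise ValueError(
                "Found array with 0 feature(s) while a minimum of 1 is required."
            )
        return self

    def transform(self, X):
        # Placeholder transformation: stringify every cell.
        return [[str(v) for v in row] for row in X]


class ZeroColumnGuard:
    """Hypothetical wrapper: skip the wrapped operator when the input has no columns."""

    def __init__(self, op):
        self.op = op
        self.skipped_ = False

    def fit(self, X, y=None):
        # Remember the decision made at fit time so transform stays consistent.
        self.skipped_ = n_columns(X) == 0
        if not self.skipped_:
            self.op.fit(X, y)
        return self

    def transform(self, X):
        if self.skipped_:
            # Pass the empty table through: one empty row per input row.
            return [[] for _ in X]
        return self.op.transform(X)


empty = [[] for _ in range(3)]  # 3 rows, 0 columns, like the "cats" branch above
guard = ZeroColumnGuard(StrictEncoder()).fit(empty)
print(guard.skipped_, guard.transform(empty))  # prints: True [[], [], []]
```

The `skipped_` flag recorded during fit is also the information a pruning step would need: a trained pipeline could drop every guarded operator whose flag is set. Whether that belongs inside Project, inside each operator wrapper, or in the pipeline's fit is left open here.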

ksrinivs64 · 2021-12-10T20:55:00Z

Martin, we are exploring whether we can add constraints to the planner after using the Lale Project operators to customize the search space for the dataset's characteristics. If that works out, this issue has lower priority. However, we would very much like the ability to project text. Thanks much!

rithram · 2021-12-10T21:02:50Z

It is not clear to me what the expected behaviour is here. scikit-learn's answer is to fail explicitly because the pipeline is doing something invalid. Do we want to automatically correct the pipeline in a data-dependent manner?

Also +1 on text and maybe datetime. I wonder what pandas data types we can leverage here.
