Setup of Scikit-learn Experiments #4

janvanrijn · 2018-06-14T01:07:37Z

The great news:
scikit-learn/scikit-learn#9012

We won't need conditional imputer anymore.

The equally promising news:
scikit-learn/scikit-learn#11190

Sklearn might be able to select categorical and numeric features itself.

The flip side: we will need two imputation components in our flow, which is not supported by OpenML. How are we going to deal with this? There are some possibilities, but what are your thoughts?
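To make the problem concrete, here is a minimal sketch of such a flow, assuming a scikit-learn release that ships `ColumnTransformer` and `SimpleImputer` (0.20 or later). The toy data, the column indices, and the step names are illustrative assumptions, not part of the planned experiment.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumption): columns 0-1 are numeric, column 2 is an
# integer-coded categorical feature; np.nan marks missing values.
X = np.array([
    [1.0, 10.0, 0.0],
    [np.nan, 12.0, 1.0],
    [3.0, np.nan, np.nan],
    [4.0, 16.0, 1.0],
])
y = np.array([0, 1, 0, 1])

numeric_cols = [0, 1]      # assumed to be known up front
categorical_cols = [2]

# Each column group gets its own imputer, so the flow contains
# two separate SimpleImputer instances.
preprocess = ColumnTransformer([
    ("numeric", SimpleImputer(strategy="mean"), numeric_cols),
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

flow = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
flow.fit(X, y)

# List every sub-component that is a SimpleImputer, keyed by its
# unique path inside the flow.
imputer_names = sorted(
    name for name, value in flow.get_params(deep=True).items()
    if isinstance(value, SimpleImputer)
)
print(imputer_names)
```

The same class appears twice under different step names, which is exactly the situation the flow representation has to cope with.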

joaquinvanschoren · 2018-06-14T01:26:36Z

The fact that we can’t support pipelines with multiple instances of the same algorithm seems to really hold us back. Probably the best way is to support it in a version 2 of the API, and then let the client APIs adapt to that as soon as they can?


-- Thank you, Joaquin

mfeurer · 2018-06-14T16:12:07Z

> We won't need conditional imputer anymore.

The feature isn't released yet, therefore, I vote to not use it yet.

> Sklearn might be able to select categorical and numeric features itself.

How would this information be passed to scikit-learn? By providing pandas arrays? It should still be possible to do this manually.

> How are we going to deal with this.

I would love to see this, especially given the fact that the imputer just got a lot better; also given the fact that Joaquin's student wants to parametrize neural networks which would become a lot easier if this feature existed. My suggestion:

As every sub-component has a unique name, this name could be used as an identifier.
For a run, it would require adding the name field to the parameter settings field.
In my naive thinking, it should also be possible to automatically upgrade the database with a script. The webserver could also be backwards compatible for a while if it accepts an XML file without the name and automatically adds a default value (then the component can only be used once), and that backward compatibility could be dropped at some point when the Java, Python and R APIs are updated.
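A rough sketch of how this suggestion could work is given below. The XML layout, the element names (`flow`, `component_name`, `name`, `value`), and the default name are hypothetical and deliberately simplified; they are not OpenML's actual run schema. The sketch keys each parameter setting by the sub-component's name and falls back to a default when the name is missing, so an old-style file stays valid as long as the component is used only once.

```python
import xml.etree.ElementTree as ET

# Hypothetical fallback used when an old-style file carries no component name.
DEFAULT_COMPONENT_NAME = "default"


def parse_parameter_settings(xml_text):
    """Return {(flow, component_name, parameter): value} from a simplified run XML."""
    root = ET.fromstring(xml_text.strip())
    settings = {}
    for node in root.findall("parameter_setting"):
        component = node.findtext("component_name", default=DEFAULT_COMPONENT_NAME)
        key = (node.findtext("flow"), component, node.findtext("name"))
        if key in settings:
            # Without distinct names, the same component cannot appear twice.
            raise ValueError("ambiguous parameter setting: %r" % (key,))
        settings[key] = node.findtext("value")
    return settings


# New-style file: the same flow is used twice, told apart by component name.
NEW_STYLE = """
<run>
  <parameter_setting>
    <flow>SimpleImputer</flow>
    <component_name>numeric_imputer</component_name>
    <name>strategy</name>
    <value>mean</value>
  </parameter_setting>
  <parameter_setting>
    <flow>SimpleImputer</flow>
    <component_name>categorical_imputer</component_name>
    <name>strategy</name>
    <value>most_frequent</value>
  </parameter_setting>
</run>
"""

# Old-style file: no component name, so the default is filled in.
OLD_STYLE = """
<run>
  <parameter_setting>
    <flow>SimpleImputer</flow>
    <name>strategy</name>
    <value>mean</value>
  </parameter_setting>
</run>
"""

new_settings = parse_parameter_settings(NEW_STYLE)
old_settings = parse_parameter_settings(OLD_STYLE)

# An old-style file that uses the component twice is rejected.
duplicated = OLD_STYLE.replace("</run>", OLD_STYLE.strip()[len("<run>"):])
try:
    parse_parameter_settings(duplicated)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Under these assumptions, existing uploads keep working, while new uploads gain the ability to repeat a component.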

janvanrijn · 2018-06-14T19:36:51Z

> The feature isn't released yet, therefore, I vote to not use it yet.

If I am not mistaken it is currently in the master branch.

> How would this information be passed to scikit-learn? By providing pandas arrays? It should still be possible to do this manually.

I assume so. Yet it would be great if this feature could be used, as it would give each flow a single setup id. Currently, a single flow can get different setup ids on different tasks because of different categorical_feature values, which is a big complicating factor when re-using the experiment.
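The effect can be illustrated with a toy model, assuming that a setup is identified purely by the flow plus its parameter values. The `setup_key` helper, the flow label, and the parameter names below are hypothetical; real setup ids are assigned by the OpenML server, not computed like this.

```python
import hashlib
import json


def setup_key(flow_name, params):
    """Hypothetical stand-in for a setup id: a digest of flow name and parameter values."""
    payload = json.dumps({"flow": flow_name, "params": params}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


FLOW = "Pipeline(imputer, onehot, tree)"  # illustrative label only

# Hard-coded categorical column indices differ per dataset, so the
# "same" configuration ends up as two different setups.
key_task_a = setup_key(FLOW, {"categorical_features": [0, 3], "max_depth": 5})
key_task_b = setup_key(FLOW, {"categorical_features": [1, 2, 7], "max_depth": 5})

# If the columns are selected by a dataset-independent rule instead,
# the parameter values, and hence the setup, are identical across tasks.
auto_task_a = setup_key(FLOW, {"select_by": "dtype", "max_depth": 5})
auto_task_b = setup_key(FLOW, {"select_by": "dtype", "max_depth": 5})

print(key_task_a != key_task_b, auto_task_a == auto_task_b)
```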

> The fact that we can’t support pipelines with multiple instances of the same algorithm seems to really hold us back.

I completely agree that this should be a priority to improve on. However, I am very skeptical that this goal will be achieved in the short term, or at least within the time frame in which we want to start the benchmark study. Furthermore, the decision on how to approach this will have long-term implications for OpenML, so I would strongly suggest that we do it the proper way this time rather than apply a quick and dirty patch.

That is why I opened this issue: to discuss how we are going to deal with this in the short term for the benchmark study.

janvanrijn · 2018-09-13T20:52:51Z

I would like to note that scikit-learn versions 0.20 and 0.21 will be the golden opportunity to perform this experiment. We not only have access to SimpleImputer (the new imputation class) but also to the old, deprecated Imputer, which allows us to do all the experimentation without adding 'dummy wrapper classes'.

Pipeline doesn't need to have a second dummy wrapper class, as all pipelines used in the experiment will have a different name in OpenML, and thus are considered different parts.
