Setup of Scikit-learn Experiments #4

Open
janvanrijn opened this issue Jun 14, 2018 · 4 comments
Comments

@janvanrijn
Member

The great news:
scikit-learn/scikit-learn#9012

We won't need the conditional imputer anymore.

The equally promising news:
scikit-learn/scikit-learn#11190

Sklearn might be able to select categorical and numeric features itself.

The flip side: we will need two imputation components in our flow, and OpenML does not support using the same component more than once in a flow. How are we going to deal with this? There are some possibilities, but what are your thoughts?
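To make the "two imputation components" concrete, here is a minimal sketch, assuming the linked feature is scikit-learn's `ColumnTransformer` and that it is combined with `SimpleImputer`. The toy data, the column indices, and the step names `numeric_imputer` and `categorical_imputer` are illustrative assumptions, not part of the planned experiment.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data (assumption): columns 0 and 1 are numeric,
# column 2 is an integer-coded categorical feature.
X = np.array([
    [1.0, 10.0, 0.0],
    [np.nan, 20.0, 1.0],
    [3.0, np.nan, 1.0],
    [5.0, 30.0, np.nan],
])
numeric_columns = [0, 1]
categorical_columns = [2]

# Two instances of the same class, told apart only by their step names.
imputation = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), numeric_columns),
        ("categorical_imputer", SimpleImputer(strategy="most_frequent"), categorical_columns),
    ]
)

X_imputed = imputation.fit_transform(X)
step_names = [name for name, _, _ in imputation.transformers]
```

Both steps share the class `SimpleImputer`, which is exactly the situation the flow representation has to handle: the step name is the only thing that distinguishes them.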

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jun 14, 2018 via email

@mfeurer
Collaborator

mfeurer commented Jun 14, 2018

We won't need the conditional imputer anymore.

The feature isn't released yet; therefore, I vote not to use it for now.

Sklearn might be able to select categorical and numeric features itself.

How would this information be passed to scikit-learn? By providing pandas arrays? It should still be possible to do this manually.

How are we going to deal with this?

I would love to see this, especially since the imputer just got a lot better, and since Joaquin's student wants to parametrize neural networks, which would become a lot easier if this feature existed. My suggestion:

  • As every sub-component has a unique name, this name could be used as an identifier.
  • For a run, this would require adding the name field to the parameter settings field.
    In my naive thinking, it should also be possible to upgrade the database automatically with a script. The webserver could also stay backwards compatible for a while by accepting an XML file without the name and automatically adding a default value (in which case the component can only be used once). That backward compatibility could be dropped at some point, once the Java, Python and R APIs are updated.
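The suggestion above can be sketched in plain Python. This is only an illustration under stated assumptions: `ParameterSetting`, `component_name`, and `index_by_name` are hypothetical names, and the structure does not mirror the actual OpenML XML schema or server code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ParameterSetting:
    """One parameter value of one sub-component in a run (hypothetical structure)."""
    component_class: str                  # class of the sub-component
    parameter: str
    value: str
    component_name: Optional[str] = None  # proposed new field; None means a legacy upload


def index_by_name(settings: List[ParameterSetting]) -> Dict[Tuple[str, str], str]:
    """Index parameter values by (component name, parameter).

    Legacy settings without a name fall back to the class as a default name,
    so a class can then appear only once; a second occurrence is rejected.
    """
    indexed: Dict[Tuple[str, str], str] = {}
    for setting in settings:
        name = setting.component_name or setting.component_class
        key = (name, setting.parameter)
        if key in indexed:
            raise ValueError(f"ambiguous parameter setting for {key}")
        indexed[key] = setting.value
    return indexed


# New-style upload: the same class is used twice, distinguished by name.
named = [
    ParameterSetting("SimpleImputer", "strategy", "mean", "numeric_imputer"),
    ParameterSetting("SimpleImputer", "strategy", "most_frequent", "categorical_imputer"),
]
named_index = index_by_name(named)

# Legacy upload: no names, so the component may occur only once.
legacy = [ParameterSetting("SimpleImputer", "strategy", "mean")]
legacy_index = index_by_name(legacy)
```

With names present, both imputers keep their own `strategy` value. Without names, a second `SimpleImputer` entry for the same parameter raises a `ValueError`, which corresponds to the "component can only be used once" restriction during the backward-compatible period.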

@janvanrijn
Member Author

The feature isn't released yet; therefore, I vote not to use it for now.

If I am not mistaken, it is currently in the master branch.

How would this information be passed to scikit-learn? By providing pandas arrays? It should still be possible to do this manually.

I assume so. Still, it would be great if this feature could be used, as it would give each flow a single setup id across tasks. Currently, a single flow can get different setup ids on different tasks because the categorical_feature values differ, which is a big complicating factor when re-using the experiment.
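The effect on setup ids can be illustrated with a toy model. It assumes only that a setup is identified by the flow together with its parameter values. The function `setup_key`, the hashing scheme, the flow string, and the parameter values are all hypothetical and are not how OpenML actually assigns setup ids.

```python
import hashlib
import json


def setup_key(flow_name, parameters):
    """Toy stand-in for a setup id: a digest of the flow and its parameter values."""
    payload = json.dumps({"flow": flow_name, "parameters": parameters}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]


flow = "Pipeline(imputer,onehotencoder,classifier)"  # hypothetical flow name
shared = {"classifier__max_depth": 5}                # hypothetical hyperparameter

# Categorical column indices stored as a parameter: they differ per task.
key_task_a = setup_key(flow, {**shared, "categorical_feature": [0, 3]})
key_task_b = setup_key(flow, {**shared, "categorical_feature": [1, 2, 7]})

# Column types inferred from the data instead: the parameters are task-independent.
key_inferred_a = setup_key(flow, shared)
key_inferred_b = setup_key(flow, shared)
```

In this model `key_task_a` and `key_task_b` differ although the hyperparameters are the same, while the two inferred keys are identical.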

The fact that we can’t support pipelines with multiple instances of the same algorithm seems to really hold us back.

I completely agree that this should be a priority to improve on. However, I am very skeptical that this goal will be achieved in the short term, or at least within the time frame in which we want to start the benchmark study. Furthermore, the decision on how to approach this will have long-term implications for OpenML, so I would strongly suggest that we do it the proper way this time, rather than applying a quick and dirty patch.

That is why I opened this issue: to discuss how we are going to deal with this in the short term for the benchmark study.

@janvanrijn
Member Author

I would like to note that scikit-learn versions 0.20 and 0.21 will be the golden opportunity to perform this experiment. We not only have access to SimpleImputer (the new imputation class), we also have access to the old, deprecated Imputer, which allows us to do all the experimentation without adding 'dummy wrapper classes'.

The Pipeline doesn't need a second dummy wrapper class either, as all pipelines used in the experiment will have a different name in OpenML and are thus considered different components.
