Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scarliles/splitter injection #61

Open
wants to merge 20 commits into
base: submodulev3
Choose a base branch
from

Conversation

SamuelCarliles3
Copy link
Collaborator

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Implements ability to inject split accept/reject conditions into Splitter from python.

Any other comments?

Function signatures and code organization are expected to undergo further refactoring.

Copy link

github-actions bot commented Mar 18, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


cython-lint

cython-lint detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed cython-lint version is cython-lint=0.16.0.


/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:50:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:60:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:63:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:101:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:124:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:145:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:173:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:177:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:201:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:205:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:332:9: dangerous default value!
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:333:9: dangerous default value!
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:383:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:387:5: E303 too many blank lines (2)
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:392:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:404:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:414:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:839:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:845:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:868:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:871:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:875:1: W293 blank line contains whitespace

Generated for commit: 9d6091a. Link to the linter CI: here

Copy link
Collaborator

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few initial thoughts:

  1. If this has negligible performance issues compared to scikit-learn fork and scikit-learn main (then I assume it would also for scikit-tree), then very cool and I think this is the right way to go. So my immediate thought is compare the runtime performance vs n_samples and n_dimensions for 100, 500 and 1000 trees. This is the probably the most important asap to determine if this is a suitable route. excited to chat about that.

  2. If there are no performance issues, then the next question I have is how do we make this as usable as possible? We would like to ideally do some checks upon instantiation of the Splitter class, so that way it determines if any kind of SplitConditionTuple is valid and will not result in a seg fault/etc. Rn, it seems very easy to do so potentially, so we can construct some guardrails perhaps? Idk how to best do this atm though.

@SamuelCarliles3
Copy link
Collaborator Author

2. If there are no performance issues, then the next question I have is how do we make this as usable as possible? We would like to ideally do some checks upon instantiation of the Splitter class, so that way it determines if any kind of SplitConditionTuple is valid and will not result in a seg fault/etc. Rn, it seems very easy to do so potentially, so we can construct some guardrails perhaps? Idk how to best do this atm though.

To be explicit, there are two sus moves happening in this design pattern:

  1. the unencapsulated existence of the underlying condition functions
  2. the type unsafety of the condition parameter struct payloads that have to be cast inside the condition functions

These are what I would characterize as necessary evils resulting from technical constraints of cython -- the conditions as executed in say node_split_best can't be cython extension types because a) cpp template containers can't hold cython extension types, and b) cython extension types can't be selected out of a python iterable in a nogil block/function. If we want encapsulation, our options are to write the whole shebang in native cpp and wrap it, or to accept the gil, or to forego dynamic (from c/cpp pov) constraint count. The extension type/nogil tradeoff similarly applies to the parameter struct typecast situation.

So my intent with the wrapper classes was that those (function, parameter struct) tuple structs never be created by hand; when you want to add a new type of condition, you write the condition function, the function-specific parameter payload struct, and the extension type wrapper that provides a usable python interface, and the condition functions and parameter structs are never called by hand. Once the wrapper class is there it can be used dynamically in any python context without having to disturb existing cython code.

To mitigate the aforementioned shortcomings only two things immediately come to my mind:

  1. documenting the design pattern with the why and the how and the why-not-X
  2. adding underscores to the condition function names (and possibly the payload structs? is that legal cython?)

I am definitely open to other suggestions.

@adam2392
Copy link
Collaborator

Say we implement the template in C++. Would it be a lot you think? I'm not opposed to supporting c++ code as long as it's short and enables stuff we otw can't do easily with Cython alone.

If uncertain, then I'm okay keeping the current design for now. I'd rather get something working and benchmarked first.

@SamuelCarliles3
Copy link
Collaborator Author

My personal opinion: we're using a duck-typed language to wrap the mother of all unsafe languages. It's the programming equivalent of duct-taping a rocket launcher to a table saw. I think it's fair to do the two mitigations I proposed (and any other simple ones that come to mind) and accept that future devs have some responsibility to know what they're doing.

@adam2392
Copy link
Collaborator

Fair point. Let's see how the benchmarks turn out then with the current approach!

@adam2392
Copy link
Collaborator

On additional note: It would also be good to confirm this parallelizes fine: #61 (review)

E.g. n_jobs = 1 vs >1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants