-
I think that's a great idea. I don't know too much about the internals of these libraries, but I want to clarify something about using them with `n_estimators=1`. If you want to see significant speedups along the lines of option (A), it would make the most sense to do the binning/sorting/etc. that the fast boosting algorithms do upfront, that is, before fitting any trees; option (A) repeats this processing every time a tree is fit. Implementing that is basically a matter of reverse-engineering lightgbm, etc. and incorporating those tricks into the core ngboost code in a way that keeps the modularity with respect to base learners (i.e. the optimizations are ignored if the user chooses not to use trees). Alternatively, one might fork the ngboost codebase into something like a fast-ngboost project and rewrite it entirely in C++ (say) with histogram trees baked in as the only option for base learners. The latter is a lot more work, but even the former would require a reasonable amount of work with the internals of the current code.
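To make the "bin upfront" idea concrete, here is a minimal sketch assuming plain scikit-learn. `KBinsDiscretizer` is my stand-in here, not anything from the ngboost or lightgbm codebases, and it mimics only the first step of what the fast libraries do internally (quantile-binning the features once, before any trees are fit), not the full histogram-tree machinery:

```python
# Minimal sketch: bin the features once, up front, instead of letting
# every base-learner fit re-derive its own split candidates.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

from ngboost import NGBRegressor

rng = np.random.RandomState(0)
X = rng.randn(10_000, 20)
y = X[:, 0] + 0.1 * rng.randn(10_000)

# Quantile-bin each feature into at most 255 ordinal buckets, exactly once.
binner = KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

# The default DecisionTreeRegressor base learner now splits on coarse
# integer codes, so each node sees far fewer candidate thresholds.
ngb = NGBRegressor(verbose=False).fit(X_binned, y)
```

The real win described above would go further: cache the binned/sorted representation inside ngboost itself so no per-tree preprocessing happens at all.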
-
Either or both would be good :)
…On Wed, Jun 16, 2021, 8:15 AM kmedved wrote:
Do you mean a PR suggesting the change for the documentation, or actually
changing the default base learner to be HistGradientBoostingRegressor?
-
Have there been any updates on this? I have been trying to use NGBoost with a 5 GB dataset but have had no success: it times out in Colab Pro, and I have been hesitant to run it locally since it would tie up my PC for (seemingly) a couple of days.
-
I don't have a clear update here.
-
One of the best parts about ngboost is that it exposes the base learner to the user. It has occurred to me that this makes material speedups available in training if you replace the current scikit-learn `DecisionTreeRegressor` default base learner with something faster. The issue is essentially that `DecisionTreeRegressor` performs exact, per-node split searching, which is slow on larger datasets. By using a more optimized, histogram-based base learner written in C++ (with a Python wrapper), it should be possible to achieve significant speedups in training without sacrificing functionality.

In that vein, I have been experimenting with using `LightGBM`, `CatBoost`, and `HistGradientBoostingRegressor`, all with an `n_estimators` setting of 1, to achieve this. The first two are written in C++, while `HistGradientBoostingRegressor` is implemented in Cython within scikit-learn (thus offering C-like speeds behind a pure-Python API). The results, based on my early testing, are as expected, offering speedups of 2.5x (CatBoost) to 5x (LightGBM) relative to `DecisionTreeRegressor` on a 10K-row, 20-feature dataset. There may also be accuracy benefits, but I'm less confident there since I haven't done much tuning in this testing.

`HistGradientBoostingRegressor` in particular seems appealing in this respect, since it's included in scikit-learn already, although each of the three offers some benefits, such as missing-data handling, categorical-variable support, multiprocessing, GPU support (probably not useful here), more loss functions, and a wider range of tunable hyperparameters. `CatBoost`, while the slowest of the three, is also interesting since it has very good/dynamic default hyperparameters, thus potentially offering ngboost strong 'out of the box' performance.

I am interested in whether there are any concerns with this approach (theoretical or practical) that I'm missing, particularly with respect to the histogram nature of these boosting algorithms. Potentially it would make sense to add `HistGradientBoostingRegressor` in particular to the documentation as a way to speed up training for users, if this makes sense. Obviously users are free to arrive at this decision themselves, but it's somewhat counterintuitive to use a boosting library without any boosting, so if this approach has merit, a suggestion may be helpful. This would help address perhaps the most common question I see re: ngboost, which is speed of training. A minimal sketch of the setup is below.

Interested in thoughts/concerns.
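For concreteness, here is a rough sketch of the swap described above, assuming ngboost's `Base` keyword accepts any scikit-learn-style regressor (as it does for the default `DecisionTreeRegressor`); the hyperparameters are illustrative, not tuned:

```python
# Rough sketch: one histogram-based tree per ngboost round instead of the
# default exact DecisionTreeRegressor.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: needed on scikit-learn < 1.0
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_regression

from ngboost import NGBRegressor

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)

# A single tree per boosting round. Note this class calls the round count
# max_iter; LGBMRegressor uses n_estimators and CatBoostRegressor uses
# iterations, so set whichever applies to 1.
base = HistGradientBoostingRegressor(max_iter=1, max_depth=3)

ngb = NGBRegressor(Base=base, n_estimators=500, verbose=False)
ngb.fit(X, y)
print(ngb.predict(X[:5]))  # point predictions from the fitted distribution
```

One caveat, echoing the reply above: each `fit` call still re-bins its inputs internally, so this captures the per-tree speedup but not the bin-once-upfront savings.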