Add support for XGBoost and LightGBM #165
Merged
Conversation
This test is currently failing if you have xgboost installed. If you don't have xgboost installed, it skips itself to prevent failures due to missing packages and dependencies.
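A minimal sketch of that skip-if-missing pattern (the flag, marker, and test names here are illustrative, not hlink's exact code):

```python
import importlib.util

import pytest

# True only if the xgboost package is importable in this environment.
xgboost_available = importlib.util.find_spec("xgboost") is not None

# A reusable marker: tests decorated with it skip themselves when
# xgboost is missing, instead of failing on a broken import.
requires_xgboost = pytest.mark.skipif(
    not xgboost_available,
    reason="requires the xgboost Python package",
)


@requires_xgboost
def test_train_xgboost_model():
    import xgboost  # safe: the marker guarantees it's installed
```

Running the suite without xgboost installed then reports these tests as skipped rather than errored.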
…ype xgboost This is only possible when we have the xgboost module, so raise an error if that is not present.
…odel This test is failing right now because we also need pyarrow>=4 when using xgboost. We should add this as a dependency in the xgboost extra. If xgboost isn't installed, this test skips itself.
…tras This should let us have two different test setups for each Python version. One with xgboost, one without.
I've also updated pytest to be more verbose for clarity.
Like some of the other models, xgboost returns an array of probabilities like [probability_no, probability_yes]. So we extract just probability_yes as our probability for hlink purposes.
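A tiny illustration of that extraction (the values and variable names are made up for this example):

```python
# xgboost-style output: one [probability_no, probability_yes]
# pair per record. hlink keeps only probability_yes.
predictions = [
    [0.91, 0.09],
    [0.25, 0.75],
]

# Index 1 is probability_yes.
probability_yes = [pair[1] for pair in predictions]
print(probability_yes)  # [0.09, 0.75]
```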
xgboost has a different setup for feature importances, so the current logic ignores it. We'll need to update the save model metadata step to include logic specifically for xgboost.
This is really different from the Spark models, so I've made it a special case instead of trying to integrate it closely with the previous logic. This section might be due for some refactoring now.
This also updates Alabaster to 1.0.0.
This installs SynapseML, Microsoft's Apache Spark integrations package. It has a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.
This is currently failing.
One of these is failing because there's a bug where LightGBM throws an error on interacted features.
Usually we don't care about the names of the vector attributes. But LightGBM uses them as feature names and disallows some characters in the names. Unfortunately, one of these characters is :, and Spark's Interaction names the output of an interaction between A and B "A:B". I looked through the Spark code and didn't see any way to configure the names of these output features. So I think the easiest way forward here is to make a transformer that renames the attributes of a vector by removing some characters and replacing them with another.
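The core of the renaming idea can be sketched in plain Python (this is an illustration of the logic, not hlink's actual transformer, and the function name is made up):

```python
def rename_attributes(names, strings_to_replace=(":",), replacement="_"):
    """Replace disallowed substrings in each vector attribute name."""
    renamed = []
    for name in names:
        for disallowed in strings_to_replace:
            name = name.replace(disallowed, replacement)
        renamed.append(name)
    return renamed


# Spark's Interaction names the interaction of A and B "A:B",
# and LightGBM rejects the colon:
print(rename_attributes(["A:B", "C"]))  # ['A_B', 'C']
```

In the real pipeline this logic has to operate on the vector column's metadata rather than a plain list of strings.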
… some extra params
The bug was that we didn't propagate the metadata changes into Java, so they weren't persistent in something like a Pipeline. By calling withMetadata(), we should now be persisting our changes correctly.
…tion output for LightGBM
…e post-transformer
…eVectorAttributes
This should hopefully let executors find the jars as well as the driver. I've added some comments because this is a bit gnarly.
This is a resource directory for the Metals Scala LSP provider.
The Spark Bucketizer adds commas to vector slot names, which cause problems with LightGBM later in the pipeline. This is similar to the issue with colons for Interaction, but the metadata for bucketized vectors is a little bit different. So RenameVectorAttributes needed to change a bit to handle the two different forms of metadata.
Generally clean up some small mistakes. I also added a comment to the logic that removes the commas in core/pipeline.py.
Overview
This PR closes #161 and closes #162 by adding support for the XGBoost and LightGBM machine learning models.
We've opted to make both of these models available through optional installation with the Python extras syntax. The XGBoost-PySpark integration is still experimental/unstable, and LightGBM has some significant Scala dependencies, so we thought it best to let users decide whether they wanted to bother with these models and their dependencies or not. This means that this PR should be entirely additive, and current users should not see any changes in hlink unless they start using XGBoost and/or LightGBM.
You can install the Python dependencies of these packages with `pip`:

```
pip install hlink[xgboost]
pip install hlink[lightgbm]
# You can install both at once too
pip install hlink[xgboost,lightgbm]
```
In the code, we detect whether these features are available by trying to import their respective Python packages and setting a flag to indicate whether the import succeeded or failed.
This has the nice benefit of keeping all of our imports at the top of the file, not in functions or methods. If a user tries to use XGBoost or LightGBM without the required Python package installed, we throw an error with a (hopefully helpful) detailed error message.
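A sketch of that detection pattern (the flag and helper names are made up for illustration, not hlink's actual identifiers):

```python
# Try the optional import once, at the top of the module, and record
# whether it succeeded.
try:
    import xgboost  # noqa: F401

    xgboost_available = True
except ImportError:
    xgboost_available = False


def require_xgboost() -> None:
    """Raise a detailed error if the xgboost extra isn't installed."""
    if not xgboost_available:
        raise ModuleNotFoundError(
            "the xgboost module is not available; "
            "install it with 'pip install hlink[xgboost]'"
        )
```

Code paths that need XGBoost call the check first, so users see the install hint instead of a bare `ImportError` deep in a stack trace.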
Finally, the tests also detect whether the needed packages are installed, and they use `pytest.mark.skipif` to skip themselves if the packages aren't available. We've updated the CI/CD matrix to run with and without these dependencies, so now there are 6 total CI/CD test runs.

Code Changes Needed for the New Models
For the most part, the new models integrate cleanly with the existing code. There were a few areas where we had to make some updates.
- The `Interaction` class automatically adds colons to its output vector's attribute names. Eventually these colons end up in the names of the output column of the `VectorAssembler`. This caused LightGBM to fail with an error. So we've made a new `RenameVectorAttributes` class which transforms a vector column by replacing strings in its attribute names. We now add a `RenameVectorAttributes` object to the pipeline each time we add an `Interaction`, so that we can eliminate the colons from the attribute names.
- In `hlink.spark.session`, we now install LightGBM's Scala dependencies if the `synapse.ml` Python package is present, using the `ADD JAR` SQL command. I had trouble using the `spark.jars.packages` configuration setting, although it really seemed like it should have worked.
- There's a new `hlink.tests.markers` module with `requires_xgboost` and `requires_lightgbm` markers (decorators). These automatically skip the decorated tests when the required Python packages are not present. This works nicely for testing both locally and in CI/CD, and if you run with `pytest -ra`, it will tell you why it skipped the tests.

To Do Items (Nice to Have)