
Add support for XGBoost and LightGBM #165

Merged
merged 48 commits into main from add_new_ml_algs on Dec 4, 2024
Conversation

@riley-harper (Contributor) commented Nov 25, 2024

Overview

This PR closes #161 and closes #162 by adding support for the XGBoost and LightGBM machine learning models.

We've opted to make both of these models available through optional installation with the Python extras syntax. The XGBoost-PySpark integration is still experimental/unstable, and LightGBM has some significant Scala dependencies, so we thought it best to let users decide whether they wanted to bother with these models and their dependencies or not. This means that this PR should be entirely additive, and current users should not see any changes in hlink unless they start using XGBoost and/or LightGBM.

You can install the Python dependencies of these packages with pip.

pip install hlink[xgboost]
pip install hlink[lightgbm]
# You can install both at once too
pip install hlink[xgboost,lightgbm]

In the code, we detect whether these features are available by trying to import their respective Python packages and setting a flag to indicate whether the import succeeded or failed.

try:
    import synapse.ml.lightgbm
except ModuleNotFoundError:
    _lightgbm_available = False
else:
    _lightgbm_available = True

try:
    import xgboost.spark
except ModuleNotFoundError:
    _xgboost_available = False
else:
    _xgboost_available = True

This has the nice benefit of keeping all of our imports at the top of the file instead of inside functions or methods. If a user tries to use XGBoost or LightGBM without the required Python package installed, we raise an error with a (hopefully helpful) detailed message.

ModuleNotFoundError: To use the 'lightgbm' model type, you need to install the synapseml
Python package, which provides LightGBM-Spark integration, and its dependencies. Try installing
hlink with the lightgbm extra:

                   pip install hlink[lightgbm]
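
For illustration, the availability flags can gate model construction like this. This is a minimal sketch: the factory function here is hypothetical, not hlink's exact code.

def _create_model(model_type, params):
    # Hypothetical factory illustrating the guard; hlink's real
    # model-construction code may be structured differently.
    if model_type == "lightgbm":
        if not _lightgbm_available:
            raise ModuleNotFoundError(
                "To use the 'lightgbm' model type, you need to install the synapseml\n"
                "Python package, which provides LightGBM-Spark integration, and its\n"
                "dependencies. Try installing hlink with the lightgbm extra:\n\n"
                "    pip install hlink[lightgbm]"
            )
        return synapse.ml.lightgbm.LightGBMClassifier(**params)
    # ... similar checks for "xgboost" and the pre-existing model types.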

Finally, the tests also detect whether the needed packages are installed and use pytest.mark.skipif to skip themselves when the packages aren't available. We've updated the CI/CD matrix to run with and without these dependencies, so there are now six CI/CD test runs in total:

Python version    hlink extras
3.10              dev
3.10              dev,xgboost,lightgbm
3.11              dev
3.11              dev,xgboost,lightgbm
3.12              dev
3.12              dev,xgboost,lightgbm

Code Changes Needed for the New Models

For the most part, the new models integrate cleanly with the existing code. There were a few areas where we had to make some updates.

  1. LightGBM does not support feature names with special characters like colons. Spark's Interaction class automatically adds colons to its output vector's attribute names, and those colons eventually end up in the attribute names of the VectorAssembler's output column, which caused LightGBM to fail with an error. So we've made a new RenameVectorAttributes class which transforms a vector column by replacing strings in its attribute names. We now add a RenameVectorAttributes stage to the pipeline each time we add an Interaction, so that we can strip the colons from the attribute names (see the sketch after this list).
  2. Both XGBoost and LightGBM have multiple feature importance metrics, while all of the previous models had only one. For both we've chosen to report weight (the number of splits each feature causes) and gain (the total gain across all splits made by the feature); a sketch of the two APIs follows this list. This took some refactoring of training step 3 (save model metadata). It's still a little clunky, but is hopefully easy enough to read.
  3. In hlink.spark.session, we now install LightGBM's Scala dependencies with the ADD JAR SQL command if the synapse.ml Python package is present (sketched after this list). I had trouble using the spark.jars.packages configuration setting, although it really seemed like it should have worked.
  4. In the tests, we've made a new hlink.tests.markers module with requires_xgboost and requires_lightgbm markers (decorators). These automatically skip the decorated tests when the required Python packages are not present; a sketch follows the summary below. This works nicely both locally and in CI/CD, and if you run with pytest -ra, it will tell you why it skipped the tests:
=================================================== short test summary info ===================================================
SKIPPED [1] hlink/tests/core/classifier_test.py:10: requires the lightgbm library
SKIPPED [1] hlink/tests/core/classifier_test.py:21: requires the xgboost library
SKIPPED [1] hlink/tests/integration_score_with_trained_models_test.py:497: requires the lightgbm library
SKIPPED [1] hlink/tests/integration_score_with_trained_models_test.py:956: requires the xgboost library
SKIPPED [1] hlink/tests/training_test.py:435: requires the lightgbm library
SKIPPED [1] hlink/tests/training_test.py:498: requires the lightgbm library
SKIPPED [1] hlink/tests/training_test.py:556: requires the xgboost library
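
Here's roughly what one of those markers looks like. This is a sketch of the idea behind hlink.tests.markers; the real module may differ in detail.

import pytest

try:
    import xgboost.spark  # noqa: F401
except ModuleNotFoundError:
    _xgboost_available = False
else:
    _xgboost_available = True

# Decorating a test with this marker skips it when xgboost is missing,
# with the reason shown in the pytest -ra summary above.
requires_xgboost = pytest.mark.skipif(
    not _xgboost_available, reason="requires the xgboost library"
)

A test then opts in by placing @requires_xgboost above its definition.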
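
For item 1, the core of the attribute-renaming transformer can be sketched like this. The constructor arguments here are illustrative, and the real class also handles the slightly different Bucketizer metadata form mentioned in the commits below.

from pyspark.ml import Transformer
from pyspark.sql import DataFrame

class RenameVectorAttributes(Transformer):
    """Rename a vector column's attributes via string replacement."""

    def __init__(self, input_col, strings_to_replace, replace_with):
        super().__init__()
        self.input_col = input_col
        self.strings_to_replace = strings_to_replace
        self.replace_with = replace_with

    def _transform(self, dataset: DataFrame) -> DataFrame:
        metadata = dataset.schema[self.input_col].metadata
        # Vector attribute names live in the column metadata, grouped
        # by attribute type ("numeric", "binary", "nominal").
        for group in metadata.get("ml_attr", {}).get("attrs", {}).values():
            for attribute in group:
                if "name" in attribute:
                    for target in self.strings_to_replace:
                        attribute["name"] = attribute["name"].replace(
                            target, self.replace_with
                        )
        # withMetadata() propagates the change to the JVM so that it
        # persists through a Pipeline (see the commit notes below).
        return dataset.withMetadata(self.input_col, metadata)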
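
For item 2, the two libraries expose these metrics through different APIs. Roughly (the importance-type names come from each library's documentation; hlink's exact calls may differ):

# XGBoost: the fitted model returns a dict keyed by feature name.
weight = xgboost_model.get_feature_importances(importance_type="weight")
gain = xgboost_model.get_feature_importances(importance_type="total_gain")

# LightGBM: the fitted model returns a list of scores in feature order;
# its "split" importance type corresponds to weight.
weight = lightgbm_model.getFeatureImportances("split")
gain = lightgbm_model.getFeatureImportances("gain")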
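
For item 3, the conditional install amounts to something like the following. The Maven coordinates and version are illustrative, not hlink's pinned ones.

try:
    import synapse.ml  # noqa: F401
except ModuleNotFoundError:
    _synapseml_available = False
else:
    _synapseml_available = True

def _install_lightgbm_jars(spark_session):
    if _synapseml_available:
        # ADD JAR with an ivy:// URI resolves the package plus its
        # transitive dependencies and makes them visible to the driver
        # and the executors.
        spark_session.sql("ADD JAR ivy://com.microsoft.azure:synapseml_2.12:1.0.8")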

To Do Items (Nice to Have)

  • See if we can reduce the verbosity of the Spark logs when Spark installs the additional LightGBM Scala dependencies.
  • See if we can make the verbosity of XGBoost and LightGBM configurable from the hlink config file.
  • Rewrite the introduction to the Models Sphinx docs page to include information pertinent to XGBoost and LightGBM.

riley-harper and others added 30 commits November 14, 2024 15:11
This test is currently failing if you have xgboost installed. If you don't have
xgboost installed, it skips itself to prevent failures due to missing packages
and dependencies.
…ype xgboost

This is only possible when we have the xgboost module, so raise an error if
that is not present.
…odel

This test is failing right now because we also need pyarrow>=4 when using
xgboost. We should add this as a dependency in the xgboost extra. If xgboost
isn't installed, this test skips itself.
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
I've also updated pytest to be more verbose for clarity.
Like some of the other models, xgboost returns an array of probabilities like
[probability_no, probability_yes]. So we extract just probability_yes as our
probability for hlink purposes.
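
A sketch of that extraction with PySpark's vector_to_array (not necessarily the exact hlink code):

from pyspark.ml.functions import vector_to_array

# The probability column is a vector [probability_no, probability_yes];
# index 1 is the positive-class probability.
predictions = predictions.withColumn(
    "probability", vector_to_array("probability")[1]
)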
xgboost has a different setup for feature importances, so the current logic
ignores it. We'll need to update the save model metadata step to include logic
specifically for xgboost.
This is really different from the Spark models, so I've made it a special case
instead of trying to integrate it with the previous logic closely. This section
might be due for some refactoring now.
This also updates Alabaster to 1.0.0.
This installs SynapseML, Microsoft's Apache Spark integrations package. It has
a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
One of these is failing because there's a bug where LightGBM throws an error on
interacted features.
Usually we don't care about the names of the vector attributes. But LightGBM
uses them as feature names and disallows some characters in the names.
Unfortunately, one of these characters is :, and Spark's Interaction names the
output of an interaction between A and B "A:B". I looked through the Spark code
and didn't see any way to configure the names of these output features. So I
think the easiest way forward here is to make a transformer that renames the
attributes of a vector by removing some characters and replacing them with
another.
The bug was that we didn't propagate the metadata changes into Java, so they
weren't persistent in something like a Pipeline. By calling withMetadata(), we
should now be persisting our changes correctly.
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
We now compute two feature importances for each model.

- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
I'm still not entirely happy with this, but it's a tricky point in the code
because most of the models behave one way, but xgboost and lightgbm are
different. Some more refactoring might be in order.
This should hopefully let executors find the jars as well as the driver. I've
added some comments because this is a bit gnarly.
This is a resource directory for the Metals Scala LSP provider.
@riley-harper riley-harper requested a review from ccdavis November 25, 2024 17:33
- It turns out that multi-line TOML tables aren't allowed. So let's use the
  [training.chosen_model] syntax instead.
- I clarified the introductory information and made it general enough to apply
  to XGBoost and LightGBM as well.
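
For example (with illustrative model parameters), an inline TOML table can't span lines, but the table-header form can hold any number of settings:

# Not valid TOML: inline tables must fit on one line.
# chosen_model = { type = "xgboost",
#                  max_depth = 5 }

[training.chosen_model]
type = "xgboost"
max_depth = 5
eta = 0.5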
The Spark Bucketizer adds commas to vector slot names, which cause
problems with LightGBM later in the pipeline. This is similar to the
issue with colons for Interaction, but the metadata for bucketized
vectors is a little bit different. So RenameVectorAttributes needed to
change a bit to handle the two different forms of metadata.
Generally clean up some small mistakes. I also added a comment to the
logic that removes the commas in core/pipeline.py.
@riley-harper riley-harper merged commit c52d835 into main Dec 4, 2024
6 checks passed
@riley-harper riley-harper deleted the add_new_ml_algs branch December 4, 2024 19:40