
Add support for XGBoost and LightGBM #165

Merged
merged 48 commits into main from add_new_ml_algs on Dec 4, 2024
Conversation

@riley-harper (Contributor) commented Nov 25, 2024

Overview

This PR closes #161 and closes #162 by adding support for the XGBoost and LightGBM machine learning models.

We've opted to make both of these models available through optional installation with the Python extras syntax. The XGBoost-PySpark integration is still experimental/unstable, and LightGBM has some significant Scala dependencies, so we thought it best to let users decide whether they wanted to bother with these models and their dependencies or not. This means that this PR should be entirely additive, and current users should not see any changes in hlink unless they start using XGBoost and/or LightGBM.

You can install the Python dependencies of these packages with pip.

pip install hlink[xgboost]
pip install hlink[lightgbm]
# You can install both at once too
pip install hlink[xgboost,lightgbm]

In the code, we detect whether these features are available by trying to import their respective Python packages and setting a flag to indicate whether the import succeeded or failed.

try:
    import synapse.ml.lightgbm
except ModuleNotFoundError:
    _lightgbm_available = False
else:
    _lightgbm_available = True

try:
    import xgboost.spark
except ModuleNotFoundError:
    _xgboost_available = False
else:
    _xgboost_available = True

This has the nice benefit of keeping all of our imports at the top of the file instead of inside functions or methods. If a user tries to use XGBoost or LightGBM without the required Python package installed, we raise an error with a (hopefully helpful) detailed message.

ModuleNotFoundError: To use the 'lightgbm' model type, you need to install the synapseml
Python package, which provides LightGBM-Spark integration, and its dependencies. Try installing
hlink with the lightgbm extra:

                   pip install hlink[lightgbm]
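
For illustration, the availability flags can gate model construction like this. This is a minimal sketch: the factory function here is hypothetical, not hlink's exact code.

def _create_model(model_type, params):
    # Hypothetical factory illustrating the guard; hlink's real
    # model-construction code may be structured differently.
    if model_type == "lightgbm":
        if not _lightgbm_available:
            raise ModuleNotFoundError(
                "To use the 'lightgbm' model type, you need to install the synapseml\n"
                "Python package, which provides LightGBM-Spark integration, and its\n"
                "dependencies. Try installing hlink with the lightgbm extra:\n\n"
                "    pip install hlink[lightgbm]"
            )
        return synapse.ml.lightgbm.LightGBMClassifier(**params)
    # ... similar checks for "xgboost" and the pre-existing model types.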

Finally, the tests also detect whether the needed packages are installed and use pytest.mark.skipif to skip themselves when the packages aren't available. We've updated the CI/CD matrix to run with and without these dependencies, so there are now six CI/CD test runs in total:

Python version    hlink extras
3.10              dev
3.10              dev,xgboost,lightgbm
3.11              dev
3.11              dev,xgboost,lightgbm
3.12              dev
3.12              dev,xgboost,lightgbm

Code Changes Needed for the New Models

For the most part, the new models integrate cleanly with the existing code. There were a few areas where we had to make some updates.

  1. LightGBM does not support feature names with special characters like colons. Spark's Interaction class automatically adds colons to its output vector's attribute names, and those colons eventually end up in the attribute names of the VectorAssembler's output column, which caused LightGBM to fail with an error. So we've made a new RenameVectorAttributes class which transforms a vector column by replacing strings in its attribute names. We now add a RenameVectorAttributes stage to the pipeline each time we add an Interaction, so that we can strip the colons from the attribute names (see the sketch after this list).
  2. Both XGBoost and LightGBM have multiple feature importance metrics, while all of the previous models had only one. For both we've chosen to report weight (the number of splits each feature causes) and gain (the total gain across all splits made by the feature); a sketch of the two APIs follows this list. This took some refactoring of training step 3 (save model metadata). It's still a little clunky, but is hopefully easy enough to read.
  3. In hlink.spark.session, we now install LightGBM's Scala dependencies with the ADD JAR SQL command if the synapse.ml Python package is present (sketched after this list). I had trouble using the spark.jars.packages configuration setting, although it really seemed like it should have worked.
  4. In the tests, we've made a new hlink.tests.markers module with requires_xgboost and requires_lightgbm markers (decorators). These automatically skip the decorated tests when the required Python packages are not present; a sketch follows the summary below. This works nicely both locally and in CI/CD, and if you run with pytest -ra, it will tell you why it skipped the tests:
=================================================== short test summary info ===================================================
SKIPPED [1] hlink/tests/core/classifier_test.py:10: requires the lightgbm library
SKIPPED [1] hlink/tests/core/classifier_test.py:21: requires the xgboost library
SKIPPED [1] hlink/tests/integration_score_with_trained_models_test.py:497: requires the lightgbm library
SKIPPED [1] hlink/tests/integration_score_with_trained_models_test.py:956: requires the xgboost library
SKIPPED [1] hlink/tests/training_test.py:435: requires the lightgbm library
SKIPPED [1] hlink/tests/training_test.py:498: requires the lightgbm library
SKIPPED [1] hlink/tests/training_test.py:556: requires the xgboost library
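
Here's roughly what one of those markers looks like. This is a sketch of the idea behind hlink.tests.markers; the real module may differ in detail.

import pytest

try:
    import xgboost.spark  # noqa: F401
except ModuleNotFoundError:
    _xgboost_available = False
else:
    _xgboost_available = True

# Decorating a test with this marker skips it when xgboost is missing,
# with the reason shown in the pytest -ra summary above.
requires_xgboost = pytest.mark.skipif(
    not _xgboost_available, reason="requires the xgboost library"
)

A test then opts in by placing @requires_xgboost above its definition.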
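
For item 1, the core of the attribute-renaming transformer can be sketched like this. The constructor arguments here are illustrative, and the real class also handles the slightly different Bucketizer metadata form mentioned in the commits below.

from pyspark.ml import Transformer
from pyspark.sql import DataFrame

class RenameVectorAttributes(Transformer):
    """Rename a vector column's attributes via string replacement."""

    def __init__(self, input_col, strings_to_replace, replace_with):
        super().__init__()
        self.input_col = input_col
        self.strings_to_replace = strings_to_replace
        self.replace_with = replace_with

    def _transform(self, dataset: DataFrame) -> DataFrame:
        metadata = dataset.schema[self.input_col].metadata
        # Vector attribute names live in the column metadata, grouped
        # by attribute type ("numeric", "binary", "nominal").
        for group in metadata.get("ml_attr", {}).get("attrs", {}).values():
            for attribute in group:
                if "name" in attribute:
                    for target in self.strings_to_replace:
                        attribute["name"] = attribute["name"].replace(
                            target, self.replace_with
                        )
        # withMetadata() propagates the change to the JVM so that it
        # persists through a Pipeline (see the commit notes below).
        return dataset.withMetadata(self.input_col, metadata)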
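
For item 2, the two libraries expose these metrics through different APIs. Roughly (the importance-type names come from each library's documentation; hlink's exact calls may differ):

# XGBoost: the fitted model returns a dict keyed by feature name.
weight = xgboost_model.get_feature_importances(importance_type="weight")
gain = xgboost_model.get_feature_importances(importance_type="total_gain")

# LightGBM: the fitted model returns a list of scores in feature order;
# its "split" importance type corresponds to weight.
weight = lightgbm_model.getFeatureImportances("split")
gain = lightgbm_model.getFeatureImportances("gain")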
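
For item 3, the conditional install amounts to something like the following. The Maven coordinates and version are illustrative, not hlink's pinned ones.

try:
    import synapse.ml  # noqa: F401
except ModuleNotFoundError:
    _synapseml_available = False
else:
    _synapseml_available = True

def _install_lightgbm_jars(spark_session):
    if _synapseml_available:
        # ADD JAR with an ivy:// URI resolves the package plus its
        # transitive dependencies and makes them visible to the driver
        # and the executors.
        spark_session.sql("ADD JAR ivy://com.microsoft.azure:synapseml_2.12:1.0.8")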

To Do Items (Nice to Have)

  • See if we can reduce the verbosity of the Spark logs when Spark installs the additional LightGBM Scala dependencies.
  • See if we can make the verbosity of XGBoost and LightGBM configurable from the hlink config file.
  • Rewrite the introduction to the Models Sphinx docs page to include information pertinent to XGBoost and LightGBM.

riley-harper and others added 30 commits November 14, 2024 15:11
This test is currently failing if you have xgboost installed. If you don't have
xgboost installed, it skips itself to prevent failures due to missing packages
and dependencies.
…ype xgboost

This is only possible when we have the xgboost module, so raise an error if
that is not present.
…odel

This test is failing right now because we also need pyarrow>=4 when using
xgboost. We should add this as a dependency in the xgboost extra. If xgboost
isn't installed, this test skips itself.
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
I've also updated pytest to be more verbose for clarity.
Like some of the other models, xgboost returns an array of probabilities like
[probability_no, probability_yes]. So we extract just probability_yes as our
probability for hlink purposes.
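
A sketch of that extraction with PySpark's vector_to_array (not necessarily the exact hlink code):

from pyspark.ml.functions import vector_to_array

# The probability column is a vector [probability_no, probability_yes];
# index 1 is the positive-class probability.
predictions = predictions.withColumn(
    "probability", vector_to_array("probability")[1]
)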
xgboost has a different setup for feature importances, so the current logic
ignores it. We'll need to update the save model metadata step to include logic
specifically for xgboost.
This is really different from the Spark models, so I've made it a special case
instead of trying to integrate it with the previous logic closely. This section
might be due for some refactoring now.
This also updates Alabaster to 1.0.0.
This installs SynapseML, Microsoft's Apache Spark integrations package. It has
a synapse.ml.lightgbm module which we can use for LightGBM-PySpark integration.
…tras

This should let us have two different test setups for each Python version. One
with xgboost, one without.
One of these is failing because there's a bug where LightGBM throws an error on
interacted features.
Usually we don't care about the names of the vector attributes. But LightGBM
uses them as feature names and disallows some characters in the names.
Unfortunately, one of these characters is :, and Spark's Interaction names the
output of an interaction between A and B "A:B". I looked through the Spark code
and didn't see any way to configure the names of these output features. So I
think the easiest way forward here is to make a transformer that renames the
attributes of a vector by removing some characters and replacing them with
another.
The bug was that we didn't propagate the metadata changes into Java, so they
weren't persistent in something like a Pipeline. By calling withMetadata(), we
should now be persisting our changes correctly.
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
We now compute two feature importances for each model.

- weight: the number of splits that each feature causes
- gain: the total gain across all of each feature's splits
I'm still not entirely happy with this, but it's a tricky point in the code
because most of the models behave one way, but xgboost and lightgbm are
different. Some more refactoring might be in order.
This should hopefully let executors find the jars as well as the driver. I've
added some comments because this is a bit gnarly.
This is a resource directory for the Metals Scala LSP provider.
@riley-harper riley-harper requested a review from ccdavis November 25, 2024 17:33
- It turns out that multi-line TOML tables aren't allowed. So let's use the
  [training.chosen_model] syntax instead.
- I clarified the introductory information and made it general enough to apply
  to XGBoost and LightGBM as well.
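
For example (with illustrative model parameters), an inline TOML table can't span lines, but the table-header form can hold any number of settings:

# Not valid TOML: inline tables must fit on one line.
# chosen_model = { type = "xgboost",
#                  max_depth = 5 }

[training.chosen_model]
type = "xgboost"
max_depth = 5
eta = 0.5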
The Spark Bucketizer adds commas to vector slot names, which cause
problems with LightGBM later in the pipeline. This is similar to the
issue with colons for Interaction, but the metadata for bucketized
vectors is a little bit different. So RenameVectorAttributes needed to
change a bit to handle the two different forms of metadata.
Generally clean up some small mistakes. I also added a comment to the
logic that removes the commas in core/pipeline.py.
@riley-harper riley-harper merged commit c52d835 into main Dec 4, 2024
6 checks passed
@riley-harper riley-harper deleted the add_new_ml_algs branch December 4, 2024 19:40