[#161, #162] Merge branch 'add_xgboost' into add_new_ml_algs
This merges the xgboost and lightgbm branches together. There were several
files with conflicts. Most of the conflicts I resolved by keeping the work from
both branches.
riley-harper committed Nov 21, 2024
2 parents 444c6a7 + ab1d83a commit aae00f6
Showing 30 changed files with 636 additions and 318 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/docker-build.yml
@@ -12,7 +12,7 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11", "3.12"]
hlink_extras: ["dev", "dev,lightgbm"]
hlink_extras: ["dev", "dev,lightgbm,xgboost"]
runs-on: ubuntu-latest

steps:
@@ -33,7 +33,7 @@ jobs:
run: docker run $HLINK_TAG-${{ matrix.python_version}} flake8 --count .

- name: Test
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest
run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest -ra

- name: Build sdist and wheel
run: docker run $HLINK_TAG-${{ matrix.python_version}} python -m build
45 changes: 39 additions & 6 deletions README.md
@@ -26,19 +26,52 @@ We do our best to make hlink compatible with Python 3.10-3.12. If you have a
problem using hlink on one of these versions of Python, please open an issue
through GitHub. Versions of Python older than 3.10 are not supported.

Note that pyspark 3.5 does not yet officially support Python 3.12. If you
encounter pyspark-related import errors while running hlink on Python 3.12, try
Note that PySpark 3.5 does not yet officially support Python 3.12. If you
encounter PySpark-related import errors while running hlink on Python 3.12, try

- Installing the setuptools package. The distutils package was deleted from the
standard library in Python 3.12, but some versions of pyspark still import
standard library in Python 3.12, but some versions of PySpark still import
it. The setuptools package provides a hacky stand-in distutils library which
should fix some import errors in pyspark. We install setuptools in our
should fix some import errors in PySpark. We install setuptools in our
development and test dependencies so that our tests work on Python 3.12.

- Downgrading Python to 3.10 or 3.11. Pyspark officially supports these
versions of Python. So you should have better chances getting pyspark to work
- Downgrading Python to 3.10 or 3.11. PySpark officially supports these
versions of Python. So you should have better chances getting PySpark to work
well on Python 3.10 or 3.11.

### XGBoost Support

[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) is a highly
performant gradient boosting machine learning library. hlink includes optional
support for XGBoost through the xgboost Python package. This support is
experimental and may change since the XGBoost-PySpark integration provided by
the xgboost package is currently unstable.

To install the xgboost package and its Python dependencies, run `pip install
hlink[xgboost]`. This may be enough to get xgboost running on some machines. If
you run into further errors, you might need to install the libomp package,
which xgboost requires.

After installing xgboost, you can use it as a model type in training and model
exploration. xgboost has a large list of available parameters, which you can
check out [here](https://xgboost.readthedocs.io/en/latest/parameter.html).
hlink passes parameters defined in your config file through to the xgboost
library.

```toml
# max_depth, eta, and gamma are parameters for xgboost. threshold and
# threshold_ratio are hlink-specific configurations universal to all model types.
chosen_model = {
type = "xgboost",
max_depth = 5,
eta = 0.5,
gamma = 0.05,
threshold = 0.5,
threshold_ratio = 2.0
}
```
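
To make the split between hlink-specific keys and pass-through parameters concrete, here is a minimal Python sketch. The names `HLINK_KEYS` and `split_model_config` are hypothetical and are not part of hlink's API; hlink's actual handling of the config may differ.

```python
# Hypothetical illustration only: these names are not part of hlink.
# Keys assumed to be consumed by hlink itself rather than by the model library.
HLINK_KEYS = {"type", "threshold", "threshold_ratio"}


def split_model_config(chosen_model):
    """Separate hlink-specific settings from parameters passed to the model."""
    hlink_settings = {k: v for k, v in chosen_model.items() if k in HLINK_KEYS}
    model_params = {k: v for k, v in chosen_model.items() if k not in HLINK_KEYS}
    return hlink_settings, model_params


# The same values as the chosen_model table above, written as a Python dict.
chosen_model = {
    "type": "xgboost",
    "max_depth": 5,
    "eta": 0.5,
    "gamma": 0.05,
    "threshold": 0.5,
    "threshold_ratio": 2.0,
}

hlink_settings, xgboost_params = split_model_config(chosen_model)
print("hlink settings:", hlink_settings)
print("passed to xgboost:", xgboost_params)
```

Under this assumption, only `max_depth`, `eta`, and `gamma` would reach the xgboost library.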


## Docs

The documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).
35 changes: 35 additions & 0 deletions docs/_sources/models.md.txt
@@ -87,3 +87,38 @@ chosen_model = {
threshold_ratio = 1.3
}
```

## xgboost

*Added in version 3.8.0.*

This is an alternate, high-performance implementation of gradient boosting.
It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier).
Since the XGBoost-PySpark integration which the xgboost Python package provides
is currently unstable, support for the xgboost model type is disabled in hlink
by default. hlink will stop with an error if you try to use this model type
without enabling support for it. To enable support for xgboost, install hlink
with the `xgboost` extra.

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of xgboost. xgboost should raise a helpful
error if it detects that you need to install libomp.
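
One common way to guard an optional dependency like this is to check whether the package is importable before using it. The sketch below shows that general pattern with the standard library; `xgboost_available` and `check_model_type` are hypothetical names, and this is not necessarily how hlink implements its own check.

```python
import importlib.util


def xgboost_available() -> bool:
    """Return True if the xgboost package can be found in this environment."""
    return importlib.util.find_spec("xgboost") is not None


def check_model_type(model_type: str) -> None:
    """Raise a descriptive error if xgboost is requested but not installed."""
    if model_type == "xgboost" and not xgboost_available():
        raise ModuleNotFoundError(
            "The xgboost model type requires the xgboost package. "
            "Try: pip install hlink[xgboost]"
        )


# Model types that do not need the optional package always pass the check.
check_model_type("random_forest")
print("xgboost available:", xgboost_available())
```

Checking up front lets a program fail early with an actionable message instead of hitting an import error partway through a run.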

You can view a list of xgboost's parameters
[here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```
chosen_model = {
type = "xgboost",
max_depth = 5,
eta = 0.5,
gamma = 0.05,
threshold = 0.8,
threshold_ratio = 1.5
}
```
115 changes: 35 additions & 80 deletions docs/_static/alabaster.css
@@ -1,5 +1,3 @@
@import url("basic.css");

/* -- page layout ----------------------------------------------------------- */

body {
@@ -160,8 +158,8 @@ div.sphinxsidebar input {
font-size: 1em;
}

div.sphinxsidebar #searchbox input[type="text"] {
width: 160px;
div.sphinxsidebar #searchbox {
margin: 1em 0;
}

div.sphinxsidebar .search > div {
@@ -263,10 +261,6 @@ div.admonition p.last {
margin-bottom: 0;
}

div.highlight {
background-color: #fff;
}

dt:target, .highlight {
background: #FAF3E8;
}
@@ -454,7 +448,7 @@ ul, ol {
}

pre {
background: #EEE;
background: unset;
padding: 7px 30px;
margin: 15px 0px;
line-height: 1.3em;
@@ -485,15 +479,15 @@ a.reference {
border-bottom: 1px dotted #004B6B;
}

a.reference:hover {
border-bottom: 1px solid #6D4100;
}

/* Don't put an underline on images */
a.image-reference, a.image-reference:hover {
border-bottom: none;
}

a.reference:hover {
border-bottom: 1px solid #6D4100;
}

a.footnote-reference {
text-decoration: none;
font-size: 0.7em;
@@ -509,68 +503,7 @@ a:hover tt, a:hover code {
background: #EEE;
}


@media screen and (max-width: 870px) {

div.sphinxsidebar {
display: none;
}

div.document {
width: 100%;

}

div.documentwrapper {
margin-left: 0;
margin-top: 0;
margin-right: 0;
margin-bottom: 0;
}

div.bodywrapper {
margin-top: 0;
margin-right: 0;
margin-bottom: 0;
margin-left: 0;
}

ul {
margin-left: 0;
}

li > ul {
/* Matches the 30px from the "ul, ol" selector above */
margin-left: 30px;
}

.document {
width: auto;
}

.footer {
width: auto;
}

.bodywrapper {
margin: 0;
}

.footer {
width: auto;
}

.github {
display: none;
}



}



@media screen and (max-width: 875px) {
@media screen and (max-width: 940px) {

body {
margin: 0;
@@ -580,12 +513,16 @@ a:hover tt, a:hover code {
div.documentwrapper {
float: none;
background: #fff;
margin-left: 0;
margin-top: 0;
margin-right: 0;
margin-bottom: 0;
}

div.sphinxsidebar {
display: block;
float: none;
width: 102.5%;
width: unset;
margin: 50px -30px -20px -30px;
padding: 10px 20px;
background: #333;
@@ -620,8 +557,14 @@ a:hover tt, a:hover code {

div.body {
min-height: 0;
min-width: auto; /* fixes width on small screens, breaks .hll */
padding: 0;
}

.hll {
/* "fixes" the breakage */
width: max-content;
}

.rtd_doc_footer {
display: none;
@@ -635,13 +578,18 @@ a:hover tt, a:hover code {
width: auto;
}

.footer {
width: auto;
}

.github {
display: none;
}

ul {
margin-left: 0;
}

li > ul {
/* Matches the 30px from the "ul, ol" selector above */
margin-left: 30px;
}
}


@@ -705,4 +653,11 @@ nav#breadcrumbs li+li:before {
div.related {
display: none;
}
}

img.github {
position: absolute;
top: 0;
border: 0;
right: 0;
}
5 changes: 5 additions & 0 deletions docs/_static/github-banner.svg
26 changes: 13 additions & 13 deletions docs/column_mappings.html
@@ -7,7 +7,8 @@

<title>Column Mappings &#8212; hlink 3.7.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=12dfc556" />
<link rel="stylesheet" type="text/css" href="_static/basic.css?v=686e5160" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=27fed22d" />
<script src="_static/documentation_options.js?v=229cbe3b"></script>
<script src="_static/doctools.js?v=9bcbadda"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
@@ -369,7 +370,16 @@ <h1 class="logo"><a href="index.html">hlink</a></h1>



<h3>Navigation</h3>

<search id="searchbox" style="display: none" role="search">
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false" placeholder="Search"/>
<input type="submit" value="Go" />
</form>
</div>
</search>
<script>document.getElementById('searchbox').style.display = "block"</script><h3>Navigation</h3>
<ul>
<li class="toctree-l1"><a class="reference internal" href="introduction.html">Introduction</a></li>
<li class="toctree-l1"><a class="reference internal" href="installation.html">Installation</a></li>
@@ -403,16 +413,6 @@ <h3>Related Topics</h3>
</ul></li>
</ul>
</div>
<search id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" autocomplete="off" autocorrect="off" autocapitalize="off" spellcheck="false"/>
<input type="submit" value="Go" />
</form>
</div>
</search>
<script>document.getElementById('searchbox').style.display = "block"</script>



@@ -430,7 +430,7 @@ <h3 id="searchlabel">Quick search</h3>

|
Powered by <a href="https://www.sphinx-doc.org/">Sphinx 8.1.3</a>
&amp; <a href="https://alabaster.readthedocs.io">Alabaster 0.7.16</a>
&amp; <a href="https://alabaster.readthedocs.io">Alabaster 1.0.0</a>

|
<a href="_sources/column_mappings.md.txt"