Merge branch 'main' into v4-dev
riley-harper committed Dec 4, 2024
2 parents 7e7baa0 + 85a1818, commit 9542800
Showing 38 changed files with 1,419 additions and 216 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/docker-build.yml
@@ -17,12 +17,13 @@ jobs:
fail-fast: false
matrix:
python_version: ["3.10", "3.11", "3.12"]
+ hlink_extras: ["dev", "dev,lightgbm,xgboost"]
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
- name: Build the Docker image
- run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }}
+ run: docker build . --file Dockerfile --tag $HLINK_TAG-${{ matrix.python_version}} --build-arg PYTHON_VERSION=${{ matrix.python_version }} --build-arg HLINK_EXTRAS=${{ matrix.hlink_extras }}

- name: Check dependency versions
run: |
@@ -34,7 +35,7 @@ jobs:
run: docker run $HLINK_TAG-${{ matrix.python_version}} black --check .

- name: Test
- run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest
+ run: docker run $HLINK_TAG-${{ matrix.python_version}} pytest -ra

- name: Build sdist and wheel
run: docker run $HLINK_TAG-${{ matrix.python_version}} python -m build
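The updated workflow builds one image for each combination of Python version and extras set (six jobs in total). Any matrix cell can be reproduced locally with the same commands the workflow runs; the `hlink-test` tag below is just an illustrative name:

```
docker build . --file Dockerfile --tag hlink-test \
  --build-arg PYTHON_VERSION=3.12 \
  --build-arg HLINK_EXTRAS=dev,lightgbm,xgboost
docker run hlink-test pytest -ra
```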
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ scala_jar/target
scala_jar/project/target
*.class
*.cache
+ .metals/

# MacOS
.DS_Store
3 changes: 2 additions & 1 deletion Dockerfile
@@ -1,5 +1,6 @@
ARG PYTHON_VERSION=3.10
FROM python:${PYTHON_VERSION}
+ ARG HLINK_EXTRAS=dev

RUN apt-get update && apt-get install default-jre-headless -y

@@ -8,4 +9,4 @@ WORKDIR /hlink

COPY . .
RUN python -m pip install --upgrade pip
- RUN pip install -e .[dev]
+ RUN pip install -e .[${HLINK_EXTRAS}]
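With the default `HLINK_EXTRAS=dev`, the image installs hlink exactly as before. Building with `--build-arg HLINK_EXTRAS=dev,lightgbm,xgboost` makes the final `RUN` line equivalent to the editable install you would run outside Docker (quoted here because some shells, such as zsh, expand square brackets):

```
pip install -e '.[dev,lightgbm,xgboost]'
```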
49 changes: 43 additions & 6 deletions README.md
@@ -26,19 +26,56 @@ We do our best to make hlink compatible with Python 3.10-3.12. If you have a
problem using hlink on one of these versions of Python, please open an issue
through GitHub. Versions of Python older than 3.10 are not supported.

- Note that pyspark 3.5 does not yet officially support Python 3.12. If you
- encounter pyspark-related import errors while running hlink on Python 3.12, try
+ Note that PySpark 3.5 does not yet officially support Python 3.12. If you
+ encounter PySpark-related import errors while running hlink on Python 3.12, try

- Installing the setuptools package. The distutils package was deleted from the
- standard library in Python 3.12, but some versions of pyspark still import
+ standard library in Python 3.12, but some versions of PySpark still import
it. The setuptools package provides a hacky stand-in distutils library which
- should fix some import errors in pyspark. We install setuptools in our
+ should fix some import errors in PySpark. We install setuptools in our
development and test dependencies so that our tests work on Python 3.12.

- - Downgrading Python to 3.10 or 3.11. Pyspark officially supports these
-   versions of Python. So you should have better chances getting pyspark to work
+ - Downgrading Python to 3.10 or 3.11. PySpark officially supports these
+   versions of Python. So you should have better chances getting PySpark to work
well on Python 3.10 or 3.11.
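The first workaround amounts to a one-line install in the environment where hlink runs:

```
pip install setuptools
```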

### Additional Machine Learning Algorithms

hlink has optional support for two additional machine learning algorithms,
[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) and
[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). Both of these
algorithms are highly performant gradient boosting libraries, each with its own
characteristics. These algorithms are not implemented directly in Spark, so
they require some additional dependencies. To install the required Python
dependencies, run

```
pip install hlink[xgboost]
```

for XGBoost or

```
pip install hlink[lightgbm]
```

for LightGBM. If you would like to install both at once, you can run

```
pip install hlink[xgboost,lightgbm]
```

to get the Python dependencies for both. Both XGBoost and LightGBM also require
libomp, which will need to be installed separately if you don't already have it.
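How you install libomp depends on your platform. On macOS with Homebrew, for example, it is typically:

```
brew install libomp
```

On Debian and Ubuntu, the package is usually named `libomp-dev`.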

After installing the dependencies for one or both of these algorithms, you can
use them as model types in training and model exploration. You can read more
about these models in the hlink documentation [here](https://hlink.docs.ipums.org/models.html).
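For example, once the `xgboost` extra is installed, an XGBoost model can be selected in the configuration just like a built-in Spark model (the parameter values here are illustrative):

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
```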

*Note: The XGBoost-PySpark integration provided by the xgboost Python package is
currently unstable. So the hlink xgboost support is experimental and may change
in the future.*

## Docs

The documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).
2 changes: 1 addition & 1 deletion docs/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
- config: a706061ae4b2d0ec765440a2505ca382
+ config: 3d084ea912736a6c4043e49bc2b58167
tags: 645f666f9bcd5a90fca523b33c5a78b7
4 changes: 2 additions & 2 deletions docs/.buildinfo.bak
@@ -1,4 +1,4 @@
# Sphinx build info version 1
- # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
- config: de74adeb0864eb6d8e73600964a3e52d
+ # This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
+ config: a706061ae4b2d0ec765440a2505ca382
tags: 645f666f9bcd5a90fca523b33c5a78b7
195 changes: 152 additions & 43 deletions docs/_sources/models.md.txt
@@ -1,53 +1,80 @@
# Models

- These are models available to be used in the model evaluation, training, and household training link tasks.

- * Attributes for all models:
-   * `threshold` -- Type: `float`. Alpha threshold (model hyperparameter).
-   * `threshold_ratio` -- Type: `float`. Beta threshold (de-duplication distance ratio).
-   * Any parameters available in the model as defined in the Spark documentation can be passed as params using the label given in the Spark docs. Commonly used parameters are listed below with descriptive explanations from the Spark docs.
These are the machine learning models available for use in the model evaluation
and training tasks and in their household counterparts.

There are a few attributes available for all models.

* `type` -- Type: `string`. The name of the model type. The available model
types are listed below.
* `threshold` -- Type: `float`. The "alpha threshold". This is the probability
score required for a potential match to be labeled a match. `0 ≤ threshold ≤
1`.
* `threshold_ratio` -- Type: `float`. The threshold ratio or "beta threshold".
This applies to records which have multiple potential matches when
`training.decision` is set to `"drop_duplicate_with_threshold_ratio"`. For
each record, only potential matches which have the highest probability, have
a probability of at least `threshold`, *and* whose probabilities are at least
`threshold_ratio` times larger than the second-highest probability are
matches. This is sometimes called the "de-duplication distance ratio". `1 ≤
threshold_ratio < ∞`.
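As a concrete illustration of the `threshold_ratio` rule: with `threshold = 0.5` and `threshold_ratio = 1.2`, a record whose two best candidates score 0.9 and 0.7 keeps the 0.9 candidate, since 0.9 ≥ 0.5 and 0.9 ≥ 1.2 × 0.7 = 0.84. Here is a minimal Python sketch of that decision for a single record; it illustrates the rule described above and is not hlink's actual implementation:

```python
def keeps_best_candidate(probabilities, threshold, threshold_ratio):
    """Return True if the highest-scoring candidate should be a match
    under the "drop_duplicate_with_threshold_ratio" decision."""
    if not probabilities:
        return False
    ordered = sorted(probabilities, reverse=True)
    best = ordered[0]
    # Assumption in this sketch: with one candidate, only `threshold` applies.
    second = ordered[1] if len(ordered) > 1 else 0.0
    return best >= threshold and best >= threshold_ratio * second

print(keeps_best_candidate([0.9, 0.7], 0.5, 1.2))  # True
print(keeps_best_candidate([0.9, 0.8], 0.5, 1.2))  # False: 0.9 < 1.2 * 0.8
```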

In addition, any model parameters documented in a model type's Spark
documentation can be passed as parameters to the model through hlink's
`training.chosen_model` and `training.model_exploration` configuration
sections.

Here is an example `training.chosen_model` configuration. The `type`,
`threshold`, and `threshold_ratio` attributes are hlink specific. `maxDepth` is
a parameter to the random forest model which hlink passes through to the
underlying Spark classifier.

```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.2
threshold_ratio = 1.2
maxDepth = 5
```

## random_forest

- Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html). Returns probability as an array.
+ Uses [pyspark.ml.classification.RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html).
* Parameters:
* `maxDepth` -- Type: `int`. Maximum depth of the tree. Spark default value is 5.
* `numTrees` -- Type: `int`. The number of trees to train. Spark default value is 20, must be >= 1.
* `featureSubsetStrategy` -- Type: `string`. Per the Spark docs: "The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]."

- ```
- model_parameters = {
-   type = "random_forest",
-   maxDepth = 5,
-   numTrees = 75,
-   featureSubsetStrategy = "sqrt",
-   threshold = 0.15,
-   threshold_ratio = 1.0
- }
```toml
[training.chosen_model]
type = "random_forest"
threshold = 0.15
threshold_ratio = 1.0
maxDepth = 5
numTrees = 75
featureSubsetStrategy = "sqrt"
```

## probit

Uses [pyspark.ml.regression.GeneralizedLinearRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.GeneralizedLinearRegression.html) with `family="binomial"` and `link="probit"`.

- ```
- model_parameters = {
-   type = "probit",
-   threshold = 0.85,
-   threshold_ratio = 1.2
- }
```toml
[training.chosen_model]
type = "probit"
threshold = 0.85
threshold_ratio = 1.2
```

## logistic_regression

Uses [pyspark.ml.classification.LogisticRegression](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html)

- ```
- chosen_model = {
-   type = "logistic_regression",
-   threshold = 0.5,
-   threshold_ratio = 1.0
- }
```toml
[training.chosen_model]
type = "logistic_regression"
threshold = 0.5
threshold_ratio = 1.0
```

## decision_tree
@@ -59,13 +86,14 @@ Uses [pyspark.ml.classification.DecisionTreeClassifier](https://spark.apache.org
* `minInstancesPerNode` -- Type `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

- ```
- chosen_model = {
-   type = "decision_tree",
-   maxDepth = 6,
-   minInstancesPerNode = 2,
-   maxBins = 4
- }
```toml
[training.chosen_model]
type = "decision_tree"
threshold = 0.5
threshold_ratio = 1.5
maxDepth = 6
minInstancesPerNode = 2
maxBins = 4
```

## gradient_boosted_trees
@@ -77,13 +105,94 @@ Uses [pyspark.ml.classification.GBTClassifier](https://spark.apache.org
* `minInstancesPerNode` -- Type `int`. Per the Spark docs: "Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1."
* `maxBins` -- Type: `int`. Per the Spark docs: "Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature."

```toml
[training.chosen_model]
type = "gradient_boosted_trees"
threshold = 0.7
threshold_ratio = 1.3
maxDepth = 4
minInstancesPerNode = 1
maxBins = 6
```

## xgboost

*Added in version 3.8.0.*

XGBoost is an alternate, high-performance implementation of gradient boosting.
It uses [xgboost.spark.SparkXGBClassifier](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.spark.SparkXGBClassifier).
Since the XGBoost-PySpark integration which the xgboost Python package provides
is currently unstable, support for the xgboost model type is disabled in hlink
by default. hlink will stop with an error if you try to use this model type
without enabling support for it. To enable support for xgboost, install hlink
with the `xgboost` extra.

- ```
- chosen_model = {
-   type = "gradient_boosted_trees",
-   maxDepth = 4,
-   minInstancesPerNode = 1,
-   maxBins = 6,
-   threshold = 0.7,
-   threshold_ratio = 1.3
- }
- ```

```
pip install hlink[xgboost]
```

This installs the xgboost package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of xgboost. xgboost should raise a helpful
error if it detects that you need to install libomp.

You can view a list of xgboost's parameters
[here](https://xgboost.readthedocs.io/en/latest/parameter.html).

```toml
[training.chosen_model]
type = "xgboost"
threshold = 0.8
threshold_ratio = 1.5
max_depth = 5
eta = 0.5
gamma = 0.05
```

## lightgbm

*Added in version 3.8.0.*

LightGBM is another alternate, high-performance implementation of gradient
boosting. It uses
[synapse.ml.lightgbm.LightGBMClassifier](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
`synapse.ml` is a library which provides various integrations with PySpark,
including integrations between the C++ LightGBM library and PySpark.

LightGBM requires some additional Scala libraries that hlink does not usually
install, so support for the lightgbm model is disabled in hlink by default.
hlink will stop with an error if you try to use this model type without
enabling support for it. To enable support for lightgbm, install hlink with the
`lightgbm` extra.

```
pip install hlink[lightgbm]
```

This installs the lightgbm package and its Python dependencies. Depending on
your machine and operating system, you may also need to install the libomp
library, which is another dependency of lightgbm. If you encounter errors when
training a lightgbm model, please try installing libomp if you do not have it
installed.

lightgbm has an enormous number of available parameters. Many of these are
available as normal in hlink, via the [LightGBMClassifier
class](https://mmlspark.blob.core.windows.net/docs/1.0.8/pyspark/synapse.ml.lightgbm.html#module-synapse.ml.lightgbm.LightGBMClassifier).
Others are available through the special `passThroughArgs` parameter, which
passes additional parameters through to the C++ library. You can see a full
list of the supported parameters
[here](https://lightgbm.readthedocs.io/en/latest/Parameters.html).

```toml
[training.chosen_model]
type = "lightgbm"
# hlink's threshold and threshold_ratio
threshold = 0.8
threshold_ratio = 1.5
# LightGBMClassifier supports these parameters (and many more).
maxDepth = 5
learningRate = 0.5
# LightGBMClassifier does not directly support this parameter,
# so we have to send it to the C++ library with passThroughArgs.
passThroughArgs = "force_row_wise=true"
```
2 changes: 1 addition & 1 deletion docs/_static/documentation_options.js
@@ -1,5 +1,5 @@
const DOCUMENTATION_OPTIONS = {
- VERSION: '3.7.0',
+ VERSION: '3.8.0',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',
4 changes: 2 additions & 2 deletions docs/column_mappings.html
@@ -5,11 +5,11 @@
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

- <title>Column Mappings &#8212; hlink 3.7.0 documentation</title>
+ <title>Column Mappings &#8212; hlink 3.8.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" />
<link rel="stylesheet" type="text/css" href="_static/basic.css?v=686e5160" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=27fed22d" />
- <script src="_static/documentation_options.js?v=229cbe3b"></script>
+ <script src="_static/documentation_options.js?v=948f11bf"></script>
<script src="_static/doctools.js?v=9bcbadda"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<link rel="index" title="Index" href="genindex.html" />
4 changes: 2 additions & 2 deletions docs/comparison_features.html
@@ -5,11 +5,11 @@
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />

- <title>Comparison Features &#8212; hlink 3.7.0 documentation</title>
+ <title>Comparison Features &#8212; hlink 3.8.0 documentation</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css?v=d1102ebc" />
<link rel="stylesheet" type="text/css" href="_static/basic.css?v=686e5160" />
<link rel="stylesheet" type="text/css" href="_static/alabaster.css?v=27fed22d" />
- <script src="_static/documentation_options.js?v=229cbe3b"></script>
+ <script src="_static/documentation_options.js?v=948f11bf"></script>
<script src="_static/doctools.js?v=9bcbadda"></script>
<script src="_static/sphinx_highlight.js?v=dc90522c"></script>
<link rel="index" title="Index" href="genindex.html" />
(The remaining 28 changed files are not shown here.)
