
Releases: ipums/hlink

v4.0.0a1

13 Dec 22:10
Pre-release

Version 4.0.0 Alpha 1

This pre-release previews upcoming changes for version 4 of hlink. Because it includes breaking changes and an overhaul of the model exploration task, we'd like to test it out before creating a full release. Documentation and code cleanup are still in progress, so the documentation for these changes and new features is incomplete. Here is a preview of the version 4 highlights so far:

  • Completely overhauled the model exploration task, switching to a nested cross-validation algorithm.
  • Added support for a third strategy for generating models to test in model exploration. Along with "explicit" (take exactly what's in training.model_parameters) and grid search, there is now randomized search. Randomized search takes a certain number of samples from a distribution defined in training.model_parameters.
  • Added the F-measure metric to the model exploration output, and simplified the output so that it always has the same columns.
  • Removed the training.output_suspicious_TD configuration option because it was rarely used and presented code and performance issues. Removing output_suspicious_TD makes the model exploration code more maintainable and helps it run more quickly.
  • Disentangled two core modules (classifier and pipeline) from the configuration format by changing the arguments to a couple of functions. This should help separate those concerns more neatly and make changes to the configuration easier if we end up doing that in the future.
  • Changed SparkConnection to require a checkpoint_dir argument, which fixes a bug related to Spark configuration.
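
To illustrate the randomized search strategy described above, here is a minimal sketch of drawing a fixed number of parameter sets from a distribution. It is not hlink's implementation: the function name `sample_parameter_sets`, the convention that a list means "choose one" and a `(low, high)` tuple means "uniform float range", and the parameter names are all assumptions made for this example, not the training.model_parameters format.

```python
import random


def sample_parameter_sets(distributions, num_samples, seed=None):
    """Draw num_samples parameter dicts from the given distributions.

    Hypothetical conventions (not hlink's configuration format):
    a list is a set of choices, a (low, high) tuple is a uniform range.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        params = {}
        for name, spec in distributions.items():
            if isinstance(spec, tuple):
                low, high = spec
                params[name] = rng.uniform(low, high)
            else:
                params[name] = rng.choice(spec)
        samples.append(params)
    return samples


# Hypothetical parameter names, chosen only for illustration.
candidates = sample_parameter_sets(
    {"maxDepth": [3, 5, 7], "threshold": (0.5, 0.9)},
    num_samples=4,
    seed=0,
)
for params in candidates:
    print(params)
```

Unlike grid search, the number of models tested here is set directly by `num_samples` rather than by the size of the parameter grid.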

v3.8.0

04 Dec 20:36
85a1818

What's Changed

  • Added optional support for two new gradient boosting ML libraries: XGBoost and LightGBM. You can read more about these libraries and how to install them with their dependencies in the docs. PR #165
  • Added a new hlink.linking.transformers.RenameVectorAttributes transformer which can rename the attributes or "slots" of Spark vector columns. Hlink uses this to support LightGBM, which disallows certain characters in its feature names. PR #165
  • Documented comparisons, which are not the same as comparison features. Previously the documentation was misleading and seemed to indicate that these were the same thing. PR #159
  • Fixed a bug in the substitution file documentation. The documentation had the meanings of the substitution file columns swapped, which was confusing. PR #166

Developer-Facing Changes

  • Updated Sphinx to 8.1.3 and fixed two Sphinx build warnings. PR #159
  • Updated CI/CD to automatically run only on PRs and on pushes to main. You can also now manually trigger a CI/CD run from the Actions tab in GitHub. Also removed the custom "quickcheck" pytest marker in favor of using pytest -k and removed flake8 from CI/CD because it kept causing more trouble than it was worth. PR #164

Full Changelog: v3.7.0...v3.8.0

v3.7.0

10 Oct 17:16
c1713e5

What's Changed

  • Add tests to cover several untested sections of code by @riley-harper in #147
  • Refactor core.transforms.generate_transforms() for readability and maintainability; improve documentation and type hints by @riley-harper in #148
  • Fix tests for Python 3.12 and clarify Python 3.12 support and dependence on PySpark by @riley-harper in #151
  • Improve logging by writing to module-level loggers instead of the root logger by @riley-harper in #152
  • Support setting the app name via an optional argument in SparkConnection by @riley-harper in #156. The default behavior of setting the app name to "linking" is unchanged.
  • Improve model_exploration step 2 terminal output, logging, and documentation to make the step more understandable by @riley-harper in #155

Full Changelog: v3.6.1...v3.7.0

v3.6.1

14 Aug 21:26
54d4820

What's Changed

  • Support blocking sections with multiple exploded columns by @riley-harper in #143. This fixes a bug that caused a crash in Matching step 0 - explode.

Full Changelog: v3.6.0...v3.6.1

v3.6.0

18 Jun 20:12
94f0e8b
Compare
Choose a tag to compare

What's Changed

  • Support OR conditions in blocking by @riley-harper in #138. This new feature supports connecting some or all blocking conditions together with ORs instead of with ANDs. You can read more about it under the "or_group" bullet point in the documentation.
  • Unskip several skipped tests by @riley-harper in #139. This is a development change that should not affect users.
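
As a rough illustration of how OR groups could combine blocking conditions, the sketch below joins conditions that share an or_group with OR and joins the groups themselves with AND. Only the "or_group" name comes from the notes above. The function `build_blocking_condition`, the "column_name" key, the `a`/`b` table aliases, and the generated SQL-like text are assumptions for this example and may differ from what hlink actually does.

```python
def build_blocking_condition(blocking):
    """Combine blocking entries into one SQL-like condition string.

    Entries with the same "or_group" are ORed together. Entries with no
    "or_group" each form their own group, so they are ANDed as before.
    """
    groups = {}
    for index, entry in enumerate(blocking):
        # Hypothetical default: a unique group per ungrouped entry.
        key = entry.get("or_group", f"_ungrouped_{index}")
        column = entry["column_name"]
        groups.setdefault(key, []).append(f"a.{column} = b.{column}")
    clauses = ["(" + " OR ".join(conds) + ")" for conds in groups.values()]
    return " AND ".join(clauses)


# Hypothetical blocking entries, for illustration only.
blocking = [
    {"column_name": "sex"},
    {"column_name": "birthyr", "or_group": "age"},
    {"column_name": "bpl", "or_group": "age"},
]
print(build_blocking_condition(blocking))
# prints: (a.sex = b.sex) AND (a.birthyr = b.birthyr OR a.bpl = b.bpl)
```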

Full Changelog: v3.5.5...v3.6.0

v3.5.5

31 May 17:30
bd69a9e

What's Changed

  • Support a variable number of columns in the array feature selection transform by @riley-harper in #135

Full Changelog: v3.5.4...v3.5.5

v3.5.4

20 Feb 20:20
da9db20

What's Changed

Full Changelog: v3.5.3...v3.5.4

v3.5.3

02 Nov 15:58
c0f0619

Highlights

In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.

What's Changed

  • Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
  • Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
  • Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
  • Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies allows us to easily support Python 3.12, which we now support and run in CI/CD.
  • Update the docs to include Python 3.12 by @riley-harper in #120
  • Revert to handleInvalid = "keep" for OneHotEncoder by @riley-harper in #121. This fixes a bug that we introduced in the last release. Although it's not common, it does sometimes happen that our training data doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
  • Put the config file name in the script prompt by @riley-harper in #123. This is a small quality of life feature that makes it easier to remember which config file you're running during long hlink runs.
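
The duplicate check described above can be pictured with a short sketch like the following. It is an assumption-laden illustration rather than hlink's validation code: the function names, the "alias" key, and the error message wording are all hypothetical.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted names that appear more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def validate_unique(section_name, entries, key="alias"):
    """Raise ValueError if any entries in a config section share a key value."""
    duplicates = find_duplicates(entry[key] for entry in entries)
    if duplicates:
        raise ValueError(
            f"Duplicate {key} values in {section_name}: {', '.join(duplicates)}"
        )


# Hypothetical config section with a repeated alias.
comparison_features = [
    {"alias": "namelast_jw"},
    {"alias": "bpl_match"},
    {"alias": "namelast_jw"},
]
try:
    validate_unique("comparison_features", comparison_features)
except ValueError as error:
    print(error)
```

Failing at validation time surfaces the problem before any long-running linking step starts.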

Full Changelog: v3.5.2...v3.5.3

v3.5.2

26 Oct 15:21
d51c254

What's Changed

  • Fixed zipping issue in Training step 3 by @jrbalch543 in #104
  • Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
  • Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
  • Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
  • Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names aren't suffixed with the category value anymore.

Full Changelog: v3.5.1...v3.5.2

v3.5.1

23 Oct 20:10
6711c54

What's Changed

  • Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which now is removed. Training step 3 saves model feature importances or coefficients when training.feature_importances is set to true in the config file.

New Contributors

Full Changelog: v3.5.0...v3.5.1