Releases: ipums/hlink
v4.0.0a1
Version 4.0.0 Alpha 1
This pre-release contains upcoming changes for version 4 of hlink. Since it includes breaking changes and an overhaul of the model exploration task, we'd like to test it out a bit before creating a full release. Documentation and code cleanup are part of the work still to be done, so the documentation for these changes and new features is incomplete so far. Here is a preview of the version 4 highlights (so far!):
- Completely overhauled the model exploration task, switching to a nested cross-validation algorithm.
- Added support for a third strategy for generating models to test in model exploration. Along with "explicit" (take exactly what's in `training.model_parameters`) and grid search, there is now randomized search. Randomized search takes a certain number of samples from a distribution defined in `training.model_parameters`.
- Added the F-measure metric to the model exploration output, and simplified the output so that it always has the same columns.
- Removed the `training.output_suspicious_TD` configuration option because it was rarely used and presented code and performance issues. Removing `output_suspicious_TD` makes the model exploration code more maintainable and helps it run more quickly.
- Disentangled two core modules (`classifier` and `pipeline`) from the configuration format by changing the arguments to a couple of functions. This should help separate those concerns more neatly and make changes to the configuration easier if we end up doing that in the future.
- Changed `SparkConnection` to require a `checkpoint_dir` argument, which fixes a bug related to Spark configuration.
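To make the difference between the three model generation strategies concrete, here is a rough, standalone sketch. It does not use hlink's code or its configuration schema: the function names, the parameter names, and the way distributions are written (a list of choices or a `(low, high)` tuple) are all assumptions made up for illustration.

```python
import itertools
import random


def explicit(models):
    """Explicit strategy: test exactly the models that were listed."""
    return [dict(model) for model in models]


def grid_search(grid):
    """Grid search: test every combination of the listed values."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def randomized_search(distributions, num_samples, seed=None):
    """Randomized search: draw num_samples models from the distributions.

    Hypothetical format: a list means "pick one of these", and a
    (low, high) tuple means "pick uniformly between low and high".
    """
    rng = random.Random(seed)
    models = []
    for _ in range(num_samples):
        model = {}
        for name, spec in distributions.items():
            if isinstance(spec, list):
                model[name] = rng.choice(spec)
            else:
                low, high = spec
                if isinstance(low, int) and isinstance(high, int):
                    model[name] = rng.randint(low, high)  # inclusive bounds
                else:
                    model[name] = rng.uniform(low, high)
        models.append(model)
    return models


# Made-up hyperparameters, for illustration only.
grid_models = grid_search({"maxDepth": [3, 5], "numTrees": [10, 50, 100]})
random_models = randomized_search(
    {"maxDepth": (3, 10), "threshold": (0.5, 0.9), "impurity": ["gini", "entropy"]},
    num_samples=4,
    seed=0,
)
```

Grid search grows multiplicatively with the number of values per parameter (6 models above), while randomized search always produces the requested number of samples no matter how large the search space is.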
v3.8.0
What's Changed
- Added optional support for two new gradient boosting ML libraries: XGBoost and LightGBM. You can read more about these libraries and how to install them with their dependencies in the docs here. PR #165
- Added a new `hlink.linking.transformers.RenameVectorAttributes` transformer which can rename the attributes or "slots" of Spark vector columns. Hlink uses this to support LightGBM, which disallows certain characters in its feature names. PR #165
- Documented comparisons, which are not the same as comparison features. Previously the documentation was misleading and seemed to indicate that these were the same thing. PR #159
- Fixed a bug in the substitution file documentation. The documentation had the meaning of the substitution file columns flip-flopped, which was confusing. PR #166
Developer-Facing Changes
- Updated Sphinx to 8.1.3 and fixed two Sphinx build warnings. PR #159
- Updated CI/CD to automatically run only on PRs and on pushes to main. You can also now manually trigger a CI/CD run from the Actions tab in GitHub. Also removed the custom "quickcheck" pytest marker in favor of using `pytest -k` and removed flake8 from CI/CD because it kept causing more trouble than it was worth. PR #164
Full Changelog: v3.7.0...v3.8.0
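The idea behind the `RenameVectorAttributes` change in this release can be sketched in plain Python. This is not the transformer's API: the function below, its arguments, the example feature names, and the particular characters being replaced are all hypothetical and only show the kind of renaming involved.

```python
def rename_attributes(names, strs_to_replace, replace_with="_"):
    """Return a copy of names with each disallowed substring replaced.

    Hypothetical helper for illustration; it works on a plain list of
    strings rather than on the metadata of a Spark vector column.
    """
    renamed = []
    for name in names:
        for disallowed in strs_to_replace:
            name = name.replace(disallowed, replace_with)
        renamed.append(name)
    return renamed


# Made-up feature names; the characters to replace are an assumption.
feature_names = ["namelast_jw", "region:1", "bpl[0]"]
clean_names = rename_attributes(feature_names, [":", "[", "]"])
```

Names that contain none of the listed substrings pass through unchanged, so the renaming is safe to apply to every slot of the vector.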
v3.7.0
What's Changed
- Add tests to cover several untested sections of code by @riley-harper in #147
- Refactor core.transforms.generate_transforms() for readability and maintainability; improve documentation and type hints by @riley-harper in #148
- Fix tests for Python 3.12 and clarify Python 3.12 support and dependence on PySpark by @riley-harper in #151
- Improve logging by writing to module-level loggers instead of the root logger by @riley-harper in #152
- Support setting the app name via an optional argument in SparkConnection. The default behavior of setting the app name to "linking" is unchanged. By @riley-harper in #156
- Improve model_exploration step 2 terminal output, logging, and documentation to make the step more understandable by @riley-harper in #155
Full Changelog: v3.6.1...v3.7.0
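For readers unfamiliar with the logging change in this release, the general pattern of a module-level logger looks like the sketch below. It uses only the standard library `logging` module; the `run_step` function is a hypothetical stand-in and is not taken from hlink.

```python
import logging

# One logger per module, named after the module that defines it,
# instead of calling logging.info() on the root logger directly.
logger = logging.getLogger(__name__)


def run_step(step_name):
    """Hypothetical function that logs through the module-level logger."""
    logger.info("Running step %s", step_name)
    return step_name
```

Because each logger carries its module's name, applications that use the library can raise or lower the log level for individual modules without touching the root logger's configuration.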
v3.6.1
What's Changed
- Support blocking sections with multiple exploded columns by @riley-harper in #143. This fixes a bug that caused a crash in Matching step 0 - explode.
Full Changelog: v3.6.0...v3.6.1
v3.6.0
What's Changed
- Support OR conditions in blocking by @riley-harper in #138. This new feature supports connecting some or all blocking conditions together with ORs instead of with ANDs. You can read more documentation about it under the "or_group" bullet point here.
- Unskip several skipped tests by @riley-harper in #139. This is a development change that should not affect users.
Full Changelog: v3.5.5...v3.6.0
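As a rough illustration of what OR conditions in blocking mean, the sketch below builds a join condition from a list of blocking entries. It assumes that entries sharing the same `or_group` value are connected with OR and that the resulting groups are connected with AND; the helper function, the `a`/`b` table aliases, and the example column names are hypothetical and are not hlink's implementation.

```python
def build_blocking_condition(blocking):
    """Combine blocking columns into a single join condition string.

    Assumption: entries with the same "or_group" are joined with OR,
    entries without one each form their own group, and the groups are
    then joined with AND.
    """
    groups = {}
    for index, entry in enumerate(blocking):
        if "or_group" in entry:
            key = ("named", entry["or_group"])
        else:
            key = ("single", index)
        column = entry["column_name"]
        groups.setdefault(key, []).append(f"a.{column} = b.{column}")
    clauses = ["(" + " OR ".join(conditions) + ")" for conditions in groups.values()]
    return " AND ".join(clauses)


# Made-up blocking section, written as Python dicts for illustration.
blocking = [
    {"column_name": "sex"},
    {"column_name": "bpl", "or_group": "place"},
    {"column_name": "statefip", "or_group": "place"},
]
condition = build_blocking_condition(blocking)
```

With this input, `condition` is `(a.sex = b.sex) AND (a.bpl = b.bpl OR a.statefip = b.statefip)`, so two records block together when they agree on `sex` and on at least one of the two place columns.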
v3.5.5
What's Changed
- Support a variable number of columns in the array feature selection transform by @riley-harper in #135
Full Changelog: v3.5.4...v3.5.5
v3.5.4
What's Changed
- Document column_mappings transform concat_two_cols by @riley-harper in #126. These new docs are here: https://hlink.docs.ipums.org/column_mappings.html#concat-two-cols.
- Document column mapping overrides by @riley-harper in #129. These can let you read two columns with different names from the two input files into a single hlink column. Check out the documentation at https://hlink.docs.ipums.org/column_mappings.html#advanced-usage and following.
- Fix a bug with the override_column_X attributes in conf_validations.py by @riley-harper in #131. Previously config validation was raising spurious errors because it didn't take override_column_a and override_column_b into account.
Full Changelog: v3.5.3...v3.5.4
v3.5.3
Highlights
In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.
What's Changed
- Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
- Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
- Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
- Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies allows us to easily support Python 3.12, which we now support and run in CI/CD.
- Update the docs to include Python 3.12 by @riley-harper in #120
- Revert to handleInvalid = "keep" for OneHotEncoder by @riley-harper in #121. This is a bug that we introduced in the last release. Although it's not common, it does sometimes happen that our training data doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
- Put the config file name in the script prompt by @riley-harper in #123. This is a small quality of life feature that makes it easier to remember which config file you're running during long hlink runs.
Full Changelog: v3.5.2...v3.5.3
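The effect of the one-hot encoding fix in this release can be illustrated without Spark. The sketch below is a simplified stand-in, not Spark's `OneHotEncoder`: a category that was never seen in training encodes to an all-zero vector, so it adds nothing to a linear model's score. The function names, categories, and coefficients are made up for the example.

```python
def one_hot_keep(value, known_categories):
    """Encode value against the categories seen in training.

    An unseen value matches no category and becomes an all-zero vector
    instead of raising an error.
    """
    return [1.0 if value == category else 0.0 for category in known_categories]


def linear_contribution(vector, coefficients):
    """Contribution of one encoded feature to a linear model's score."""
    return sum(v * c for v, c in zip(vector, coefficients))


# Hypothetical categories seen in training and their fitted coefficients.
known = ["north", "south"]
coefficients = [0.4, -0.2]

seen = one_hot_keep("south", known)    # [0.0, 1.0]
unseen = one_hot_keep("west", known)   # [0.0, 0.0]
```

Here `linear_contribution(unseen, coefficients)` is 0, which matches the behavior described above of silently continuing and treating uncovered categories as having a coefficient of 0.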
v3.5.2
What's Changed
- Fixed zipping issue in Training step 3 by @jrbalch543 in #104
- Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
- Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
- Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
- Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names aren't suffixed with the category value anymore.
Full Changelog: v3.5.1...v3.5.2
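The three-column output format for Training step 3 described in this release can be pictured with a small sketch. This is not hlink's code: the function, the input layout, the example values, and the use of `None` as the category for non-categorical features are assumptions for illustration.

```python
def coefficient_rows(features, coefficients):
    """Build (feature_name, category, coefficient_or_importance) rows.

    features is a list of (name, categories) pairs, where categories is
    None for a non-categorical feature. Each category of a categorical
    feature gets its own row and its own coefficient.
    """
    expected = sum(1 if cats is None else len(cats) for _, cats in features)
    if expected != len(coefficients):
        raise ValueError(f"expected {expected} coefficients, got {len(coefficients)}")

    rows = []
    position = 0
    for name, categories in features:
        for category in ([None] if categories is None else categories):
            rows.append((name, category, coefficients[position]))
            position += 1
    return rows


# Made-up features and coefficients.
features = [("namelast_jw", None), ("region", [1, 2, 3])]
rows = coefficient_rows(features, [1.5, 0.2, -0.1, 0.0])
```

The feature name stays the same on every row for `region`, and the category value goes in its own column rather than being appended to the name.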
v3.5.1
What's Changed
- Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which now is removed. Training step 3 saves model feature importances or coefficients when `training.feature_importances` is set to true in the config file.
New Contributors
- @jrbalch543 made their first contribution in #102! 🎉
Full Changelog: v3.5.0...v3.5.1