13 Dec 22:10

8bfe87e

v4.0.0a1 Pre-release


Version 4.0.0 Alpha 1

This pre-release previews the changes coming in version 4 of hlink. Because it includes breaking changes and an overhaul of the model exploration task, we'd like to test it before creating a full release. Documentation and code cleanup are still in progress, so the documentation for these changes and new features is incomplete. Here is a preview of the version 4 highlights (so far!):

Completely overhauled the model exploration task, switching to a nested cross-validation algorithm.
Added support for a third strategy for generating models to test in model exploration. Along with "explicit" (take exactly what's in training.model_parameters) and grid search, there is now randomized search. Randomized search takes a certain number of samples from a distribution defined in training.model_parameters.
Added the F-measure metric to the model exploration output, and simplified the output so that it always has the same columns.
Removed the training.output_suspicious_TD configuration option because it was rarely used and presented code and performance issues. Removing output_suspicious_TD makes the model exploration code more maintainable and helps it run more quickly.
Disentangled two core modules (classifier and pipeline) from the configuration format by changing the arguments to a couple of functions. This should help separate those concerns more neatly and make changes to the configuration easier if we end up doing that in the future.
Changed SparkConnection to require a checkpoint_dir argument, which fixes a bug related to Spark configuration.
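
As a rough illustration of the randomized search strategy mentioned above, the following sketch draws a fixed number of parameter samples from simple distributions. The names sample_models, param_space, and num_samples, and the way distributions are written here, are hypothetical. This is not hlink's configuration format or API, only a minimal sketch of the general technique.

```python
import random


def sample_models(param_space, num_samples, seed=None):
    """Draw num_samples parameter dictionaries from param_space.

    Hypothetical layout for this sketch: each value in param_space is
    a list of choices, a (low, high) tuple giving a uniform range of
    floats, or a single constant that is copied into every sample.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(num_samples):
        params = {}
        for name, spec in param_space.items():
            if isinstance(spec, list):
                # Pick one of the listed choices at random.
                params[name] = rng.choice(spec)
            elif isinstance(spec, tuple):
                # Sample a float uniformly between low and high.
                low, high = spec
                params[name] = rng.uniform(low, high)
            else:
                # Constants are passed through unchanged.
                params[name] = spec
        models.append(params)
    return models


# Example parameter space (names and values are made up for illustration).
param_space = {
    "type": "random_forest",
    "maxDepth": [3, 5, 7],
    "subsamplingRate": (0.5, 1.0),
}
models = sample_models(param_space, num_samples=4, seed=0)
for model in models:
    print(model)
```

Unlike grid search, the number of models tested here is set directly by num_samples rather than by the size of the parameter grid.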


04 Dec 20:36

riley-harper

v3.8.0

85a1818

v3.8.0 Latest


What's Changed

Added optional support for two new gradient boosting ML libraries: XGBoost and LightGBM. You can read more about these libraries and how to install them with their dependencies in the docs. PR #165
Added a new hlink.linking.transformers.RenameVectorAttributes transformer which can rename the attributes or "slots" of Spark vector columns. Hlink uses this to support LightGBM, which disallows certain characters in its feature names. PR #165
Documented comparisons, which are not the same as comparison features. Previously the documentation was misleading and seemed to indicate that these were the same thing. PR #159
Fixed a bug in the substitution file documentation. The documentation had the meaning of the substitution file columns flip-flopped, which was confusing. PR #166
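
To illustrate the kind of renaming that motivates RenameVectorAttributes, here is a minimal pure-Python sketch that replaces characters in a list of feature names. It does not use hlink's or Spark's API. The function sanitize_feature_names and the particular set of disallowed characters are assumptions chosen for the example.

```python
def sanitize_feature_names(names, disallowed=',:[]{}"', replacement="_"):
    """Replace each disallowed character in each name with `replacement`.

    The default set of disallowed characters is an assumption made for
    this sketch, not a list taken from LightGBM or hlink.
    """
    table = str.maketrans({char: replacement for char in disallowed})
    return [name.translate(table) for name in names]


# Hypothetical feature names for illustration.
names = ["age", "region[3]", "bpl:state"]
clean_names = sanitize_feature_names(names)
print(clean_names)
```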

Developer-Facing Changes

Updated Sphinx to 8.1.3 and fixed two Sphinx build warnings. PR #159
Updated CI/CD to automatically run only on PRs and on pushes to main. You can also now manually trigger a CI/CD run from the Actions tab in GitHub. Also removed the custom "quickcheck" pytest marker in favor of using pytest -k and removed flake8 from CI/CD because it kept causing more trouble than it was worth. PR #164

Full Changelog: v3.7.0...v3.8.0


10 Oct 17:16

riley-harper

v3.7.0

c1713e5


What's Changed

Add tests to cover several untested sections of code by @riley-harper in #147
Refactor core.transforms.generate_transforms() for readability and maintainability; improve documentation and type hints by @riley-harper in #148
Fix tests for Python 3.12 and clarify Python 3.12 support and dependence on PySpark by @riley-harper in #151
Improve logging by writing to module-level loggers instead of the root logger by @riley-harper in #152
Support setting the app name via an optional argument in SparkConnection. The default behavior of setting the app name to "linking" is unchanged. By @riley-harper in #156
Improve model_exploration step 2 terminal output, logging, and documentation to make the step more understandable by @riley-harper in #155

Full Changelog: v3.6.1...v3.7.0

Contributors

riley-harper


14 Aug 21:26

riley-harper

v3.6.1

54d4820


What's Changed

Support blocking sections with multiple exploded columns by @riley-harper in #143. This fixes a bug that caused a crash in Matching step 0 - explode.

Full Changelog: v3.6.0...v3.6.1

Contributors

riley-harper


18 Jun 20:12

riley-harper

v3.6.0

94f0e8b


What's Changed

Support OR conditions in blocking by @riley-harper in #138. This new feature supports connecting some or all blocking conditions together with ORs instead of with ANDs. You can read more about it under the "or_group" bullet point in the documentation.
Unskip several skipped tests by @riley-harper in #139. This is a development change that should not affect users.
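
One way to picture combining blocking conditions with ORs and ANDs is the sketch below. It assumes that conditions sharing an "or_group" value are joined with OR, that the resulting groups are joined with AND, and that a condition without an "or_group" forms its own group. The function build_blocking_condition, the "column_name" key, and the a/b table aliases are hypothetical, and this is not hlink's implementation.

```python
def build_blocking_condition(blocking):
    """Build a SQL-like condition string from a list of blocking entries."""
    groups = {}
    for index, block in enumerate(blocking):
        # Assumption: an entry without an or_group gets a group of its own.
        key = block.get("or_group", f"_default_{index}")
        column = block["column_name"]
        groups.setdefault(key, []).append(f"a.{column} = b.{column}")

    clauses = []
    for conditions in groups.values():
        joined = " OR ".join(conditions)
        # Parenthesize multi-condition groups so OR binds before AND.
        clauses.append(f"({joined})" if len(conditions) > 1 else joined)
    return " AND ".join(clauses)


# Hypothetical blocking entries for illustration.
blocking = [
    {"column_name": "sex"},
    {"column_name": "bpl", "or_group": "place"},
    {"column_name": "county", "or_group": "place"},
]
condition = build_blocking_condition(blocking)
print(condition)
```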

Full Changelog: v3.5.5...v3.6.0

Contributors

riley-harper


31 May 17:30

riley-harper

v3.5.5

bd69a9e


What's Changed

Support a variable number of columns in the array feature selection transform by @riley-harper in #135

Full Changelog: v3.5.4...v3.5.5

Contributors

riley-harper


20 Feb 20:20

riley-harper

v3.5.4

da9db20


What's Changed

Document column_mappings transform concat_two_cols by @riley-harper in #126. These new docs are here: https://hlink.docs.ipums.org/column_mappings.html#concat-two-cols.
Document column mapping overrides by @riley-harper in #129. These can let you read two columns with different names from the two input files into a single hlink column. Check out the documentation at https://hlink.docs.ipums.org/column_mappings.html#advanced-usage and following.
Fix a bug with the override_column_X attributes in conf_validations.py by @riley-harper in #131. Previously config validation was raising spurious errors because it didn't take override_column_a and override_column_b into account.

Full Changelog: v3.5.3...v3.5.4

Contributors

riley-harper


02 Nov 15:58

riley-harper

v3.5.3

c0f0619


Highlights

In this release we start supporting Python 3.12 and remove the ceiling on most of our dependency versions to support this. We also fix a bug with one-hot encoding and add an additional check to config validation that looks for duplicate output columns for some config sections.

What's Changed

Refactor to use colorama in a simpler way by @jrbalch543 in #115. User-facing functionality should be unchanged.
Add checks for duplicated comparison features, feature selection, and column mappings by @jrbalch543 in #113. This will cause a validation error when there are duplicated aliases or output columns for these sections.
Clean up a couple of core modules by @jrbalch543 in #117. These changes are internal refactoring and don't affect functionality.
Upgrade dependencies by pinning them more loosely and support Python 3.12 by @riley-harper in #119. This removes the upper limit on almost all of our dependencies so that users can more freely pick versions for themselves. Our tests run on the most recent available versions for each particular version of Python. Unpinning these dependencies allows us to easily support Python 3.12, which we now support and run in CI/CD.
Update the docs to include Python 3.12 by @riley-harper in #120
Revert to handleInvalid = "keep" for OneHotEncoder by @riley-harper in #121. This fixes a bug that we introduced in the last release. Although it's not common, it does sometimes happen that our training data doesn't cover every category present in the matching data. We would rather silently continue and ignore these cases by giving them a coefficient of 0 than error out on them.
Put the config file name in the script prompt by @riley-harper in #123. This is a small quality of life feature that makes it easier to remember which config file you're running during long hlink runs.
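
The duplicate checks added to config validation can be pictured with a sketch like the following. The function find_duplicates and the example section contents are hypothetical and do not reflect hlink's validation code. The sketch only shows the general idea of counting a key across the entries of a config section.

```python
from collections import Counter


def find_duplicates(items, key):
    """Return the sorted values of `key` that appear in more than one item."""
    counts = Counter(item[key] for item in items if key in item)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical config section with a repeated alias.
comparison_features = [
    {"alias": "namefrst_jw"},
    {"alias": "namelast_jw"},
    {"alias": "namefrst_jw"},
]
duplicates = find_duplicates(comparison_features, "alias")
if duplicates:
    print(f"Duplicate aliases found: {duplicates}")
```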

Full Changelog: v3.5.2...v3.5.3

Contributors

riley-harper and jrbalch543


26 Oct 15:21

riley-harper

v3.5.2

d51c254


What's Changed

Fix a zipping issue in Training step 3 by @jrbalch543 in #104
Fix a bug in Training step 3 for categorical features by @jrbalch543 and @riley-harper in #107. Each categorical feature was getting a single coefficient when each category should get its own coefficient instead.
Error out on invalid categories in training data instead of creating a new category for them by @riley-harper in #109. This bug fix reduces the number of categories created by hlink by 1. The last category represented missing or invalid data, but these categories were pretty much always unused because hlink creates exhaustive categories whenever possible. Users can still manually mark missing data by creating their own category for it, but hlink will not do this by default anymore. This should help prevent silent errors and confusion with missing data.
Fix a bug where categorical features created by interaction caused Training step 3 to crash by @riley-harper in #111
Tweak the format of Training step 3's output by @riley-harper in #112. There are now 3 columns: feature_name, category, and coefficient_or_importance. Feature names aren't suffixed with the category value anymore.
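
The idea of one coefficient per category, reported in the three columns named above, can be sketched as follows. The function expand_coefficients and its input layout are assumptions made for illustration. They are not hlink's implementation, and the feature names and numbers are made up.

```python
def expand_coefficients(features, coefficients):
    """Pair each feature (or each category of it) with one coefficient.

    features is a list of (feature_name, categories) tuples in vector
    order, where categories is None for a non-categorical feature.
    Returns rows of (feature_name, category, coefficient_or_importance).
    """
    expected = sum(1 if cats is None else len(cats) for _, cats in features)
    if expected != len(coefficients):
        raise ValueError("number of coefficients does not match the features")

    rows = []
    position = 0
    for feature_name, categories in features:
        if categories is None:
            # A non-categorical feature has a single coefficient.
            rows.append((feature_name, None, coefficients[position]))
            position += 1
        else:
            # A categorical feature has one coefficient per category.
            for category in categories:
                rows.append((feature_name, category, coefficients[position]))
                position += 1
    return rows


# Hypothetical features and coefficients for illustration.
features = [("namelast_jw", None), ("region", [1, 2, 3])]
coefficients = [0.8, 0.1, -0.2, 0.05]
rows = expand_coefficients(features, coefficients)
for row in rows:
    print(row)
```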

Full Changelog: v3.5.1...v3.5.2

Contributors

riley-harper and jrbalch543


23 Oct 20:10

riley-harper

v3.5.1

6711c54


What's Changed

Implement a new Training step that replaces Model Exploration step 3 by @jrbalch543 and @riley-harper in #101. This new step replaces the broken "get feature importances" step in Model Exploration, which now is removed. Training step 3 saves model feature importances or coefficients when training.feature_importances is set to true in the config file.

New Contributors

@jrbalch543 made their first contribution in #102! 🎉

Full Changelog: v3.5.0...v3.5.1

Contributors

riley-harper and jrbalch543


Releases: ipums/hlink
