[#62] Proofread the tutorial README
riley-harper committed Dec 1, 2022
1 parent 3fdf929 commit a3b32a8
Showing 1 changed file with 14 additions and 15 deletions.
29 changes: 14 additions & 15 deletions examples/tutorial/README.md
@@ -23,21 +23,27 @@ dataset A to be aged 50 in dataset B.

## The Config File and Linking Strategy

-To link these two datasets, we a configuration file that directs
+To link these two datasets, we need a configuration file that directs
hlink on what operations to perform and how to determine when a link is found. For
our tutorial example, we'll use deterministic linking, but hlink is also capable
of using machine learning models to classify possible links between the datasets.

In this section we'll walk through the process of creating the tutorial\_config.toml
file that can be found in this directory. Creating a config file can be complicated.
-See the hlink documentation for a detailed explanation of the different config
-file sections and keys.
+See the [hlink documentation](https://hlink.docs.ipums.org) for a detailed
+explanation of the different config file sections and keys.

The first step in creating a config file is describing the data to hlink. The
`id_column` key tells hlink the name of the id column which uniquely identifies
each record in a dataset. In our case, this is "ID". The `datasource_a`
and `datasource_b` sections give hlink information about where to find the input
-files. We give hlink the relative path to our data files in these sections.
+files. We give hlink the relative path to our data files in these sections. Each
+column that we want to read from the dataset files into hlink must appear in a
+`column_mappings` section. By default a `column_mappings` section reads in the
+column unchanged, but it can also be used to perform some preprocessing and
+cleaning on the column as it is read in. In our config file, we have hlink
+lowercase names and strip leading and trailing whitespace to support comparability
+between the datasets.

After describing the data to hlink, we need to think about our linking strategy.
How will we determine who links between the two datasets? Do we need to do any
@@ -76,13 +82,6 @@ threshold of 0.84. If a single record pair reaches both thresholds, then we call
it a link! This pair of records will end up in `potential_matches.csv` when the
script completes.

-In the real world, it's very likely that the names in dataset A and dataset B
-are not consistently formatted. This is where the `column_mappings` section
-comes in. It tells hlink to perform some data cleaning in the preprocessing step
-before matching occurs. The column mappings in the tutorial config file strip
-whitespace from the names and lowercase them to remove discrepancies in formatting
-between the two datasets.

Now that the config file is written, we can run hlink to generate some links. See
the next section for a description of the tutorial script that runs hlink.
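
As a reference point, the data-description pieces discussed above might look roughly
like the sketch below. Only `id_column`, `datasource_a`, `datasource_b`, and
`column_mappings` are taken from the text; the file paths, column names, and transform
spellings are assumptions, so treat `tutorial_config.toml` in this directory and the
hlink documentation as the authoritative syntax.

```toml
# Hypothetical sketch of the data-description sections discussed above.
# Key names other than id_column, datasource_a, datasource_b, and
# column_mappings are assumptions, not copied from tutorial_config.toml.
id_column = "ID"

[datasource_a]
file = "data_a.csv"    # assumed relative path to the first input file

[datasource_b]
file = "data_b.csv"    # assumed relative path to the second input file

# One entry per column read in from both datasets; the transforms stand in
# for the lowercasing and whitespace stripping described above.
[[column_mappings]]
column_name = "NAMEFRST"                     # assumed column name
transforms = [{ type = "lowercase_strip" }]  # assumed transform spelling

[[column_mappings]]
column_name = "NAMELAST"                     # assumed column name
transforms = [{ type = "lowercase_strip" }]
```

Declaring the cleaning in `column_mappings` keeps it in one place, so the same
lowercasing and whitespace stripping is applied to both datasets as they are read in.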

@@ -93,11 +92,11 @@ hlink to generate potential matches between the two datasets. It creates a `Link
which is the main way to control the hlink library. After analyzing the
config file for errors, it runs two link tasks: preprocessing and matching.

-The preprocessing task reads the data from the datasets in and does the data
+The preprocessing task reads in the data from the datasets and does the data
cleaning and column mapping that we've asked it to do for us in the config file.

The matching task does the real linking work, finding links between the two datasets.
-It stores its results in a `potential_matches` spark table. The script saves this
+It stores its results in a `potential_matches` Spark table. The script saves this
table to the `potential_matches.csv` file and prints it to the screen.

## Getting and Interpreting Results
@@ -113,10 +112,10 @@ that they look reasonable. Some links may be more reasonable than others!

- After running the tutorial script once, run it again. This time it should print
statements like `Preexisting table: raw_df_a`. If hlink finds that a Spark table
-already exists when it goes to compute it, it will use the pre-existing table
+already exists when it goes to compute it, it will use the preexisting table
instead of recomputing it. To prevent this from happening, try passing the
`--clean` argument to tutorial.py. This will tell the script to drop all of the
-pre-existing tables before it runs the linking job.
+preexisting tables before it runs the linking job.

- Try increasing or decreasing the Jaro-Winkler thresholds in the config file.
How does this affect the matches that are generated?
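
The thresholds that this last bullet refers to live in the comparison settings of the
config file. The sketch below is hypothetical: the 0.84 value comes from the text
above, but every key name and the second threshold are assumptions, so check
`tutorial_config.toml` and the hlink documentation for the real structure.

```toml
# Hypothetical sketch of where the Jaro-Winkler thresholds might sit.
# Only the 0.84 value is taken from the tutorial text; the key names and
# the other threshold are assumptions.
[comparisons]
operator = "AND"                 # assumed: both name comparisons must pass

[comparisons.comp_a]
feature_name = "namefrst_jw"     # assumed feature name
threshold = 0.79                 # assumed value for the first-name threshold
comparison_type = "threshold"

[comparisons.comp_b]
feature_name = "namelast_jw"     # assumed feature name
threshold = 0.84                 # threshold mentioned in the tutorial text
comparison_type = "threshold"
```

Raising a threshold makes that name comparison stricter, so fewer record pairs clear
both checks and land in `potential_matches.csv`; lowering it does the opposite.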
