[#62] Proofread the tutorial README
riley-harper committed Dec 1, 2022
1 parent 3fdf929 commit a3b32a8
Showing 1 changed file with 14 additions and 15 deletions.
29 changes: 14 additions & 15 deletions examples/tutorial/README.md
@@ -23,21 +23,27 @@ dataset A to be aged 50 in dataset B.

## The Config File and Linking Strategy

-To link these two datasets, we a configuration file that directs
+To link these two datasets, we need a configuration file that directs
hlink on what operations to perform and how to determine when a link is found. For
our tutorial example, we'll use deterministic linking, but hlink is also capable
of using machine learning models to classify possible links between the datasets.

In this section we'll walk through the process of creating the tutorial\_config.toml
file that can be found in this directory. Creating a config file can be complicated.
-See the hlink documentation for a detailed explanation of the different config
-file sections and keys.
+See the [hlink documentation](https://hlink.docs.ipums.org) for a detailed
+explanation of the different config file sections and keys.

The first step in creating a config file is describing the data to hlink. The
`id_column` key tells hlink the name of the id column which uniquely identifies
each record in a dataset. In our case, this is "ID". The `datasource_a`
and `datasource_b` sections give hlink information about where to find the input
-files. We give hlink the relative path to our data files in these sections.
+files. We give hlink the relative path to our data files in these sections. Each
+column that we want to read from the dataset files into hlink must appear in a
+`column_mappings` section. By default a `column_mappings` section reads in the
+column unchanged, but it can also be used to perform some preprocessing and
+cleaning on the column as it is read in. In our config file, we have hlink
+lowercase names and strip leading and trailing whitespace to support comparability
+between the datasets.

After describing the data to hlink, we need to think about our linking strategy.
How will we determine who links between the two datasets? Do we need to do any
@@ -76,13 +82,6 @@ threshold of 0.84. If a single record pair reaches both thresholds, then we call
it a link! This pair of records will end up in `potential_matches.csv` when the
script completes.

-In the real world, it's very likely that the names in dataset A and dataset B
-are not consistently formatted. This is where the `column_mappings` section
-comes in. It tells hlink to perform some data cleaning in the preprocessing step
-before matching occurs. The column mappings in the tutorial config file strip
-whitespace from the names and lowercase them to remove discrepancies in formatting
-between the two datasets.

Now that the config file is written, we can run hlink to generate some links. See
the next section for a description of the tutorial script that runs hlink.
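
As a reference point, the data-description pieces discussed above might look roughly
like the sketch below. Only `id_column`, `datasource_a`, `datasource_b`, and
`column_mappings` are taken from the text; the file paths, column names, and transform
spellings are assumptions, so treat `tutorial_config.toml` in this directory and the
hlink documentation as the authoritative syntax.

```toml
# Hypothetical sketch of the data-description sections discussed above.
# Key names other than id_column, datasource_a, datasource_b, and
# column_mappings are assumptions, not copied from tutorial_config.toml.
id_column = "ID"

[datasource_a]
file = "data_a.csv"    # assumed relative path to the first input file

[datasource_b]
file = "data_b.csv"    # assumed relative path to the second input file

# One entry per column read in from both datasets; the transforms stand in
# for the lowercasing and whitespace stripping described above.
[[column_mappings]]
column_name = "NAMEFRST"                     # assumed column name
transforms = [{ type = "lowercase_strip" }]  # assumed transform spelling

[[column_mappings]]
column_name = "NAMELAST"                     # assumed column name
transforms = [{ type = "lowercase_strip" }]
```

Declaring the cleaning in `column_mappings` keeps it in one place, so the same
lowercasing and whitespace stripping is applied to both datasets as they are read in.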

@@ -93,11 +92,11 @@ hlink to generate potential matches between the two datasets. It creates a `Link
which is the main way to control the hlink library. After analyzing the
config file for errors, it runs two link tasks: preprocessing and matching.

-The preprocessing task reads the data from the datasets in and does the data
+The preprocessing task reads in the data from the datasets and does the data
cleaning and column mapping that we've asked it to do for us in the config file.

The matching task does the real linking work, finding links between the two datasets.
-It stores its results in a `potential_matches` spark table. The script saves this
+It stores its results in a `potential_matches` Spark table. The script saves this
table to the `potential_matches.csv` file and prints it to the screen.

## Getting and Interpreting Results
@@ -113,10 +112,10 @@ that they look reasonable. Some links may be more reasonable than others!

- After running the tutorial script once, run it again. This time it should print
statements like `Preexisting table: raw_df_a`. If hlink finds that a Spark table
-already exists when it goes to compute it, it will use the pre-existing table
+already exists when it goes to compute it, it will use the preexisting table
instead of recomputing it. To prevent this from happening, try passing the
`--clean` argument to tutorial.py. This will tell the script to drop all of the
-pre-existing tables before it runs the linking job.
+preexisting tables before it runs the linking job.

- Try increasing or decreasing the Jaro-Winkler thresholds in the config file.
How does this affect the matches that are generated?
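
The thresholds that this last bullet refers to live in the comparison settings of the
config file. The sketch below is hypothetical: the 0.84 value comes from the text
above, but every key name and the second threshold are assumptions, so check
`tutorial_config.toml` and the hlink documentation for the real structure.

```toml
# Hypothetical sketch of where the Jaro-Winkler thresholds might sit.
# Only the 0.84 value is taken from the tutorial text; the key names and
# the other threshold are assumptions.
[comparisons]
operator = "AND"                 # assumed: both name comparisons must pass

[comparisons.comp_a]
feature_name = "namefrst_jw"     # assumed feature name
threshold = 0.79                 # assumed value for the first-name threshold
comparison_type = "threshold"

[comparisons.comp_b]
feature_name = "namelast_jw"     # assumed feature name
threshold = 0.84                 # threshold mentioned in the tutorial text
comparison_type = "threshold"
```

Raising a threshold makes that name comparison stricter, so fewer record pairs clear
both checks and land in `potential_matches.csv`; lowering it does the opposite.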
