Standardize precinct identifier format #144

Open
nvkelso opened this issue Jul 31, 2018 · 9 comments

Comments

@nvkelso
Owner

nvkelso commented Jul 31, 2018

Should generally be state fips (AA) & county fips (AAA) & precinct id (AAAAAAA*).

Sometimes there is both a precinct name and an ID; perhaps we should include both variants? (Though extra columns inflate the DBF.)
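
A minimal sketch of composing an identifier in that format, assuming the state and county parts are numeric FIPS codes that may have lost their leading zeros and that the precinct part is kept as-is. The function name `make_precinct_key` is hypothetical, not something in this repository.

```python
def make_precinct_key(state_fips, county_fips, precinct_id):
    """Concatenate 2-digit state FIPS, 3-digit county FIPS, and a precinct id."""
    # Restore leading zeros that are often lost when codes are stored as integers.
    state = str(state_fips).strip().zfill(2)
    county = str(county_fips).strip().zfill(3)
    precinct = str(precinct_id).strip()

    if len(state) != 2 or not state.isdigit():
        raise ValueError(f"state FIPS must be 2 digits, got {state_fips!r}")
    if len(county) != 3 or not county.isdigit():
        raise ValueError(f"county FIPS must be 3 digits, got {county_fips!r}")
    if not precinct:
        raise ValueError("precinct id is empty")

    return state + county + precinct
```

The precinct part is left variable-length here, matching the `AAAAAAA*` placeholder above.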

@migurski
Collaborator

migurski commented Aug 1, 2018

Those precinct IDs come from the Census, but only in cases where a state participated in the 2010 VTD program, right?

@nvkelso
Owner Author

nvkelso commented Aug 1, 2018 via email

@migurski
Collaborator

migurski commented Aug 1, 2018

For PlanScore, I’ve been assigning them artisanal integers. Works really well internally, but it's not something I’ve exposed generally.

@nvkelso
Owner Author

nvkelso commented Aug 1, 2018 via email

@nvkelso
Owner Author

nvkelso commented Sep 3, 2018

In #146, all state IDs are now FIPS codes, and there's a common field format: 2 chars for state, 32 chars for county (which should be ssCCC, but some data comes as longer name strings and that's not normalized yet), and 255 chars for precinct (which should be normalized too, but has the same problem as county).
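
A rough sketch of checking a record against those widths and the ssCCC county form. The field names `state`, `county`, and `precinct` and the helper names are assumptions for illustration; the actual column names in the data may differ.

```python
import re

# Maximum field widths described above; the field names are assumed.
FIELD_WIDTHS = {"state": 2, "county": 32, "precinct": 255}


def county_is_normalized(state, county):
    """True when county is a 5-digit ssCCC code beginning with the state FIPS."""
    return (
        re.fullmatch(r"\d{2}", state) is not None
        and re.fullmatch(r"\d{5}", county) is not None
        and county.startswith(state)
    )


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, width in FIELD_WIDTHS.items():
        value = record.get(field) or ""
        if len(value) > width:
            problems.append(f"{field} exceeds {width} characters")
    if not county_is_normalized(record.get("state") or "", record.get("county") or ""):
        problems.append("county is not in ssCCC form")
    return problems
```

A county stored as a name string passes the width check but is still flagged, which is the not-yet-normalized case mentioned above.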

@sigpwned
Contributor

sigpwned commented Sep 16, 2018

First, let me say how thrilled I was when I came across this project. Because it contains precinct-level geodata for the whole country, I think it can be the hub for any GIS or map election data project. I know it's a great starting point for some work I plan to do!

Regarding precincts, I reviewed precinct labels from a number of states and I was disappointed to find that there is little shared rhyme or reason among them. Some use numeric codes; some use physical location names, like "city hall"; some use a combination of the two; others seem not to include labels at all. If the goal is to standardize precinct labels in a way more general than "uppercase, split on non-alphanumeric and join with single whitespace," this project will have to come up with its own novel naming scheme. I'm not sure there is a "right" answer, on face.
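
For reference, the baseline normalization quoted above ("uppercase, split on non-alphanumeric and join with single whitespace") could look roughly like this. It is only a sketch: the function name is made up, and it assumes ASCII labels, so accented letters would be treated as separators.

```python
import re


def normalize_label(label):
    """Uppercase, split on runs of non-alphanumerics, rejoin with single spaces."""
    tokens = re.split(r"[^A-Z0-9]+", label.upper())
    # Splitting can leave empty strings at the ends; drop them.
    return " ".join(token for token in tokens if token)
```

For example, `normalize_label("City Hall - Pct. 4")` gives `"CITY HALL PCT 4"`. It tidies labels within one source but does nothing to reconcile a numeric code in one data set with a place name in another.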

However, I think we can optimize the labeling for some common use cases. The work I plan to do involves joining this data set to other precinct-level data sets, e.g. data sets from here. I think a good way of standardizing the labels would be:

  1. Look for other precinct-level data sets that are available
  2. Study how they label their precincts
  3. Choose a method that makes it as easy as possible to join to as many of those data sets as we can

For example, let's say we find 10 such data sets. It's likely they'll all be at least a little different. But if we find that they all use place names to identify precincts, then we'd want to make sure to preserve place names when they're available in this data set. Because all the data sets will be different we won't find any scheme that's perfect, but we can at least find some objective measure for "better."

I also think that what @nvkelso said about data crosswalks is really important. If this data set is going to become a hub, then it needs to be as easy for other people to pick up and use for their own purposes as possible. To that point, I think that encouraging people to publish any crosswalks they create would be A Good Thing. (For example, when I do the join to the data sets linked above, I'll be happy to share a "join table" that maps this data set to those data sets.) Those joins make this data more useful; the joined data more useful; and any data that joins to either can now be mapped to both.
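
To make the "join table" idea concrete, here is a small sketch, assuming a crosswalk is just a mapping from this data set's `(state_fips, county_fips, precinct_id)` key to whatever key the other data set uses. The function name and the shape of the inputs are hypothetical.

```python
def join_via_crosswalk(ours, theirs, crosswalk):
    """Join two keyed tables through a crosswalk of our_key -> their_key.

    ours:      dict mapping our key to a row (dict of attributes)
    theirs:    dict mapping their key to a row (dict of attributes)
    crosswalk: dict mapping our key to their key
    Returns (joined_rows, unmatched_our_keys).
    """
    joined, unmatched = [], []
    for our_key, row in ours.items():
        their_key = crosswalk.get(our_key)
        if their_key in theirs:
            # Attributes from the other data set win on name collisions.
            joined.append({**row, **theirs[their_key]})
        else:
            unmatched.append(our_key)
    return joined, unmatched
```

The unmatched list is worth keeping around, since it shows how much of the other data set a given crosswalk actually covers.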

Here are some data sets that I think it could be useful to review when trying to decide on a standard. I'm sure there are others, but hopefully these are a good start:

Just a couple of thoughts I had while elsewhere in the data set, for whatever they're worth. Hopefully they make sense.

@nvkelso
Owner Author

nvkelso commented Sep 17, 2018

Hi @sigpwned, thanks for your kind words and thoughtful comments. I really like the idea of x-walk concordance "join" tables with other precinct datasets.

I've been wondering if this project should allow both precinct "identifier" and precinct "name" columns when both those are available in the upstream sources to make this a little easier.

@sigpwned
Contributor

sigpwned commented Sep 17, 2018

> I've been wondering if this project should allow both precinct "identifier" and precinct "name" columns when both those are available in the upstream sources to make this a little easier.

That's an interesting idea! And it's knocked some ideas loose for me. Let me try to dump my brain while the thoughts are fresh.

Based on my understanding, the goal of this project is:

Every US precinct is represented by one record in the dataset with a unique (state_fips, county_fips, precinct_id) key.

Here are a few thoughts on getting there:

  1. All records should now have state FIPS codes.
  2. All records with counties should now have county FIPS codes, or will soon, per Standardize use of name and FIPS code in state and county fields #135.
  3. All records without counties should receive county labels soon, per Standardize use of name and FIPS code in state and county fields #135.
  4. There are duplicate (state_fips, county_fips, precinct_id) keys in the data set.
  5. Some records have no precinct_id.

4 and 5 above are potentially significant issues.

Regarding 4, it's difficult to know if these "duplicate" rows represent one precinct with the region split into multiple geometries, or if the rows are actually mislabeled. The only way I can think of to make that determination is to compare this data to other precinct-level data. Once we know that:

  • If the rows represent one precinct, then I recommend we merge the duplicates into one row having the ST_Union of their respective geometries.
  • If the rows are mislabeled, then I recommend we change the precinct_id labels to make them unique, e.g. by appending A, B, C, and so on.
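
The second option (appending A, B, C) might look something like the sketch below. It assumes rows are dicts with string `state_fips`, `county_fips`, and `precinct_id` fields, which are illustrative names. The first option needs a geometry library for the union and isn't shown here.

```python
from collections import defaultdict
from string import ascii_uppercase


def relabel_duplicates(rows):
    """Return copies of rows with A, B, C... appended to duplicated precinct_ids."""
    # Group row positions by their (state, county, precinct) key.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = (row["state_fips"], row["county_fips"], row["precinct_id"])
        groups[key].append(index)

    relabeled = [dict(row) for row in rows]  # leave the input untouched
    for indexes in groups.values():
        if len(indexes) < 2:
            continue  # key is already unique
        if len(indexes) > len(ascii_uppercase):
            raise ValueError("more than 26 duplicates; the suffix scheme needs extending")
        for suffix, index in zip(ascii_uppercase, indexes):
            relabeled[index]["precinct_id"] += suffix
    return relabeled
```

One caveat: a suffixed label can collide with a label that already exists (say, a real "12A" next to two rows labeled "12"), so the result should be checked for uniqueness again afterwards.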

Regarding 5, it's much like 4, except that all of the unlabeled precincts are effectively treated as having the same (empty) label. Teasing these apart into "real" labels is going to be fairly manual work, unfortunately. We probably can't cheat by comparing to a "known good" precinct data set because if that data set existed, presumably we'd be using that instead of the data we have. At the very least, we should be able to use this map or one like it to do the assignments.

We're free to assign any IDs to updated rows we like. I think it would be wise to make those IDs look as much like other precinct ID labels as possible, but the reality is that new IDs are completely at our discretion. Any crosswalks we publish are essentially a relabel anyway, so users can substitute new labels if they wish.

Regarding keeping two precinct_id columns, I think it's a fine idea, but ultimately users will have to pick one column for any work they do. Fundamentally, it would be our first crosswalk, so we can publish that separately if we want to, or leave it integrated into the data set as a separate column. They're basically the same thing.

In any case, I think the plan of attack here should be to finish out #135 since we're close, and then generate a report sizing up 4 and 5 above, per state. We won't really know how much work this step will be until we have that report.
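
A sketch of what that per-state report could compute, assuming rows are dicts with `state_fips`, `county_fips`, and `precinct_id` fields (illustrative names) and treating a blank or missing `precinct_id` as "no precinct_id".

```python
from collections import Counter


def size_up_by_state(rows):
    """Count duplicate keys (item 4) and missing precinct_ids (item 5) per state."""
    key_counts = Counter()
    missing = Counter()
    for row in rows:
        precinct = (row.get("precinct_id") or "").strip()
        if not precinct:
            missing[row["state_fips"]] += 1
        else:
            key_counts[(row["state_fips"], row["county_fips"], precinct)] += 1

    report = {}

    def entry(state):
        # Create the per-state entry on first use.
        return report.setdefault(
            state,
            {"duplicate_keys": 0, "rows_in_duplicates": 0, "missing_precinct_id": 0},
        )

    for (state, _county, _precinct), count in key_counts.items():
        if count > 1:
            entry(state)["duplicate_keys"] += 1
            entry(state)["rows_in_duplicates"] += count
    for state, count in missing.items():
        entry(state)["missing_precinct_id"] = count
    return report
```

States with neither problem simply don't appear in the result, which keeps the report focused on where the work is.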

Just my two cents. How does that seem to everyone else?

@sigpwned
Contributor

sigpwned commented Oct 2, 2018

Once we have #135 closed and the $state_fips$county_fips vs $county_fips format standardized, I see this issue as the next "big thing." Any thoughts on the above? With the benefit of more thought, I'm more confident that trying to standardize the precinct values is probably not useful, because they're so different.

Here's where I think we are:

I think fixing the last two above is top priority. I'm not sure what the best way to approach that is, per the above, but I haven't thought too hard about it yet either.

Once this is handled, in whatever way is deemed best, I think the next priority would be building crosswalks everywhere. How easy that is will probably depend on how we do this work.

Again, just my two cents. What does everyone else think?
