Use Postgres, document all ETL steps
As we near launch, I wanted to make sure we could rebuild the analysis
database from scratch.  I found insert and update speeds to be slow with
SQLite, my original database choice, so I wanted to try using Postgres
when I rebuild the database.  While it may be possible to update the
code to work better with SQLite, we don't have a lot of time and I'd
likely lose some of the convenient abstractions of the Django ORM layer.

In the process of this, I updated the README so all steps in the
transform/processing pipeline were reflected.

This commit includes updates to make the processing pipeline work with
Postgres.

One of the big differences is that Postgres enforces the declared
maximum lengths of text fields, which SQLite does not.  This is actually
a great way to find data quirks, but it meant a number of updates were
needed to fix data errors.

When parsing city and state fields, return the state abbreviation, not
the value of the state field.

Hack around false positives when looking up states using the ``us``
package.
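
The two changes above can be sketched roughly as follows. This is a simplified illustration, not the code in ``cleaner.py``: ``lookup_abbr`` and its ``_FAKE_LOOKUP`` table are hypothetical stand-ins for ``us.states.lookup``, the ``MONEE`` entry is there only to simulate a false positive, and the branch that checks the last two characters of a token is left out.

```python
import re

# Hypothetical stand-in for us.states.lookup(); the real package is not used.
# The "MONEE" entry deliberately simulates a false positive.
_FAKE_LOOKUP = {
    "IL": "IL",
    "ILLINOIS": "IL",
    "IN": "IN",
    "INDIANA": "IN",
    "MONEE": "ME",
}

# Strings that represent states but are not official abbreviations
MOCK_STATES = {"ILL", "I", "MX"}
# Tokens that must never be treated as a state
NOT_STATES = {"MONEE"}
PUNCTUATION_RE = re.compile(r"[,.]+")


def lookup_abbr(token):
    """Return a state abbreviation for the token, or None if unknown."""
    return _FAKE_LOOKUP.get(token.upper())


def split_city_state(city_state):
    """Split 'CITY, STATE' into (city, state abbreviation)."""
    cleaned = PUNCTUATION_RE.sub(" ", city_state).strip()
    bits = re.split(r"\s+", cleaned)
    last = bits[-1]

    # Skip the lookup entirely for known false positives.
    if last in NOT_STATES:
        return " ".join(bits), ""

    abbr = lookup_abbr(last)
    if abbr or last in MOCK_STATES:
        # Prefer the normalized abbreviation over the raw field value.
        return " ".join(bits[:-1]), abbr or last

    return " ".join(bits), ""
```

With this sketch, ``split_city_state("CHICAGO, ILLINOIS")`` yields the abbreviation rather than the raw text, and ``MONEE`` stays a city name.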

Update the ``load_dispositions_csv`` management command to insert
records in batches to avoid the process crashing or Postgres generating
an error.
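
As a minimal sketch of the batching idea, the helper below slices a list into fixed-size chunks. The name ``batches`` is hypothetical; the command itself slices the list inline and passes each slice to ``bulk_create``.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded slice would then be handed to a single ``bulk_create`` call, so no one insert statement has to carry every record.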

Also, fix shifted columns due to bad CSV quoting when dispositions are
loaded in ``load_dispositions_csv``.
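
The fix in this commit is written out field by field. A more general version might look something like the sketch below, which assumes a row is a dict, that ``fields`` lists the column names in CSV order, and that exactly two values were fused into one column. The function name and signature are hypothetical and are not part of this commit.

```python
def unmerge_column(row, fields, merged_field, left_value, right_value):
    """Undo a one-column shift caused by two values fused into merged_field.

    Every column after the fused pair takes the value of its left
    neighbor. The original value of the last column is dropped, on the
    assumption that it was only padding. merged_field must not be the
    last column.
    """
    idx = fields.index(merged_field)
    fixed = dict(row)
    # Read from the untouched original row so assignment order is irrelevant.
    for pos in range(idx + 2, len(fields)):
        fixed[fields[pos]] = row[fields[pos - 1]]
    fixed[fields[idx]] = left_value
    fixed[fields[idx + 1]] = right_value
    return fixed
```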

Update the ``create_dispositions`` management command to process the
RawDisposition instances in batches to avoid running out of memory.

Add assertions in the ``Disposition`` methods that load and parse values
from the ``RawDisposition`` fields to make sure things like state or
charge class are of the correct length.
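
A length assertion of this kind could be as small as the sketch below. ``check_length`` is a hypothetical helper name, and the actual assertions in the ``Disposition`` methods may be written differently.

```python
def check_length(value, max_length, field_name):
    """Return value unchanged, failing loudly if it is too long for its field."""
    assert len(value) <= max_length, (
        "{} value {!r} is longer than {} characters".format(
            field_name, value, max_length))
    return value
```

Failing at parse time points at the offending record directly, rather than surfacing later as a database error during a bulk insert.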

Cleanly handle exceptions when trying to detect IUCR code for a statute.

Fix the name field of the ``CensusPlace`` model so it's 100 chars long
instead of 7.  This was just a typo in the original code.

You'll have to run ``manage.py migrate convictions_data`` to update your
database with this change.

Add missing ``place`` field to the ``CONVICTIONS_IMPORT_FIELDS`` list
when creating convictions from dispositions.

Update extra SQL in IUCR code queries to work with Postgres.  This is
mostly an issue of how things are quoted.

Remove trailing whitespace throughout modified files.

Update age-based queries to use Postgres functions to calculate the age.
This likely breaks things in SQLite.  Also remove
``AgeQuerySetMixin.with_ages`` since Postgres doesn't let you use an
alias in the ``WHERE`` clause as the ``WHERE`` clause is evaluated
first.  Instead, we have to include the age expression directly within
the WHERE clause.  The upside of this is that we can now use the count()
method to get the crimes by type instead of having to use len() and
evaluating the queryset.  This query still runs kind of slow, however.
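
To illustrate repeating the expression inside the ``WHERE`` clause, here is a rough sketch that builds such a fragment. The Postgres ``age()`` function and ``EXTRACT`` are real, but the column names ``initial_date`` and ``dob`` and the helper name are assumptions, and this is not necessarily the expression used in the queryset code.

```python
# Postgres expression for an age in whole years.
# The column names filled in below are assumptions for illustration.
AGE_EXPRESSION = "EXTRACT(year FROM age({date_column}, {dob_column}))"


def age_where_clause(min_age, max_age,
                     date_column="initial_date", dob_column="dob"):
    """Build a WHERE fragment and parameters for an inclusive age range."""
    expr = AGE_EXPRESSION.format(date_column=date_column,
                                 dob_column=dob_column)
    # The expression is repeated because an alias defined in the SELECT
    # list is not visible to the WHERE clause.
    where = "{expr} >= %s AND {expr} <= %s".format(expr=expr)
    return where, [min_age, max_age]
```

The fragment and parameters could be passed to a queryset's ``extra(where=[...], params=[...])``, after which ``count()`` runs in the database.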
ghing committed Oct 28, 2014
1 parent 3f51fa0 commit 31e0752
Showing 10 changed files with 517 additions and 111 deletions.
110 changes: 97 additions & 13 deletions README.rst
@@ -12,6 +12,24 @@ Quickstart
Installation
------------

Create spatial database
~~~~~~~~~~~~~~~~~~~~~~~

PostGIS

::

$ createdb convictions
$ psql convictions
> CREATE EXTENSION postgis;
> CREATE EXTENSION postgis_topology;

Spatialite

::

spatialite convictions.sqlite3 "SELECT InitSpatialMetaData();"

::

git clone https://github.com/sc3/cook-convictions-data.git
@@ -20,14 +38,28 @@ Installation
pip install -r requirements.txt
cp convictions/settings/dev.example.py convictions/settings/dev.py
# Edit convictions/settings/dev.py to fill in the needed variables
spatialite convictions.sqlite3 "SELECT InitSpatialMetaData();"
./manage.py syncdb
./manage.py migrate

We use `DataMade's <http://datamade.us/>`_ `usaddress <https://github.com/datamade/usaddress>`_ package to parse addresses when anonymizing them to the block level. However, the stable version of the package doesn't support Python 3. For now, we use a fork that I made that adds rough Python 3 support. We install this fork as editable, so we need to run its training step ourselves.

::

workon convictions
cd /path/to/virtualenv/src/usaddress
python training/training.py


Load spatial data
-----------------

First, download and unpack the Shapefile version of the Cook County Municipalities data from https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk
Download and unpack the Shapefile version of Chicago Community Areas.

Then run::

./manage.py load_spatial_data CommunityArea data/Comm_20Areas/CommAreas.shp

Download and unpack the Shapefile version of the Cook County Municipalities data from https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk

Then run::

@@ -45,20 +77,57 @@ Then run::

./manage.py load_spatial_data CensusPlace data/tl_2010_17_place10/tl_2010_17_place10.shp


Load census data
----------------

::

./manage.py load_aff_data CensusTract total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__tracts.csv

./manage.py load_aff_data CensusTract per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__tracts.csv

./manage.py load_aff_data CensusPlace total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__places.csv

./manage.py load_aff_data CensusPlace per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__places.csv

Aggregate census data to Chicago Community Areas
------------------------------------------------

::

./manage.py aggregate_census_fields


Identify suburbs
----------------

::

./manage.py flag_chicago_msa_places data/tl_2010_17_place10_chicago_msa.csv


Load raw dispositions data
--------------------------

This command will also fix known issues with columns being shifted in some rows due to bad escaping of quoted columns in the raw CSV file.

Note that the ``--delete`` flag removes any previous records.

::

./manage.py load_dispositions_csv data/Criminal_Convictions_ALLCOOK_05-09.csv
./manage.py load_dispositions_csv --delete data/Criminal_Convictions_ALLCOOK_05-09.csv


Populate clean disposition records
----------------------------------

Note that the ``--delete`` flag removes any previous records.

::

./manage.py create_dispositions
./manage.py create_dispositions --delete


Geocode disposition records
---------------------------
@@ -67,18 +136,22 @@ Geocode disposition records

./manage.py geocode_dispositions

Load census data
----------------

Detect Community Area and Census Place boundaries
-------------------------------------------------

::

./manage.py load_aff_data CensusTract total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__tracts.csv
./manage.py boundarize

./manage.py load_aff_data CensusTract per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__tracts.csv

./manage.py load_aff_data CensusPlace total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__places.csv
Create convictions records from the dispositions
------------------------------------------------

::

./manage.py create_convictions --delete

./manage.py load_aff_data CensusPlace per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__places.csv

Export Community Area and Census Place GeoJSON
----------------------------------------------
@@ -95,7 +168,7 @@ Extract Chicago's border from a shapefile

::

./manage.py chicago_geojson_from_shp data/tl_2010_17_place10/tl_2010_17_place10.shp > chicago.json
./manage.py chicago_geojson_from_shp data/tl_2010_17_place10/tl_2010_17_place10.shp > chicago.json

Export convictions by age bucket
--------------------------------
@@ -105,6 +178,16 @@ Export convictions by age bucket
./manage.py export_age_json > convictions_by_age.json


Export disposition data
-----------------------

Export Disposition model records to CSV. Anonymize the data by dropping personal identifier fields and converting address fields to the block level. For example, an address number of "2707" would be converted to "2700".

::

./manage.py export_csv > dispositions.csv
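
The block-level conversion described above can be sketched as rounding the address number down to the nearest hundred. This is only an illustration under that assumption: the function name is hypothetical, the export command may handle edge cases differently, and non-numeric address numbers (for example "2707A") would raise a ``ValueError`` here.

```python
def anonymize_address_number(address_number):
    """Round a street address number down to its hundred block."""
    number = int(address_number)
    return str(number - (number % 100))
```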


Manual Processes
================

@@ -129,6 +212,7 @@ I created a list of these census places by bringing the TIGER shapefile for Illi

 ogr2ogr -f CSV tl_2010_17_place10_chicago_msa.csv tl_2010_17_place10_chicago_msa/tl_2010_17_place10_chicago_msa.shp


Loading conviction places from dispositions
-------------------------------------------

@@ -140,11 +224,11 @@ Because we added places mid-process, I didn't want to re-create Conviction recor
Other datasets
==============

* `Boundaries - Community Areas (current) <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6>`_
* `Boundaries - Community Areas (current) <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6>`_
* `Cook County Municipalities <https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk>`_
* `Boundaries - Census Tracts - 2010 <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik>`_
* `2010 Illinois Census Place TIGER Shapefile <http://www2.census.gov/geo/tiger/TIGER2010/PLACE/2010/tl_2010_17_place10.zip>`_
* 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Cook County Census Tracts
* 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Illinois Census Places
* 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Illinois Census Places
* 2010 ACS 5-year Estimates "PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2010 INFLATION-ADJUSTED DOLLARS)" (B19301) for Cook County Census Tracts
* `2010 ACS 5-year Estimates "PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2010 INFLATION-ADJUSTED DOLLARS)" (B19301) for Illinois Census Places <http://factfinder2.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_10_5YR_B19301&prodType=table>`_
17 changes: 14 additions & 3 deletions convictions_data/cleaner.py
@@ -9,19 +9,30 @@ class CityStateSplitter(object):
# Strings that represent states but are not official abbreviations
MOCK_STATES = set(['ILL', 'I', 'MX'])

# HACK: These give a false positive when trying to match against state names
NOT_STATES = set(['MONEE'])

@classmethod
def split_city_state(cls, city_state):
city_state = cls.PUNCTUATION_RE.sub(' ', city_state)
bits = re.split(r'\s+', city_state.strip())

last = bits[-1]

if us.states.lookup(last) or last in cls.MOCK_STATES:
state = last
state_lookup = us.states.lookup(last)
if last not in cls.NOT_STATES and (state_lookup or last in cls.MOCK_STATES):
if state_lookup:
state = state_lookup.abbr
else:
state = last
city_bits = bits[:-1]
elif len(last) >= 2 and (us.states.lookup(last[-2:]) or
last[-2:] in cls.MOCK_STATES):
state = last[-2:]
state_lookup = us.states.lookup(last[-2:])
if state_lookup:
state = state_lookup.abbr
else:
state = last[-2:]
city_bits = bits[:-1] + [last[:-2]]
else:
state = ""
1 change: 0 additions & 1 deletion convictions_data/management/commands/create_convictions.py
@@ -22,7 +22,6 @@ def handle(self, *args, **options):
Conviction.objects.all().delete()
Disposition.objects.in_analysis().update(conviction=None)

# TODO: Update this once we clean the misaligned data
qs = Disposition.objects.in_analysis().filter(chrgclass__regex=r'^[A-Z0-9]{0,1}$')

with transaction.atomic():
19 changes: 15 additions & 4 deletions convictions_data/management/commands/create_dispositions.py
@@ -7,21 +7,32 @@
class Command(BaseCommand):
help = "Create clean disposition records from raw data"

BATCH_SIZE = 5000

option_list = BaseCommand.option_list + (
make_option('--delete',
action='store_true',
dest='delete',
default=False,
help="Delete previously created models",
),
make_option('--batch-size',
action='store',
type='int',
default=BATCH_SIZE,
dest='batch_size',
help="Process in batches of this number of records"),
)

def handle(self, *args, **options):
if options['delete']:
Disposition.objects.all().delete()

models = []
for rd in RawDisposition.objects.all():
models.append(Disposition(raw_disposition=rd))
num_disps = RawDisposition.objects.count()
for i in range(0, num_disps, options['batch_size']):
models = []
raw_disps = RawDisposition.objects.all().order_by('case_number')[i:i+options['batch_size']]
for rd in raw_disps:
models.append(Disposition(raw_disposition=rd))

Disposition.objects.bulk_create(models)
Disposition.objects.bulk_create(models)
84 changes: 83 additions & 1 deletion convictions_data/management/commands/load_dispositions_csv.py
@@ -1,4 +1,5 @@
import csv
import logging
from optparse import make_option

from django.core.management.base import BaseCommand
@@ -9,13 +10,23 @@ class Command(BaseCommand):
args = "<csv_filename>"
help = "Load raw dispositions CSV into database models"

# Number of records to insert at once as it fails if we try to insert
# all the records at once
BATCH_SIZE = 5000

option_list = BaseCommand.option_list + (
make_option('--delete',
action='store_true',
dest='delete',
default=False,
help="Delete previously loaded models",
),
make_option('--batch-size',
action='store',
type='int',
default=BATCH_SIZE,
dest='batch_size',
help="Process in batches of this number of records"),
)

def handle(self, *args, **options):
@@ -31,4 +42,75 @@ def handle(self, *args, **options):
model_kwargs = {k.lower():v for k, v in row.items()}
models.append(RawDisposition(**model_kwargs))

RawDisposition.objects.bulk_create(models)
for i in range(0, len(models), options['batch_size']):
RawDisposition.objects.bulk_create(models[i:i+options['batch_size']])

self.fix_shifted()

def fix_shifted(self):
"""Fix columns that were shifted due to bad escaping in the CSV"""
# HACK: This is overly verbose and could be generalized. For now,
# just do this explicitly, but if we run into more examples that need
# this, we should make a more general solution for shifting columns

# TODO: Move this into a separate management command
bad_chrgdesc = "RIFLE <16''/SHOTGUN <18\",F\""
disps = RawDisposition.objects.filter(chrgdesc=bad_chrgdesc)
for disp in disps:
disp.amtoffine = disp.maxsent
disp.maxsent = disp.minsent
disp.ammndchrgclass = disp.ammndchrgtype
disp.ammndchrgtype = disp.ammndchrgdescr
disp.ammndchrgdescr = disp.ammndchargstatute
disp.ammndchargstatute = disp.chrgdispdate
disp.chrgdispdate = disp.chrgdisp
disp.chrgdisp = disp.chrgclass
disp.chrgclass = disp.chrgtype2
disp.chrgtype2 = disp.chrgtype
disp.chrgtype = "F"
disp.chrgdesc = "RIFLE <16''/SHOTGUN <18\""
logging.info("Fixing shifted cells due to chrgdesc in RawDisposition "
"with pk {}".format(disp.pk))
disp.save()

disps = RawDisposition.objects.filter(ammndchrgdescr=bad_chrgdesc)
for disp in disps:
disp.amtoffine = disp.maxsent
disp.maxsent = disp.minsent
disp.ammndchrgclass = disp.ammndchrgtype
disp.ammndchrgtype = "F"
disp.ammndchrgdescr = "RIFLE <16''/SHOTGUN <18\""
logging.info("Fixing shifted cells due to ammndchrgdescr in RawDisposition "
"with pk {}".format(disp.pk))
disp.save()

bad_address = "10716 S AVENUE M\",CHICAGO IL\""
disps = RawDisposition.objects.filter(st_address=bad_address)
for disp in disps:
disp.amtoffine = disp.maxsent
disp.maxsent = disp.minsent
disp.ammndchrgclass = disp.ammndchrgtype
disp.ammndchrgtype = disp.ammndchrgdescr
disp.ammndchrgdescr = disp.ammndchargstatute
disp.ammndchargstatute = disp.chrgdispdate
disp.chrgdispdate = disp.chrgdisp
disp.chrgdisp = disp.chrgclass
disp.chrgclass = disp.chrgtype2
disp.chrgtype2 = disp.chrgtype
disp.chrgtype = disp.chrgdesc
disp.chrgdesc = disp.statute
disp.statute = disp.sex
disp.sex = disp.initial_date
disp.initial_date = disp.arrest_date
disp.arrest_date = disp.dob
disp.dob = disp.fbiidno
disp.fbiidno = disp.statepoliceid
disp.statepoliceid = disp.fgrprntno
disp.fgrprntno = disp.ctlbkngno
disp.ctlbkngno = disp.zipcode
disp.zipcode = disp.city_state
disp.city_state = "CHICAGO, IL"
disp.st_address = "10716 S AVENUE M"
logging.info("Fixing shifted cells due to st_address in RawDisposition "
"with pk {}".format(disp.pk))
disp.save()