Use Postgres, document all ETL steps

As we near launch, I wanted to make sure we could rebuild the analysis database from scratch. I found insert and update speeds to be slow with SQLite, my original database choice, so I wanted to try Postgres for the rebuild. While it may be possible to update the code to work better with SQLite, we don't have a lot of time and I'd likely lose some of the convenient abstractions of the Django ORM layer. In the process, I updated the README so that all steps in the transform/processing pipeline are reflected.

This commit includes updates to make the processing pipeline work with Postgres. One of the big differences is that Postgres seems much pickier about the lengths of text fields, which is actually a great way to find data quirks, but a number of updates were needed to fix data errors.

* When parsing city and state fields, return the state abbreviation, not the raw value of the state field.
* Hack around false positives when looking up states using the ``us`` package.
* Update the ``load_dispositions_csv`` management command to insert records in batches to avoid the process crashing or Postgres generating an error.
* Also fix columns shifted by bad CSV quoting when dispositions are loaded in ``load_dispositions_csv``.
* Update the ``create_dispositions`` management command to process the ``RawDisposition`` instances in batches to avoid running out of memory.
* Add assertions in the ``Disposition`` methods that load and parse values from the ``RawDisposition`` fields to make sure things like state or charge class are of the correct length.
* Cleanly handle exceptions when trying to detect the IUCR code for a statute.
* Fix the name field of the ``CensusPlace`` model so it's 100 chars long instead of 7. This was just a typo in the original code. You'll have to run ``manage.py migrate convictions_data`` to update your database with this change.
* Add missing ``place`` field to the ``CONVICTIONS_IMPORT_FIELDS`` list when creating convictions from dispositions.
* Update extra SQL in IUCR code queries to work with Postgres. This is mostly an issue of how things are quoted.
* Remove trailing whitespace throughout modified files.
* Update age-based queries to use Postgres functions to calculate the age. This likely breaks things in SQLite. Also remove ``AgeQuerySetMixin.with_ages``, since Postgres doesn't let you use an alias in the ``WHERE`` clause because the ``WHERE`` clause is evaluated first. Instead, we have to include the age expression directly within the ``WHERE`` clause. The upside is that we can now use the ``count()`` method to get the crimes by type instead of having to use ``len()`` and evaluating the queryset. This query still runs kind of slow, however.
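
The point about aliases in the ``WHERE`` clause can be made concrete with a small sketch. The ``age_where`` helper below is hypothetical (it is not part of this commit) and assumes columns named ``dob`` and ``initial_date``, mirroring the raw disposition fields. It builds a ``WHERE`` fragment that repeats the age expression inline instead of referring to a ``SELECT`` alias. ``age()`` and ``date_part()`` are standard Postgres functions.

```python
# Hypothetical column names; the real model fields may differ.
AGE_SQL = "date_part('year', age(initial_date, dob))"


def age_where(min_age=None, max_age=None):
    """Build a WHERE fragment and parameter list for an age range.

    The age expression is repeated inline because Postgres does not
    allow a SELECT alias to be referenced from the WHERE clause.
    """
    clauses = []
    params = []
    if min_age is not None:
        clauses.append(AGE_SQL + " >= %s")
        params.append(min_age)
    if max_age is not None:
        clauses.append(AGE_SQL + " <= %s")
        params.append(max_age)
    return " AND ".join(clauses), params
```

A fragment like this could presumably be handed to ``QuerySet.extra(where=[...], params=[...])``, after which ``count()`` can run as a single ``SELECT COUNT(*)`` rather than pulling every row into Python.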
ghing · Oct 28, 2014 · 31e0752
1 parent 3f51fa0
commit 31e0752

Showing 10 changed files with 517 additions and 111 deletions.
diff --git a/README.rst b/README.rst
@@ -12,6 +12,24 @@ Quickstart
 Installation
 ------------
 
+Create spatial database
+~~~~~~~~~~~~~~~~~~~~~~~
+
+PostGIS
+
+::
+
+    $ createdb convictions
+    $ psql convictions
+    > CREATE EXTENSION postgis;
+    > CREATE EXTENSION postgis_topology;
+
+Spatialite
+
+::
+
+    spatialite convictions.sqlite3 "SELECT InitSpatialMetaData();"
+
 ::
 
     git clone https://github.com/sc3/cook-convictions-data.git
@@ -20,14 +38,28 @@ Installation
     pip install -r requirements.txt
     cp convictions/settings/dev.example.py convictions/settings/dev.py
     # Edit convictions/settings/dev.py to fill in the needed variables
-    spatialite convictions.sqlite3 "SELECT InitSpatialMetaData();"
     ./manage.py syncdb
     ./manage.py migrate
 
+We use `DataMade's <http://datamade.us/>`_ `usaddress <https://github.com/datamade/usaddress>`_ package to parse addresses when anonymizing them to the block level.  However, the stable version of the package doesn't support Python 3. In a pinch, we use a fork that I made that adds rough Python 3 support.  We install this fork as editable, so we need to do the training.
+
+::
+
+    workon convictions
+    cd /path/to/virtualenv/src/usaddress
+    python training/training.py
+
+
 Load spatial data
 -----------------
 
-First, download and unpack the Shapefile version of the Cook County Municipalities data from https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk
+Download and unpack the Shapefile version of Chicago Community Areas.
+
+Then run::
+
+     ./manage.py load_spatial_data CommunityArea data/Comm_20Areas/CommAreas.shp
+
+Download and unpack the Shapefile version of the Cook County Municipalities data from https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk
 
 Then run::
 
@@ -45,20 +77,57 @@ Then run::
 
     ./manage.py load_spatial_data CensusPlace data/tl_2010_17_place10/tl_2010_17_place10.shp
 
+
+Load census data
+----------------
+
+::
+
+    ./manage.py load_aff_data CensusTract total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__tracts.csv
+
+    ./manage.py load_aff_data CensusTract per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__tracts.csv
+
+    ./manage.py load_aff_data CensusPlace total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__places.csv
+
+    ./manage.py load_aff_data CensusPlace per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__places.csv
+
+Aggregate census data to Chicago Community Areas
+------------------------------------------------
+
+::
+
+    ./manage.py aggregate_census_fields
+
+
+Identify suburbs
+----------------
+
+::
+
+    ./manage.py flag_chicago_msa_places data/tl_2010_17_place10_chicago_msa.csv
+
+
 Load raw dispositions data
 --------------------------
 
+This command will also fix known issues with columns being shifted in some rows due to bad escaping of quoted columns in the raw CSV file.
+
+Note that the ``--delete`` flag removes any previous records.
+
 ::
 
-    ./manage.py load_dispositions_csv data/Criminal_Convictions_ALLCOOK_05-09.csv
+    ./manage.py load_dispositions_csv --delete data/Criminal_Convictions_ALLCOOK_05-09.csv
 
 
 Populate clean disposition records
 ----------------------------------
 
+Note that the ``--delete`` flag removes any previous records.
+
 ::
 
-    ./manage.py create_dispositions
+    ./manage.py create_dispositions --delete
+
 
 Geocode disposition records
 ---------------------------
@@ -67,18 +136,22 @@ Geocode disposition records
 
     ./manage.py geocode_dispositions
 
-Load census data
-----------------
+
+Detect Community Area and Census Place boundaries
+-------------------------------------------------
 
 ::
 
-    ./manage.py load_aff_data CensusTract total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__tracts.csv
+    ./manage.py boundarize
 
-    ./manage.py load_aff_data CensusTract per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__tracts.csv
 
-    ./manage.py load_aff_data CensusPlace total_population GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B01003_with_ann__totpop__places.csv
+Create convictions records from the dispositions
+------------------------------------------------
+
+::
+
+    ./manage.py create_convictions --delete
 
-    ./manage.py load_aff_data CensusPlace per_capita_income GEO.id2 HD01_VD01 HD02_VD01 data/ACS_10_5YR_B19301_with_ann__per_capita_income__places.csv
 
 Export Community Area and Census Place GeoJSON
 ----------------------------------------------
@@ -95,7 +168,7 @@ Extract Chicago's border from a shapefile
 
 ::
 
-    ./manage.py chicago_geojson_from_shp data/tl_2010_17_place10/tl_2010_17_place10.shp > chicago.json 
+    ./manage.py chicago_geojson_from_shp data/tl_2010_17_place10/tl_2010_17_place10.shp > chicago.json
 
 Export convictions by age bucket
 --------------------------------
@@ -105,6 +178,16 @@ Export convictions by age bucket
    ./manage.py export_age_json > convictions_by_age.json
 
 
+Export disposition data
+-----------------------
+
+Export Disposition model records to CSV.  Anonymize the data by dropping personal identifier fields and converting address fields to the block.  For example, an address number of "2707" would be converted to "2700".
+
+::
+
+    ./manage.py export_csv > dispositions.csv
+
+
 Manual Processes
 ================
 
@@ -129,6 +212,7 @@ I created a list of these census places by bringing the TIGER shapefile for Illi
 
      ogr2ogr -f CSV tl_2010_17_place10_chicago_msa.csv tl_2010_17_place10_chicago_msa/tl_2010_17_place10_chicago_msa.shp
 
+
 Loading conviction places from dispositions
 -------------------------------------------
 
@@ -140,11 +224,11 @@ Because we added places mid-process, I didn't want to re-create Conviction recor
 Other datasets
 ==============
 
-* `Boundaries - Community Areas (current) <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6>`_ 
+* `Boundaries - Community Areas (current) <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6>`_
 * `Cook County Municipalities <https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Municipality/ta8t-zebk>`_
 * `Boundaries - Census Tracts - 2010 <https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik>`_
 * `2010 Illinois Census Place TIGER Shapefile <http://www2.census.gov/geo/tiger/TIGER2010/PLACE/2010/tl_2010_17_place10.zip>`_
 * 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Cook County Census Tracts
-* 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Illinois Census Places 
+* 2010 ACS 5-year Estimates "TOTAL POPULATION" (B01003) for Illinois Census Places
 * 2010 ACS 5-year Estimates "PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2010 INFLATION-ADJUSTED DOLLARS)" (B19301) for Cook County Census Tracts
 * `2010 ACS 5-year Estimates "PER CAPITA INCOME IN THE PAST 12 MONTHS (IN 2010 INFLATION-ADJUSTED DOLLARS)" (B19301) for Illinois Census Places <http://factfinder2.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_10_5YR_B19301&prodType=table>`_
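
The block-level anonymization described in the new "Export disposition data" section of the README comes down to rounding the street number down to the nearest hundred. A minimal sketch, assuming the number has already been parsed out of the address; the ``to_block_number`` helper is hypothetical and is not the function used by ``export_csv``:

```python
def to_block_number(address_number):
    """Round a street number down to its hundred block.

    Accepts an int or a numeric string and returns a string,
    e.g. "2707" becomes "2700".
    """
    return str(int(address_number) // 100 * 100)
```

Integer division discards the last two digits, so every address on a block maps to the same value.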
diff --git a/convictions_data/cleaner.py b/convictions_data/cleaner.py
@@ -9,19 +9,30 @@ class CityStateSplitter(object):
     # Strings that represent states but are not official abbreviations
     MOCK_STATES = set(['ILL', 'I', 'MX'])
 
+    # HACK: These give a false positive when trying to match against state names
+    NOT_STATES = set(['MONEE'])
+
     @classmethod
     def split_city_state(cls, city_state):
         city_state = cls.PUNCTUATION_RE.sub(' ', city_state)
         bits = re.split(r'\s+', city_state.strip())
 
         last = bits[-1]
 
-        if us.states.lookup(last) or last in cls.MOCK_STATES:
-            state = last 
+        state_lookup = us.states.lookup(last)
+        if last not in cls.NOT_STATES and (state_lookup or last in cls.MOCK_STATES):
+            if state_lookup:
+                state = state_lookup.abbr
+            else:
+                state = last
             city_bits = bits[:-1]
         elif len(last) >= 2 and (us.states.lookup(last[-2:]) or
                 last[-2:] in cls.MOCK_STATES):
-            state = last[-2:]
+            state_lookup = us.states.lookup(last[-2:])
+            if state_lookup:
+                state = state_lookup.abbr
+            else:
+                state = last[-2:]
             city_bits = bits[:-1] + [last[:-2]]
         else:
             state = ""
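
To illustrate the first branch of the change above, here is a simplified, standalone sketch. It is not the project's code: a small lookup table stands in for ``us.states.lookup``, the punctuation pattern is assumed because the real ``PUNCTUATION_RE`` is not shown in this diff, and the branch that handles a state glued to the end of the city name is left out.

```python
import re

# Stand-in for us.states.lookup(): maps a few names and abbreviations to
# the official abbreviation. The real package covers every state.
KNOWN_STATES = {"IL": "IL", "ILLINOIS": "IL", "IN": "IN", "INDIANA": "IN"}
MOCK_STATES = {"ILL", "I", "MX"}
NOT_STATES = {"MONEE"}
PUNCTUATION_RE = re.compile(r"[,.]")  # assumed; the real pattern isn't shown


def lookup_state(token):
    """Return the official abbreviation for token, or None."""
    return KNOWN_STATES.get(token.upper())


def split_city_state(city_state, lookup=lookup_state):
    """Split 'CITY STATE' into (city, state), normalizing the state."""
    cleaned = PUNCTUATION_RE.sub(" ", city_state)
    bits = re.split(r"\s+", cleaned.strip())
    last = bits[-1]
    abbr = lookup(last)
    # Skip tokens known to be false positives, even if the lookup matches.
    if last not in NOT_STATES and (abbr or last in MOCK_STATES):
        # Prefer the official abbreviation; fall back to the mock value.
        return " ".join(bits[:-1]), abbr or last
    return " ".join(bits), ""
```

With this shape, ``"CHICAGO, ILLINOIS"`` yields the state ``"IL"`` rather than ``"ILLINOIS"``, and a token listed in ``NOT_STATES`` stays part of the city even when the lookup returns a match.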

diff --git a/convictions_data/management/commands/create_convictions.py b/convictions_data/management/commands/create_convictions.py
@@ -22,7 +22,6 @@ def handle(self, *args, **options):
             Conviction.objects.all().delete()
             Disposition.objects.in_analysis().update(conviction=None)
 
-        # TODO: Update this once we clean the misaligned data
         qs = Disposition.objects.in_analysis().filter(chrgclass__regex=r'^[A-Z0-9]{0,1}$')
 
         with transaction.atomic():

diff --git a/convictions_data/management/commands/create_dispositions.py b/convictions_data/management/commands/create_dispositions.py
@@ -7,21 +7,32 @@
 class Command(BaseCommand):
     help = "Create clean disposition records from raw data"
 
+    BATCH_SIZE = 5000
+
     option_list = BaseCommand.option_list + (
         make_option('--delete',
             action='store_true',
             dest='delete',
             default=False,
             help="Delete previously created models",
         ),
+        make_option('--batch-size',
+            action='store',
+            type='int',
+            default=BATCH_SIZE,
+            dest='batch_size',
+            help="Process in batches of this number of records"),
     )
 
     def handle(self, *args, **options):
         if options['delete']:
             Disposition.objects.all().delete()
 
-        models = []
-        for rd in RawDisposition.objects.all():
-            models.append(Disposition(raw_disposition=rd))
+        num_disps = RawDisposition.objects.count()
+        for i in range(0, num_disps, options['batch_size']):
+            models = []
+            raw_disps = RawDisposition.objects.all().order_by('case_number')[i:i+options['batch_size']]
+            for rd in raw_disps:
+                models.append(Disposition(raw_disposition=rd))
 
-        Disposition.objects.bulk_create(models)
+            Disposition.objects.bulk_create(models)
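
Both management commands in this commit work through their records in fixed-size slices. A framework-free sketch of that pattern follows; ``iter_batches`` is a hypothetical helper, not something the commit adds.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Two related points may be worth checking. Django's ``bulk_create`` also accepts a ``batch_size`` argument, which could replace manual slicing on the insert side. And when slicing a queryset repeatedly, as ``create_dispositions`` does, the ordering needs to be stable: if ``case_number`` is not unique, ordering by a unique column such as the primary key avoids rows being skipped or repeated between slices.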
diff --git a/convictions_data/management/commands/load_dispositions_csv.py b/convictions_data/management/commands/load_dispositions_csv.py
@@ -1,4 +1,5 @@
 import csv
+import logging
 from optparse import make_option
 
 from django.core.management.base import BaseCommand
@@ -9,13 +10,23 @@ class Command(BaseCommand):
     args = "<csv_filename>"
     help = "Load raw dispositions CSV into database models"
 
+    # Number of records to insert at once as it fails if we try to insert
+    # all the records at once
+    BATCH_SIZE = 5000
+
     option_list = BaseCommand.option_list + (
         make_option('--delete',
             action='store_true',
             dest='delete',
             default=False,
             help="Delete previously loaded models",
         ),
+        make_option('--batch-size',
+            action='store',
+            type='int',
+            default=BATCH_SIZE,
+            dest='batch_size',
+            help="Process in batches of this number of records"),
     )
 
     def handle(self, *args, **options):
@@ -31,4 +42,75 @@ def handle(self, *args, **options):
                 model_kwargs = {k.lower():v for k, v in row.items()}
                 models.append(RawDisposition(**model_kwargs))
 
-        RawDisposition.objects.bulk_create(models)
+        for i in range(0, len(models), options['batch_size']):
+            RawDisposition.objects.bulk_create(models[i:i+options['batch_size']])
+
+        self.fix_shifted()
+
+    def fix_shifted(self):
+        """Fix columns that were shifted due to bad escaping in the CSV"""
+        # HACK: This is overly verbose and could be generalized.  For now,
+        # just do this explicitly, but if we run into more examples that need
+        # this, we should make a more general solution for shifting columns
+
+        # TODO: Move this into a separate management command
+        bad_chrgdesc = "RIFLE <16''/SHOTGUN <18\",F\""
+        disps = RawDisposition.objects.filter(chrgdesc=bad_chrgdesc)
+        for disp in disps:
+            disp.amtoffine = disp.maxsent
+            disp.maxsent = disp.minsent
+            disp.ammndchrgclass = disp.ammndchrgtype
+            disp.ammndchrgtype = disp.ammndchrgdescr
+            disp.ammndchrgdescr = disp.ammndchargstatute
+            disp.ammndchargstatute = disp.chrgdispdate
+            disp.chrgdispdate = disp.chrgdisp
+            disp.chrgdisp = disp.chrgclass
+            disp.chrgclass = disp.chrgtype2
+            disp.chrgtype2 = disp.chrgtype
+            disp.chrgtype = "F"
+            disp.chrgdesc = "RIFLE <16''/SHOTGUN <18\""
+            logging.info("Fixing shifted cells due to chrgdesc in RawDisposition "
+                "with pk {}".format(disp.pk))
+            disp.save()
+
+        disps = RawDisposition.objects.filter(ammndchrgdescr=bad_chrgdesc)
+        for disp in disps:
+            disp.amtoffine = disp.maxsent
+            disp.maxsent = disp.minsent
+            disp.ammndchrgclass = disp.ammndchrgtype
+            disp.ammndchrgtype = "F"
+            disp.ammndchrgdescr = "RIFLE <16''/SHOTGUN <18\""
+            logging.info("Fixing shifted cells due to ammndchrgdescr in RawDisposition "
+                "with pk {}".format(disp.pk))
+            disp.save()
+
+        bad_address = "10716 S AVENUE M\",CHICAGO     IL\""
+        disps = RawDisposition.objects.filter(st_address=bad_address)
+        for disp in disps:
+            disp.amtoffine = disp.maxsent
+            disp.maxsent = disp.minsent
+            disp.ammndchrgclass = disp.ammndchrgtype
+            disp.ammndchrgtype = disp.ammndchrgdescr
+            disp.ammndchrgdescr = disp.ammndchargstatute
+            disp.ammndchargstatute = disp.chrgdispdate
+            disp.chrgdispdate = disp.chrgdisp
+            disp.chrgdisp = disp.chrgclass
+            disp.chrgclass = disp.chrgtype2
+            disp.chrgtype2 = disp.chrgtype
+            disp.chrgtype = disp.chrgdesc
+            disp.chrgdesc = disp.statute
+            disp.statute = disp.sex
+            disp.sex = disp.initial_date
+            disp.initial_date = disp.arrest_date
+            disp.arrest_date = disp.dob
+            disp.dob = disp.fbiidno
+            disp.fbiidno = disp.statepoliceid
+            disp.statepoliceid = disp.fgrprntno
+            disp.fgrprntno = disp.ctlbkngno
+            disp.ctlbkngno = disp.zipcode
+            disp.zipcode = disp.city_state
+            disp.city_state = "CHICAGO, IL"
+            disp.st_address = "10716 S AVENUE M"
+            logging.info("Fixing shifted cells due to st_address in RawDisposition "
+                "with pk {}".format(disp.pk))
+            disp.save()
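
The HACK comment in ``fix_shifted`` notes that the column shifting could be generalized. One possible shape for that is sketched below, operating on plain dictionaries rather than model instances. The ``unshift_row`` helper and the short field list in the usage note are hypothetical; a real version would need the full, ordered list of CSV columns.

```python
def unshift_row(row, fields, bad_field, fixed_values):
    """Return a repaired copy of a row whose columns were shifted left.

    fields is the ordered list of column names. bad_field is the column
    holding several values merged together by bad quoting, and
    fixed_values are the values it should have been split into. Every
    later value moves right to make room; values pushed past the last
    column are discarded.
    """
    start = fields.index(bad_field)
    values = [row[name] for name in fields]
    repaired = values[:start] + list(fixed_values) + values[start + 1:]
    return dict(zip(fields, repaired[:len(fields)]))
```

For example, with ``fields = ["chrgdesc", "chrgtype", "chrgclass", "amtoffine"]``, calling ``unshift_row(row, fields, "chrgdesc", ["RIFLE", "F"])`` puts ``"F"`` in ``chrgtype`` and moves the old ``chrgtype`` and ``chrgclass`` values one column to the right, which is the same movement the explicit assignments above perform.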