forked from sc3/cook-convictions-data
Use Postgres, document all ETL steps
As we near launch, I wanted to make sure we could rebuild the analysis database from scratch. I found insert and update speeds to be slow with SQLite, my original database choice, so I wanted to try Postgres for the rebuild. While it may be possible to update the code to work better with SQLite, we don't have a lot of time and I'd likely lose some of the convenient abstractions of the Django ORM layer.

In the process, I updated the README so that all steps in the transform/processing pipeline are reflected.

This commit includes updates to make the processing pipeline work with Postgres. One of the big differences is that Postgres seems much pickier about the lengths of text fields. That is actually great as a way to find data quirks, but a number of updates were needed to fix data errors.

When parsing city and state fields, return the state abbreviation, not the value of the state field. Hack around false positives when looking up states using the ``us`` package.

Update the ``load_dispositions_csv`` management command to insert records in batches to avoid the process crashing or Postgres generating an error. Also, fix columns that were shifted due to bad CSV quoting when dispositions are loaded in ``load_dispositions_csv``.

Update the ``create_dispositions`` management command to process the ``RawDisposition`` instances in batches to avoid running out of memory.

Add assertions in the ``Disposition`` methods that load and parse values from the ``RawDisposition`` fields to make sure things like state or charge class are of the correct length.

Cleanly handle exceptions when trying to detect the IUCR code for a statute.

Fix the name field of the ``CenusPlace`` model so it's 100 characters long instead of 7. This was just a typo in the original code. You'll have to run ``manage.py migrate convictions_data`` to update your database with this change.
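The batching described above for ``load_dispositions_csv`` and ``create_dispositions`` could look roughly like the sketch below. This is an illustration, not the code in this commit: the ``batched`` and ``load_in_batches`` helpers, the ``insert_batch`` callback, and the default batch size of 1000 are all assumptions.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most ``batch_size`` items from ``iterable``.

    Hypothetical helper: only one batch is held in memory at a time,
    which is the point of batching a large CSV or queryset.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(rows, insert_batch, batch_size=1000):
    """Hand ``rows`` to ``insert_batch`` one batch at a time.

    ``insert_batch`` is an assumed callback that writes one list of
    rows to the database. Returns the number of rows handed over.
    """
    total = 0
    for batch in batched(rows, batch_size):
        insert_batch(batch)
        total += len(batch)
    return total
```

In a Django management command, ``insert_batch`` might wrap something like ``RawDisposition.objects.bulk_create(batch)``, assuming each item in the batch is an unsaved model instance.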
Add the missing ``place`` field to the ``CONVICTIONS_IMPORT_FIELDS`` list used when creating convictions from dispositions.

Update the extra SQL in IUCR code queries to work with Postgres. This is mostly an issue of how things are quoted.

Remove trailing whitespace throughout modified files.

Update age-based queries to use Postgres functions to calculate the age. This likely breaks things in SQLite. Also remove ``AgeQuerySetMixin.with_ages``, since Postgres doesn't let you use a ``SELECT`` alias in the ``WHERE`` clause: the ``WHERE`` clause is evaluated first. Instead, we have to include the age expression directly within the ``WHERE`` clause. The upside is that we can now use the ``count()`` method to get the crimes by type instead of having to use ``len()`` and evaluate the queryset. This query still runs kind of slowly, however.
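As a rough illustration of putting the age expression directly in the ``WHERE`` clause, the sketch below builds fragments that could be passed to Django's ``QuerySet.extra(where=..., params=...)``. It is not the code in this commit: the ``age_where_clause`` helper and the ``initial_date`` and ``dob`` column names are assumptions. ``age()`` and ``date_part()`` are standard Postgres functions.

```python
def age_where_clause(min_age=None, max_age=None,
                     date_field="initial_date", dob_field="dob"):
    """Build WHERE fragments that filter on age computed in Postgres.

    Hypothetical helper. The column names are assumed and are
    interpolated straight into the SQL, so they must be trusted
    identifiers and never user input. The age bounds are passed as
    query parameters.
    """
    # The expression is repeated in every condition because a SELECT
    # alias cannot be referenced from the WHERE clause.
    age_expr = "date_part('year', age({0}, {1}))".format(date_field, dob_field)
    clauses = []
    params = []
    if min_age is not None:
        clauses.append(age_expr + " >= %s")
        params.append(min_age)
    if max_age is not None:
        clauses.append(age_expr + " <= %s")
        params.append(max_age)
    return clauses, params
```

With a queryset ``qs``, this could be used as ``qs.extra(where=clauses, params=params).count()``, which lets the database do the counting instead of evaluating the queryset and calling ``len()``.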
Showing 10 changed files with 517 additions and 111 deletions.