Map charge descriptions to crosswalk categories #6

Open
bepetersn opened this issue Aug 11, 2014 · 13 comments

@bepetersn
Member

For statutes with multiple IUCR possibilities, make a CSV mapping between charge descriptions and crosswalk categories. Also map to IUCR codes where an obvious mapping can be made, and eventually to our own categories of interest.

@bepetersn bepetersn changed the title Make csv mapping between charge descriptions and crosswalk categories For statutes with multiple IUCR possibilties, make csv mapping between charge descriptions and crosswalk categories Aug 11, 2014
@bepetersn bepetersn changed the title For statutes with multiple IUCR possibilties, make csv mapping between charge descriptions and crosswalk categories Map charge descriptions to crosswalk categories Aug 11, 2014
@bepetersn bepetersn self-assigned this Aug 11, 2014
@bepetersn
Member Author

@ghing,

It turns out that charge descriptions have more reliable mappings to IUCR categories than ILCS statutes do, but they're still not perfect: 15.6% of unique charge descriptions map to multiple IUCR categories.

Under Miscellaneous convictions data I've uploaded two JSON files:

  • one containing the mapping of all charge descriptions to the IUCR categories they appear with in our data
  • one containing the same mapping, filtered to ones where there are multiple IUCR categories for a given charge description.
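
A minimal sketch of how the two files described above could be derived, assuming the data is available as (charge description, IUCR category) pairs. The function names and the sample rows are hypothetical and are not taken from the project code.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Group the IUCR categories seen with each charge description."""
    mapping = defaultdict(set)
    for chrgdesc, category in pairs:
        mapping[chrgdesc].add(category)
    # Sorted lists serialize to JSON deterministically.
    return {desc: sorted(cats) for desc, cats in mapping.items()}


def only_multiples(mapping):
    """Keep only descriptions that appear with more than one category."""
    return {desc: cats for desc, cats in mapping.items() if len(cats) > 1}


# Hypothetical sample rows, for illustration only.
pairs = [
    ("ATT.FORGERY", "Battery"),
    ("ATT.FORGERY", "Burglary"),
    ("RETAIL THEFT", "Theft"),
    ("RETAIL THEFT", "Theft"),
]
all_mapping = build_mapping(pairs)
multiples = only_multiples(all_mapping)
# Share of unique charge descriptions that are ambiguous.
share_ambiguous = len(multiples) / len(all_mapping)
```

The share computed on the last line is the same kind of figure as the 15.6% quoted above, here over the toy rows only.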

More to come...

@ghing
Contributor

ghing commented Aug 18, 2014

@bepetersn. I'll take a look at this today. We might just have to make our own call for mapping the ambiguous ones and document it clearly.

@ghing
Contributor

ghing commented Aug 19, 2014

@bepetersn I'm a little confused by these. For example, this mapping from chrgdesc_to_category__multiples.json:

    "ATT.FORGERY": [
        "Battery",
        "Burglary"
    ],

I don't see why this description would match either of these categories. Are the JSON files based on ILCS -> IUCR, on charge description, or on a combination? Do you have code that implements your methodology for generating these mappings?

bepetersn added a commit that referenced this issue Aug 20, 2014
First attempt towards #6; create a JSON file mapping chrgdesc to category
@bepetersn
Member Author

See my code at: #10

Several notes:

  • I looked at dispositions here, instead of convictions, if that matters. My database is out of date with respect to some of the "final" fields you added, and as a result I couldn't figure out how to access these fields from the Conviction model.
  • With regard to your question about that example of "ATT.FORGERY" in the charge description field and "Battery" and "Burglary" in the IUCR categories, it is weird. Running the SQL on my database, however, I can confirm that this is what I see.
  • Finally, if my belief is correct, this mapping from charge descriptions to IUCR categories doesn't cover any dispositions that couldn't be given an IUCR code, since I think the IUCR code is where we got the category from. Thus it doesn't yet fulfill our goal of getting more categories from the data. Realizing this now, I need to iterate on this to build a fuller list by starting with statutes again.

@ghing
Contributor

ghing commented Aug 20, 2014

@bepetersn I'll take a look at #10.

Thanks for clarifying the "ATT. FORGERY" issue. It sounds like there might be some disconnect for some records between the statute and the chargedesc. I'll take a quick peek and let you know what I find.

It shouldn't matter that you looked at dispositions since that's the source for the convictions anyway. It just means that you're looking through more records.

The mappings from charge description to IUCR category won't capture records that didn't have an IUCR code calculated from statute.

I think the first step moving forward would be to start making our own mapping from (final)_chargedesc to IUCR categories based on looking at the values of the charge description. For instance, "ATT. FORGERY" seems like it clearly maps to the "ATTEMPTED FORGERY" IUCR category. I think we could figure out mappings for most descriptions. Does this make sense to you?
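
As a rough illustration of the hand-made mapping proposed above, the sketch below normalizes a charge description before looking it up, so that "ATT. FORGERY" and "ATT.FORGERY" hit the same entry. The helper names and the single dictionary entry are assumptions for illustration; the real table would be filled in by reading through the descriptions.

```python
import re

# Hand-curated mapping from normalized charge description to IUCR category.
# The single entry is the example from this thread; the rest is hypothetical.
MANUAL_CATEGORIES = {
    "ATT.FORGERY": "ATTEMPTED FORGERY",
}


def normalize_chrgdesc(desc):
    """Uppercase, collapse whitespace, and drop spaces after periods."""
    desc = desc.upper().strip()
    desc = re.sub(r"\s+", " ", desc)
    desc = re.sub(r"\.\s+", ".", desc)
    return desc


def categorize(desc):
    """Return the hand-assigned category, or None if nobody has mapped it yet."""
    return MANUAL_CATEGORIES.get(normalize_chrgdesc(desc))
```

Returning None for unmapped descriptions keeps the gaps visible, which makes it easy to list what still needs a manual decision.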

@ghing
Contributor

ghing commented Aug 20, 2014

@bepetersn, FYI I've uploaded a recent snapshot of the database to drive.

@ghing
Contributor

ghing commented Aug 22, 2014

@bepetersn, I took a look at just the "ATT.FORGERY" case. It seems like there might have been some difficulty parsing the statute field to get the IUCR code/category which your management command was using to grab the categories. This makes sense because, at least for these dispositions, it looks like they tried to cram two different statutes into one field. 😿

    dispositions = Disposition.objects.filter(final_chrgdesc="ATT.FORGERY")
    for d in dispositions:
        print(d.case_number, d.final_statute)

The output is:

    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2007CR1475501 720-5/8-4(720-5/17-3(A)(1)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR0853801 720-5/8-4(720-5/17-3(A)(2)
    2008CR1582701 720-5(8-4)/17-3(A)(2)

720-5/17-3 maps to forgery, so I'm not sure where the Battery and Burglary mappings are coming from.
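
One way to cope with fields like these would be to pull every chapter-act/section citation out of the raw string before looking anything up. The sketch below is not the project's ILCS parser; the regular expressions and the function name are assumptions that only cover the two shapes seen in the output above, and subsections such as (A)(1) are deliberately dropped.

```python
import re

# A section such as "8-4" or "17-3"; letter and decimal suffixes are optional.
SECTION = r"\d+[A-Z]?-\d+(?:\.\d+)?"

# Shape 1: one or more full citations, e.g. "720-5/8-4(720-5/17-3(A)(1)".
FULL = re.compile(r"(\d{3})-(\d+)/(" + SECTION + r")")

# Shape 2: a section wedged into the prefix, e.g. "720-5(8-4)/17-3(A)(2)".
INLINE = re.compile(r"(\d{3})-(\d+)\((" + SECTION + r")\)/(" + SECTION + r")")


def split_statutes(raw):
    """Return every chapter-act/section citation found in a raw statute field."""
    inline = INLINE.match(raw)
    if inline:
        chapter, act, first, second = inline.groups()
        return [f"{chapter}-{act}/{first}", f"{chapter}-{act}/{second}"]
    return [f"{c}-{a}/{s}" for c, a, s in FULL.findall(raw)]
```

For both malformed shapes this yields 720-5/8-4 and 720-5/17-3, so a lookup could try each citation separately instead of failing on the combined string.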

This makes me think that some of the more weird mappings might be due to parse issues. I wonder if we're better off making our own map of chargedesc to category.

Have you done any more digging into this?

@bepetersn
Member Author

Hey @ghing, I did some more digging. The majority of the multiples that we saw before were of the type that you said: coming from parsing errors, whether in the ILCS or IUCR modules. After removing a bug in my chrgdesc2category command (which was allowing some instances of parsing errors to go through to my generated list of multiples), I found that the number of multiples went down to just 45, from around 265.

I believe the mapping now also includes cases where a charge description is associated with multiple IUCR codes that all share the same IUCR category.

Finally, after adding a check to make sure the category is found in the IUCR crosswalk along the lines of what I talked about in #14, the number of multiples went down to 3 (I might be able to get it to none).
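
The filtering steps described above could look roughly like the following. This is a sketch rather than the actual chrgdesc2category command: the row layout, the placeholder IUCR codes, and the crosswalk set are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the categories listed in the IUCR crosswalk.
CROSSWALK_CATEGORIES = {"Forgery", "Theft", "Burglary"}


def build_clean_mapping(rows, valid_categories):
    """Map charge descriptions to categories, skipping unreliable rows.

    rows: iterable of (chrgdesc, iucr_code, category) tuples, where
    category is None when statute parsing failed.
    """
    mapping = defaultdict(set)
    for chrgdesc, iucr_code, category in rows:
        if category is None:
            continue  # parse error: do not let it reach the mapping
        if category not in valid_categories:
            continue  # category missing from the crosswalk
        # Different IUCR codes with the same category collapse into one entry.
        mapping[chrgdesc].add(category)
    return {desc: sorted(cats) for desc, cats in mapping.items()}


# Placeholder codes ("A1", "A2", ...) are not real IUCR codes.
rows = [
    ("FORGERY", "A1", "Forgery"),
    ("FORGERY", "A2", "Forgery"),
    ("RETAIL THEFT", "B1", "Theft"),
    ("RETAIL THEFT", None, None),
    ("BURGLARY", "C1", "Burglry"),
]
clean = build_clean_mapping(rows, CROSSWALK_CATEGORIES)
remaining_multiples = {d: c for d, c in clean.items() if len(c) > 1}
```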

I need to run a check to see how many of the convictions I'm actually able to reliably account for using this new mapping of charge descriptions to IUCR categories, but I'm somewhat hopeful. For now, the new chrgdesc_to_category__all.json and __multiples.json files are on the Drive folder.

I'm also going to upload my code tonight.

@bepetersn
Member Author

So this time I was able to make a one-to-one mapping from charge description to IUCR category for 80.39% of convictions, which leaves 27,743 records missed. A little bit worse, but I think we could make it better.
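
For reference, a coverage figure of this kind could be computed as in the sketch below, assuming a mapping from charge description to a list of categories. The function name and the sample data are hypothetical.

```python
def coverage(chrgdescs, mapping):
    """Return (share of records mapped one-to-one, number of records missed).

    chrgdescs: one charge description per conviction record.
    mapping: charge description -> list of IUCR categories.
    """
    mapped = sum(1 for desc in chrgdescs if len(mapping.get(desc, [])) == 1)
    missed = len(chrgdescs) - mapped
    share = mapped / len(chrgdescs) if chrgdescs else 0.0
    return share, missed


# Hypothetical data: one ambiguous description and one that is unmapped.
mapping = {
    "FORGERY": ["Forgery"],
    "ATT.FORGERY": ["Battery", "Burglary"],
}
records = ["FORGERY", "FORGERY", "ATT.FORGERY", "UNKNOWN CHARGE", "FORGERY"]
share, missed = coverage(records, mapping)
```

Records whose description is ambiguous or absent from the mapping both count as missed here.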

@ghing
Contributor

ghing commented Sep 2, 2014

@bepetersn, let's hold off on working on this further until I finish a pass on my drug queries so we can figure out the best approach for this. I think we'll want to focus on our areas of interest rather than trying to get a clean category for every charge.

@bepetersn
Member Author

Ok. You should see the two new files, though. I've mostly got the mapping created. The multiples file is 246 items long. We can easily roll most of them up into the categorizations you are defining (most of them are going to map to a property crime, a few to sexual assault, etc.). The other 1300-some charge descriptions each map to a single category, and we should be able to decide just as easily how to roll these single categories up into property/sexual/drug/violent.

The only thing I really want to do still is turn these JSON files into a CSV table.
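
A minimal sketch of that conversion using only the standard library, writing one row per (charge description, category) pair. The column names are assumptions, and the JSON is inlined here instead of being read from the actual files.

```python
import csv
import io
import json


def mapping_to_csv(mapping, out):
    """Write one CSV row per (charge description, IUCR category) pair."""
    writer = csv.writer(out)
    writer.writerow(["chrgdesc", "iucr_category"])
    for chrgdesc in sorted(mapping):
        for category in mapping[chrgdesc]:
            writer.writerow([chrgdesc, category])


# Inline JSON in the same shape as the mapping files discussed in this thread.
mapping = json.loads(
    '{"ATT.FORGERY": ["Battery", "Burglary"], "FORGERY": ["Forgery"]}'
)
buffer = io.StringIO()
mapping_to_csv(mapping, buffer)
csv_text = buffer.getvalue()
```

Passing a file opened with `open(path, "w", newline="")` instead of the `StringIO` buffer would write the table to disk.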

@ghing
Contributor

ghing commented Sep 2, 2014

@bepetersn, ok. I'll take a look at the new multiples file.

@ghing
Contributor

ghing commented Nov 19, 2014

I've been doing some fixes to ILCS statute parsing and also looked through the duplicates and made mappings in this spreadsheet.

In many cases the mapping is genuinely ambiguous but we should be able to map them to our broader categories: violent, property, drug, index/nonindex, etc.
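
To illustrate the idea, an ambiguous description can still get a broad category whenever all of its candidate IUCR categories roll up to the same one. The roll-up table below is hypothetical and is not the one in the spreadsheet.

```python
# Hypothetical roll-up from IUCR category to a broader category.
BROAD_CATEGORY = {
    "Burglary": "property",
    "Theft": "property",
    "Battery": "violent",
    "Robbery": "violent",
}


def roll_up(categories, broad=BROAD_CATEGORY):
    """Return the broad category shared by all candidates, or None.

    None means the candidates disagree, or at least one of them has no
    entry in the roll-up table, so the charge stays ambiguous.
    """
    broad_cats = {broad.get(category) for category in categories}
    if len(broad_cats) == 1 and None not in broad_cats:
        return broad_cats.pop()
    return None
```

With this table, ["Burglary", "Theft"] resolves to "property", while ["Battery", "Burglary"] stays unresolved and would need a manual decision.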

ghing added a commit to ghing/cook-convictions-data that referenced this issue Dec 1, 2014
Add statute-based queries to group together violent, nonviolent,
affecting women and drug crimes (and index crimes within each)
to handle cases where an ILCS statute doesn't map to an IUCR code
or the mapping is ambiguous.

Addresses sc3/cook-convictions#83,
sc3#7,
sc3#6