While the reimbursement details we get from the CEAP dataset provide a lot of value for things like detecting outliers, there is no easy way to analyse the receipts provided by congresspeople alongside the data we get from the government. One way to "read" the information on the receipts is to use OCR, and one of the easiest and best ways to do that these days is to delegate the OCR processing to Google's Cloud Vision API.
As of this writing, we have a way to OCR reimbursements on demand by using serenata-toolbox and the code outlined below, which can be copied and pasted into a Jupyter notebook or Python script in case you want to OCR a set of reimbursement receipts.

To use the Cloud Vision API you need a Google API key, and you'll be charged after 1,000 requests in a month. The process is slow (but can be parallelized), so to make things simpler there has been an effort to provide a dataset with a reasonable number of receipts already OCRed and ready to be analysed:
- `2017-02-15-receipts-texts.xz`: CSV with the full text of each reimbursement receipt in a single string, keyed by the reimbursement's `document_id`.
- `2017-02-15-receipts-texts-raw.tar.xz`: Raw Cloud Vision API responses.
These datasets are made up of nearly 200,000 reimbursements across the following subquotas:
| Subquota | Receipts |
|----------|---------:|
| Aircraft renting or charter of aircraft | 589 |
| Congressperson meal | 56715 |
| Consultancy, research and technical work | 5082 |
| Flight tickets | 5010 |
| Fuels and lubricants | 64989 |
| Postal services | 4804 |
| Publicity of parliamentary activity | 14387 |
| Taxi, toll and parking | 41922 |
| Terrestrial, maritime and fluvial tickets | 1786 |
| Watercraft renting or charter | 69 |
For more information on how it was created, check the following links:
NOTE: This dataset was created using the `TEXT_DETECTION` feature of the API, but recently a feature called `DOCUMENT_TEXT_DETECTION` has been introduced (currently in beta) which might yield better results and is worth further investigation.
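If you want to OCR a receipt yourself, the request is an HTTP POST to the public `images:annotate` endpoint of the Vision API. Below is a minimal sketch (not the serenata-toolbox implementation) assuming you have an API key and the `requests` library installed; `ocr_receipt` is a hypothetical helper name:

```python
import base64

import requests  # assumed dependency: pip install requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder, use your own key
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY


def ocr_receipt(image_path):
    """Send a single receipt image to the Cloud Vision API and
    return the raw JSON response (hypothetical helper)."""
    with open(image_path, "rb") as image:
        content = base64.b64encode(image.read()).decode("utf-8")

    payload = {
        "requests": [{
            "image": {"content": content},
            "features": [{"type": "TEXT_DETECTION"}],
        }]
    }
    response = requests.post(ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()


# ocr_receipt("receipt.png")
```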
As mentioned above, `2017-02-15-receipts-texts.xz` provides a CSV made up of `document_id` and the reimbursement receipt text (stored in the `text` column). Combining it with reimbursements and using it in your notebooks is as easy as:
```python
import pandas as pd

# Make sure the data has been downloaded
from serenata_toolbox.datasets import fetch
fetch("2017-02-15-receipts-texts.xz", "../data")
fetch("2016-12-06-reimbursements.xz", "../data")

# Read the OCR dataframe
texts = pd.read_csv('../data/2017-02-15-receipts-texts.xz',
                    dtype={'text': str}, low_memory=False)

# OPTIONAL: normalize the strings to make them easier to work with
texts['text'] = texts.text.str.upper()

# Read the reimbursements data and keep 2015 onwards to reduce memory usage
reimbursements = pd.read_csv('../data/2016-12-06-reimbursements.xz', low_memory=False)
reimbursements = reimbursements.query('year >= 2015')

# "JOIN" the dataframes on the document id
data = texts.merge(reimbursements, on='document_id')
```
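Once merged, `data` is a regular pandas dataframe. As a hypothetical example (the `subquota_description`, `congressperson_name` and `total_net_value` column names are assumed from the reimbursements dataset), you could sanity-check the counts against the table above and grep the receipt texts for a keyword:

```python
# Count OCRed receipts per subquota; should roughly match the table above
print(data['subquota_description'].value_counts())

# Hypothetical example: meal reimbursements whose receipt mentions beer
# (the texts were uppercased above, so the pattern is uppercased too)
meals = data[data['subquota_description'] == 'Congressperson meal']
suspects = meals[meals['text'].str.contains('CERVEJA', na=False)]
print(suspects[['document_id', 'congressperson_name', 'total_net_value']])
```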
The data sent to the API is an image encoded in base64 (more info here) and the data returned is a JSON described here. The `2017-02-15-receipts-texts-raw.tar.xz` file found on Serenata de Amor's S3 bucket contains the raw JSON responses returned by the API, grouped by `document_id` and page number:
```json
{
  "1": [
    // The first element of the array is the full text of the receipt
    {
      "description": "... <FULL TEXT OF THE RECEIPT> ...",
      "boundingPoly": {
        "vertices": [
          // The rectangle where the API found the whole text
          { "y": 469, "x": 866 },
          { "y": 469, "x": 1753 },
          { "y": 1413, "x": 1753 },
          { "y": 1413, "x": 866 }
        ]
      },
      "locale": "pt-PT"
    },
    // What follows is each word found and the location where it was found
    {
      "description": "Restaurante",
      "boundingPoly": {
        "vertices": [
          // The rectangle where the API found the word
          { "y": 469, "x": 1009 },
          { "y": 478, "x": 1187 },
          { "y": 518, "x": 1185 },
          { "y": 509, "x": 1007 }
        ]
      }
    }
    // ... other words here ...
  ],
  "2": [ /* Same info for page 2 */ ]
}
```
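Here is a sketch of how you might iterate over those raw responses. Note that the internal layout of the archive (assumed here to be one `<document_id>.json` file per reimbursement) may differ, so treat this as a starting point:

```python
import json
import tarfile

# Hypothetical sketch: walk the archive and read one raw API response
with tarfile.open('../data/2017-02-15-receipts-texts-raw.tar.xz', 'r:xz') as archive:
    for member in archive:
        if not (member.isfile() and member.name.endswith('.json')):
            continue
        response = json.load(archive.extractfile(member))
        # On page "1", the first annotation holds the full text of the page
        full_text = response['1'][0]['description']
        print(member.name, full_text[:80])
        break  # remove this to process the whole archive
```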
Most of the time you'll only use the full text provided by `2017-02-15-receipts-texts.xz` but, for example, if you want to build some logic around analysing the text of specific regions of the receipt, you can do that with the `x` and `y` coordinates above, as in the sketch below.
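For instance, here is a hypothetical `words_in_region` helper that filters the per-word annotations shown above down to a rectangle of the page:

```python
def words_in_region(page_annotations, x_min, y_min, x_max, y_max):
    """Hypothetical helper: return the words whose bounding polygon falls
    entirely inside the given rectangle. The first annotation is the full
    page text, so it is skipped."""
    words = []
    for annotation in page_annotations[1:]:
        vertices = annotation['boundingPoly']['vertices']
        # The API may omit a coordinate when its value is 0
        xs = [vertex.get('x', 0) for vertex in vertices]
        ys = [vertex.get('y', 0) for vertex in vertices]
        if min(xs) >= x_min and max(xs) <= x_max \
                and min(ys) >= y_min and max(ys) <= y_max:
            words.append(annotation['description'])
    return words


# e.g. words in the top-right corner of page 1, where totals often appear
# print(words_in_region(response['1'], 800, 0, 2000, 600))
```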
There are many challenges in dealing with OCRed text, and for the receipts scanned and provided by the Chamber of Deputies in particular we need to keep the following in mind:
- Many receipts are filled in by hand, and unfortunately handwriting is even harder to parse than "computer text".
- Some receipts are not scanned carefully and end up rotated (sometimes even upside down).
- Some scans have very low quality and/or the receipt has faded, since many establishments print on thermal paper, making it harder for the computer to recognize the text.
In other words, there is room for (non-trivial) improvements to this OCR process (such as some image preprocessing), and there are things we can do ourselves to increase the chances of finding suspicious reimbursements with the provided PDFs.
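As one illustration of such preprocessing, here is a minimal sketch using Pillow (an assumed dependency, not part of serenata-toolbox) that converts a scan to grayscale and stretches its contrast before it is sent to the API:

```python
from PIL import Image, ImageOps  # assumes Pillow is installed


def preprocess(image_path, output_path):
    """Hypothetical preprocessing: grayscale plus contrast stretching can
    help with faded thermal-paper receipts. Fixing rotated scans would
    need a deskewing step, which is left out here."""
    image = Image.open(image_path).convert('L')  # grayscale
    image = ImageOps.autocontrast(image)         # stretch faded contrast
    image.save(output_path)


# preprocess('receipt.png', 'receipt-clean.png')
```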