The disclosure-backend-static
repo is the backend powering Open Disclosure California.
It was created in haste in the run-up to the 2016 election, and is thus engineered around a "get it done" philosophy. At that time, we had already designed an API and built (most of) a frontend; this repo was created to implement both as quickly as possible.
This project implements a basic ETL pipeline to download the Oakland Netfile data, download the human-curated CSV data for Oakland, and combine the two. The output is a directory of JSON files that mimics the existing API structure, so no client code changes are required.
- Ruby 3.3.4 (see version in `.ruby-version`)
Note: You do not need to run these commands to develop on the frontend. All you need to do is clone the repository adjacent to the frontend repo.
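For example, from the parent directory of your frontend checkout:

```sh
# run this next to (not inside) the frontend repo's directory
git clone https://github.com/caciviclab/disclosure-backend-static.git
```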
If you're planning to modify the backend code, follow these steps to set up all necessary development dependencies, including a new PostgreSQL database and Python 3:
- Ensure Homebrew is up to date by running: `brew update && brew upgrade`
- Install PostgreSQL 16 with Homebrew: `brew install postgresql@16`
- To start the database automatically at login, run: `brew services start postgresql@16`
- Verify Python 3 is installed on your system. If not, download it from python.org.
- Older versions of macOS ship Python 2 as the system `python`, so use `python3 -m pip` instead of `pip` to ensure Python 3 is used: `python3 -m pip install ...`
  - If your system's `pip` already points to Python 3, you can use `pip` directly: `pip install ...`
- Run:

```sh
sudo -H python3 -m pip install -r requirements.txt
gem install pg bundler
bundle install
```
This repository is set up to work in a container under Codespaces. In other words, you can start an environment that is already fully set up, without doing any of the installation steps required for a local environment. This can be used as a way to troubleshoot code before it is committed to the production pipeline. The following information may be helpful for getting started with Codespaces:
- Go to the page displaying the repository code: https://github.com/caciviclab/disclosure-backend-static
- Switch to the branch that you're interested in running and editing.
- Click the `Code` button and select the `Codespaces` tab in the dropdown
- Click the button to start a new codespace
- Wait for the container to be set up and for a terminal prompt at `/workspace` to appear in the web page; this will look familiar if you've worked with VS Code before
- In the terminal, you can start running the commands from the next section, such as `make download`
- A Postgres server is already running on localhost; simply type `psql` in the terminal to connect to it
- The `make import` command will populate the Postgres database
- Click the cylinder icon on the left to open the SQL Tools extension, which lets you connect to Postgres, view its schema, and query the data you imported
- The container has Python 3.9 and a current version of Ruby installed, along with all the required packages
- We will try to make the container match the Travis CI environment, including installing the same Ruby version
- This same setup lets anyone run the container locally in VS Code using the Dev Containers extension, but we are mainly focused on enabling Codespaces and hardening that setup first. (We'll add instructions for Dev Containers once this is solid.)
- If you make changes in your Codespaces environment, don't forget to push them after committing with Git. A codespace is like another machine, so your changes won't end up in the GitHub repository if you don't `git push` (see the example below)
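For reference, a typical Codespaces terminal session might look like the following sketch (it assumes `psql` picks up the container's preconfigured connection defaults):

```sh
make download && make import   # fetch the raw data and populate Postgres
psql -c '\dt'                  # list the imported tables
git add -A
git commit -m "Describe your change"
git push                       # without this, changes stay in the codespace
```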
This repository is also configured to run within a Docker container. This is similar to Codespaces, except you can use whatever IDE and local setup you prefer. Here's how to get started using Docker with VSCode:
- Download Docker and launch Docker Desktop.
- Version 4.32.0 (157355) has been verified to work with this repository.
- Install the VSCode Dev Containers extension.
- Version v0.375.1 has been confirmed to be compatible with this repository.
- Note: VSCode may suggest installing the Docker extension as well, but you only need the Dev Containers extension for this setup.
- In VSCode, navigate to `.devcontainer/Dockerfile` or restart VSCode. The Dev Containers extension should prompt you to reopen the repository in a container; proceed with this option.
- Once the container environment is loaded, complete the setup by following the commands below.
Download the raw data files. You only need to run this once in a while to get the latest data.
$ make download
Import the data into the database for easier processing. You only need to run this after you've downloaded new data.
$ make import
Run the calculators. Everything is output into the `build` folder.
$ make process
Optionally, reindex the build outputs into Algolia. (Reindexing requires the `ALGOLIASEARCH_APPLICATION_ID` and `ALGOLIASEARCH_API_KEY` environment variables.)
$ make reindex
If you want to serve the static JSON files via a local web server:
$ make run
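Putting it all together, a typical end-to-end run chains the targets above (make runs multiple targets in order):

```sh
make download   # fetch the latest raw data
make import     # load it into Postgres
make process    # run the calculators; output lands in build/
make run        # serve the static JSON locally
```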
When `make import` is run, a number of Postgres tables are created for importing the downloaded data. The schemas of these tables are explicitly defined in the `dbschema` directory and may have to be updated to accommodate future data. In particular, columns that hold string data may not be sized large enough. For example, if a name column accepts names of at most 20 characters and we later receive data where a name is 21 characters long, the import will fail. When this occurs, update the corresponding schema file in `dbschema` to allow more characters, then re-run `make import` to verify that it succeeds.
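As a hypothetical illustration (the file and column names here are made up, not taken from the actual schema), fixing such a failure looks like this:

```sh
# Suppose dbschema/some_table.sql declares a column as:
#   "Some_Name" character varying(20)
# Edit the file to widen it, e.g. to character varying(50), then re-run:
make import
```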
This repository is used to generate the data files consumed by the website. After `make process` is run, a `build` directory is generated containing those data files. This directory is checked in to the repository and later checked out when generating the website. After making code changes, it is important to compare the newly generated `build` directory against the `build` directory from before the changes, and to verify that the differences are as expected.
Because a strict comparison of all contents of the `build` directory will always include changes that occur independently of any code change, every developer would need to know about these expected changes to perform this check. To remove that need, a dedicated script, `bin/create-digests.py`, generates digests for the JSON data in the `build` directory after excluding those expected changes. To look for meaningful changes, simply look for a change in the `build/digests.json` file.
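For example, after re-running the pipeline, a quick way to check for meaningful differences (assuming the previous digests are committed) is:

```sh
make process
git diff build/digests.json   # an empty diff means no meaningful changes
```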
Currently, these are the expected changes that occur independent of any code change:
- timestamps change for each run
- top contributors lists contain undefined ordering of contributors with the same contribution
- top spenders lists contain undefined ordering of spenders with the same spending
- rounding differences for floats
The expected changes are excluded before digests are generated for the data in the `build` directory. The logic for this lives in the `clean_data` function in `bin/create-digests.py`. Once the code is modified so that an expected change no longer occurs, its exclusion can be removed from `clean_data`. For example, the rounding of floats is not consistent across runs of `make process`, due to differences in the environment. When the code is fixed so that floats round the same way whenever the data hasn't changed, the `round_float` call in `clean_data` can be removed.
An additional script, `bin/report-candidates.py`, generates a report for comparing candidate totals. It produces `build/candidates.csv` and `build/candidates.xlsx`, which list all the candidates alongside totals calculated in multiple ways that should add up to the same number.
To ensure that database schema changes are visible in pull requests, the complete Postgres schema is also saved to a `schema.sql` file in the `build` directory. Because the `build` directory is automatically re-built for each branch in a PR and committed to the repository, any schema change caused by a code change will show up as a difference in the `schema.sql` file when reviewing the PR.
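A schema-only dump along these lines would produce such a file (a sketch; the database name and exact flags used by the Makefile are assumptions):

```sh
pg_dump --schema-only --no-owner disclosure_backend > build/schema.sql
```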
Each metric about a candidate is calculated independently. A metric might be something like "total contributions received" or something more complex like "percentage of contributions that are less than $100".
When adding a new calculation, a good place to start is the official Form 460. Is the data you are looking for reported on that form? If so, you will probably find it in your database after the import process. There are also a couple of other forms that we import, like Form 496. (These are the names of the files in the `input` directory. Check those out.)
Each schedule of each form is imported into a separate Postgres table. For example, Schedule A of Form 460 is imported into the `A-Contributions` table.
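For instance, you can poke at that table straight from the terminal (a sketch; `Filer_ID` comes from the imported data, and the rest of the query is illustrative):

```sh
psql -c 'SELECT "Filer_ID", COUNT(*) FROM "A-Contributions" GROUP BY "Filer_ID" LIMIT 5;'
```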
Now that you have a way of querying the data, you should come up with a SQL query that calculates the value you are trying to get. Once you can express your calculation as SQL, put it in a calculator file like so:
- Create a new file named `calculators/[your_thing]_calculator.rb`
- Here is some boilerplate for that file:
```ruby
# the name of this class _must_ match the filename of this file, i.e. end
# with "Calculator" if the file ends with "_calculator.rb"
class YourThingCalculator
  def initialize(candidates: [], ballot_measures: [], committees: [])
    @candidates = candidates
    @candidates_by_filer_id = @candidates.where('"FPPC" IS NOT NULL')
      .index_by { |candidate| candidate['FPPC'] }
  end

  def fetch
    @results = ActiveRecord::Base.connection.execute(<<-SQL)
      -- your sql query here
    SQL

    @results.each do |row|
      # make sure Filer_ID is returned as a column by your query!
      candidate = @candidates_by_filer_id[row['Filer_ID'].to_i]

      # change this!
      candidate.save_calculation(:your_thing, row[column_with_your_desired_data])
    end
  end
end
```
- You will want to fill in the SQL query and make sure that the query selects the `Filer_ID` column.
- Make sure to update the call to `candidate.save_calculation`. That method will serialize its second argument as JSON, so it can store any kind of data.
- Your calculation can be retrieved with `candidate.calculation(:your_thing)`. You will want to add this into an API response in the `process.rb` file (see the sketch below).
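Once the calculator is wired into `process.rb`, a quick end-to-end sanity check is to rebuild and look for your metric in the output (a sketch; it assumes your metric name appears verbatim in the generated JSON):

```sh
make process
grep -r "your_thing" build/ | head
```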
This is how the data flows through the back end: finance data is pulled from Netfile and supplemented by a Google Sheet mapping filer IDs to ballot information such as candidate names, offices, and ballot measures. Once the data is filtered, aggregated, and transformed, the front end consumes it and builds the static HTML front end.
During `bundle install`
error: use of undeclared identifier 'LZMA_OK'
Try:

```sh
brew unlink xz
bundle install
brew link xz
```
During `make download`
wget: command not found
Run `brew install wget`.
During `make import`
There appears to be a problem on Macs with Apple silicon.
ImportError: You don't appear to have the necessary database backend installed for connection string you're trying to use. Available backends include:
PostgreSQL: pip install psycopg2
Try the following:

```sh
pip uninstall psycopg2-binary
pip install psycopg2-binary --no-cache-dir
```
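After reinstalling, you can confirm the module imports cleanly:

```sh
python3 -c "import psycopg2; print(psycopg2.__version__)"
```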