Automate schedule downloads #61
New workflow file (`@@ -0,0 +1,30 @@`):

```yaml
name: Automated job

on:
  push:
    branches:
      - 'automate-schedule-downloads'

  schedule:
    # Run every day at 17:30 UTC, which is 12:30pm Chicago time while
    # daylight saving (CDT) is in effect; GitHub cron is always UTC
    - cron: 30 17 * * *

jobs:
  download-cta-schedule-data:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Download and save CTA schedule data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt
          python -m scrape_data.cta_schedule_versions $AWS_ACCESS_KEY_ID $AWS_SECRET_ACCESS_KEY
```
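One thing to keep in mind with the schedule: GitHub cron runs in UTC, and Chicago's UTC offset changes with daylight saving, so a fixed `30 17 * * *` lands at 12:30pm local time only part of the year. A quick sanity check with pendulum (already a project dependency; the specific dates below are arbitrary examples) illustrates the drift:

```python
import pendulum

# 17:30 UTC falls at 12:30pm in Chicago during daylight saving (CDT, UTC-5)...
summer_run = pendulum.datetime(2022, 7, 1, 17, 30, tz='UTC')
print(summer_run.in_timezone('America/Chicago'))  # 2022-07-01 12:30:00 -05:00

# ...but at 11:30am once Chicago is back on standard time (CST, UTC-6)
winter_run = pendulum.datetime(2022, 12, 1, 17, 30, tz='UTC')
print(winter_run.in_timezone('America/Chicago'))  # 2022-12-01 11:30:00 -06:00
```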
New script, invoked by the workflow as `scrape_data.cta_schedule_versions` (`@@ -0,0 +1,59 @@`):
```python
import sys
from io import StringIO

import boto3
import pendulum

import data_analysis.static_gtfs_analysis as sga

ACCESS_KEY = sys.argv[1]
SECRET_KEY = sys.argv[2]

client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

s3 = boto3.resource(
    's3',
    region_name='us-east-1',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY
)

date = pendulum.now().to_date_string()

zipfile, zipfile_bytes_io = sga.download_cta_zip()
print(f'Saving zipfile available at '
      f'https://www.transitchicago.com/downloads/sch_data/google_transit.zip '
      f'on {date} to public bucket')
zipfile_bytes_io.seek(0)
client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public', f'google_transit_{date}.zip')

data = sga.download_extract_format()
trip_summary = sga.make_trip_summary(data)

route_daily_summary = sga.summarize_date_rt(trip_summary)

csv_buffer = StringIO()
route_daily_summary.to_csv(csv_buffer)

print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
s3.Object('chn-ghost-buses-public', f'cta_route_daily_summary_{date}.csv')\
    .put(Body=csv_buffer.getvalue())

# https://stackoverflow.com/questions/30249069/listing-contents-of-a-bucket-with-boto3
print('Confirm that objects exist in bucket')
s3_paginator = client.get_paginator('list_objects_v2')


def keys(bucket_name, prefix='/', delimiter='/', start_after=''):
    prefix = prefix.lstrip(delimiter)
    start_after = (start_after or prefix) if prefix.endswith(delimiter) else start_after
    for page in s3_paginator.paginate(Bucket=bucket_name, Prefix=prefix, StartAfter=start_after):
        for content in page.get('Contents', ()):
            if content['Key'] in [f'cta_route_daily_summary_{date}.csv', f'google_transit_{date}.zip']:
                print(f"{content['Key']} exists")


keys('chn-ghost-buses-public')
```
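As a side note on the confirmation step: the paginated listing works, but since the script only checks two known keys, a direct `head_object` call would avoid walking the whole bucket listing. A minimal sketch of that alternative (not part of this PR; `object_exists` is a hypothetical helper name):

```python
import boto3
from botocore.exceptions import ClientError


def object_exists(client, bucket: str, key: str) -> bool:
    """Check a single key directly instead of paginating the bucket listing."""
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        # Raised with a 404 code when the key is missing (or 403 without access)
        return False


# Usage sketch, reusing the client and date defined in the script above:
# for key in (f'google_transit_{date}.zip', f'cta_route_daily_summary_{date}.csv'):
#     print(key, 'exists' if object_exists(client, 'chn-ghost-buses-public', key) else 'missing')
```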
Review comment (on the `google_transit_{date}.zip` upload): Can we add another directory level in this path? Right now the zipfiles are just being written to the root directory, which doesn't pose a technical problem but will probably get a bit messy quickly. Maybe something like `f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'`?

Reply: Good point, I can add a new directory.

Review comment (on the `cta_route_daily_summary_{date}.csv` upload): Same comment as above regarding an intermediate directory level for this path, and this one is a bit trickier since we already generate these files from the current batched process (currently in […]). Maybe […]. Another question here would be whether we want to only save that day's activity (i.e., […])?

Reply: Saving only that date would probably be good since we could concatenate the data later.
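For concreteness, a small sketch of what the revised object keys might look like. The `cta_schedule_zipfiles_raw` prefix is quoted verbatim from the review comment; the CSV prefix is purely an assumption on my part, since the thread leaves that path open:

```python
import pendulum

date = pendulum.now().to_date_string()

# Prefix suggested in the review comment for the zipfile upload:
zip_key = f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'

# A parallel prefix for the CSV is assumed here; the thread notes this one is
# trickier because the existing batched process already writes these files.
csv_key = f'cta_route_daily_summaries_raw/cta_route_daily_summary_{date}.csv'

# The upload calls in the script would then become, e.g.:
# client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public', zip_key)
# s3.Object('chn-ghost-buses-public', csv_key).put(Body=csv_buffer.getvalue())
print(zip_key)
print(csv_key)
```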