Automate schedule downloads #61
Conversation
Thank you so much Dylan this looks great! Just two comments about organization/file sizes in the bucket.
scrape_data/cta_schedule_versions.py
Outdated
f'https://www.transitchicago.com/downloads/sch_data/google_transit.zip '
f'on {date} to public bucket')
zipfile_bytes_io.seek(0)
client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public', f'google_transit_{date}.zip')
Can we add another directory level in this path? Right now the zipfiles are just being written to the root directory, which doesn't pose a technical problem but will probably get a bit messy quickly. Maybe something like f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'?
Good point, I can add a new directory.
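A minimal sketch of what that change might look like, assuming the same boto3 client already used in the script and the cta_schedule_zipfiles_raw/ prefix suggested above (the helper name is illustrative, not committed code):

```python
import boto3

client = boto3.client('s3')

def upload_schedule_zip(zipfile_bytes_io, date):
    """Upload the raw GTFS zip under a dedicated prefix instead of the bucket root."""
    zipfile_bytes_io.seek(0)
    client.upload_fileobj(
        zipfile_bytes_io,
        'chn-ghost-buses-public',
        # hypothetical prefix taken from the review comment above
        f'cta_schedule_zipfiles_raw/google_transit_{date}.zip',
    )
```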
scrape_data/cta_schedule_versions.py
Outdated
route_daily_summary.to_csv(csv_buffer)

print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
s3.Object('chn-ghost-buses-public', f'cta_route_daily_summary_{date}.csv')\
Same comment as above regarding an intermediate directory level for this path, and this one is a bit trickier since we already generate these files from the current batched process (currently in schedule_summaries/route_level)... We probably will need to figure out how to do a cutover from the old process to the new one.
Maybe f'schedule_summaries/daily_job/'?
Another question here would be whether we want to only save that day's activity (i.e., route_daily_summary[route_daily_summary.date == {date}]), because otherwise these files are pretty big to just save literally every day.
Saving only that date would probably be good since we could concatenate the data later.
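A sketch of how the date filter and an intermediate prefix could fit together, assuming route_daily_summary is a pandas DataFrame with a date column and that the schedule_summaries/daily_job/ prefix floated above is the one adopted (names and the prefix are assumptions, not the committed code):

```python
from io import StringIO

import boto3

s3 = boto3.resource('s3')

def save_daily_summary(route_daily_summary, date):
    # Keep only the rows for the day being processed so the daily file stays small.
    daily = route_daily_summary[route_daily_summary.date == date]

    csv_buffer = StringIO()
    daily.to_csv(csv_buffer)

    print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
    s3.Object(
        'chn-ghost-buses-public',
        # hypothetical prefix pending the cutover discussed above
        f'schedule_summaries/daily_job/cta_route_daily_summary_{date}.csv',
    ).put(Body=csv_buffer.getvalue())
```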
Note from discussion 8/1: Maybe make two GitHub Actions, one that only downloads and a second that creates the processed route trip count file based on the downloaded file.
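One way the two-job split could be sketched, assuming a requests-based download step and the prefixes discussed above; the function names, S3 keys, and the trips.txt-based count are all illustrative assumptions, not what the PR implements:

```python
import zipfile
from io import BytesIO, StringIO

import boto3
import pandas as pd
import requests

BUCKET = 'chn-ghost-buses-public'
CTA_GTFS_URL = 'https://www.transitchicago.com/downloads/sch_data/google_transit.zip'

s3 = boto3.client('s3')

def download_job(date):
    """First action: download the raw zip and store it unmodified."""
    key = f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
    response = requests.get(CTA_GTFS_URL)
    response.raise_for_status()
    s3.upload_fileobj(BytesIO(response.content), BUCKET, key)
    return key

def process_job(raw_key, date):
    """Second action: build a route trip count file from the previously stored zip."""
    buffer = BytesIO()
    s3.download_fileobj(BUCKET, raw_key, buffer)
    buffer.seek(0)
    # Hypothetical processing: count scheduled trips per route from the GTFS trips.txt.
    with zipfile.ZipFile(buffer) as z, z.open('trips.txt') as f:
        trips = pd.read_csv(f)
    trip_counts = trips.groupby('route_id').size().rename('trip_count').reset_index()

    csv_buffer = StringIO()
    trip_counts.to_csv(csv_buffer, index=False)
    s3.put_object(
        Bucket=BUCKET,
        Key=f'schedule_summaries/daily_job/route_trip_counts_{date}.csv',
        Body=csv_buffer.getvalue(),
    )
```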
This reverts commit 3817614.
…e-downloads Automate schedule downloads
Description
Create a GitHub Action to download the schedule data from the CTA and save it to S3. The workflow will be run every day at 5:30pm UTC or on a push to the automate-schedule-downloads branch. The push event trigger can be removed once the PR has been approved.
Resolves #18, working toward #50.
Type of change
How has this been tested?
Locally