Automate schedule downloads #61
Conversation
Thank you so much Dylan this looks great! Just two comments about organization/file sizes in the bucket.
scrape_data/cta_schedule_versions.py
Outdated
f'https://www.transitchicago.com/downloads/sch_data/google_transit.zip '
f'on {date} to public bucket')
zipfile_bytes_io.seek(0)
client.upload_fileobj(zipfile_bytes_io, 'chn-ghost-buses-public', f'google_transit_{date}.zip')
Can we add another directory level in this path? Right now the zipfiles are just being written to the root directory, which doesn't pose a technical problem but will probably get a bit messy quickly. Maybe something like f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'?
Good point, I can add a new directory.
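A minimal sketch of what that change might look like, assuming the same boto3 client already used in the script and the cta_schedule_zipfiles_raw/ prefix suggested above (the helper name is illustrative, not committed code):

```python
import boto3

client = boto3.client('s3')

def upload_schedule_zip(zipfile_bytes_io, date):
    """Upload the raw GTFS zip under a dedicated prefix instead of the bucket root."""
    zipfile_bytes_io.seek(0)
    client.upload_fileobj(
        zipfile_bytes_io,
        'chn-ghost-buses-public',
        # hypothetical prefix taken from the review comment above
        f'cta_schedule_zipfiles_raw/google_transit_{date}.zip',
    )
```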
scrape_data/cta_schedule_versions.py
Outdated
route_daily_summary.to_csv(csv_buffer)

print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
s3.Object('chn-ghost-buses-public', f'cta_route_daily_summary_{date}.csv')\
Same comment as above regarding an intermediate directory level for this path, and this one is a bit trickier since we already generate these files from the current batched process (currently in schedule_summaries/route_level)... We probably will need to figure out how to do a cutover from the old process to the new one.
Maybe f'schedule_summaries/daily_job/'?
Another question here would be whether we want to only save that day's activity (i.e., route_daily_summary[route_daily_summary.date == {date}]), because otherwise these files are pretty big to just save literally every day.
Saving only that date would probably be good since we could concatenate the data later.
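A sketch of how the date filter and an intermediate prefix could fit together, assuming route_daily_summary is a pandas DataFrame with a date column and that the schedule_summaries/daily_job/ prefix floated above is the one adopted (names and the prefix are assumptions, not the committed code):

```python
from io import StringIO

import boto3

s3 = boto3.resource('s3')

def save_daily_summary(route_daily_summary, date):
    # Keep only the rows for the day being processed so the daily file stays small.
    daily = route_daily_summary[route_daily_summary.date == date]

    csv_buffer = StringIO()
    daily.to_csv(csv_buffer)

    print(f'Saving cta_route_daily_summary_{date}.csv to public bucket')
    s3.Object(
        'chn-ghost-buses-public',
        # hypothetical prefix pending the cutover discussed above
        f'schedule_summaries/daily_job/cta_route_daily_summary_{date}.csv',
    ).put(Body=csv_buffer.getvalue())
```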
Note from discussion 8/1: Maybe make two GitHub Actions, one that only downloads and a second that creates the processed route trip count file based on the downloaded file.
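One way the two-job split could be sketched, assuming a requests-based download step and the prefixes discussed above; the function names, S3 keys, and the trips.txt-based count are all illustrative assumptions, not what the PR implements:

```python
import zipfile
from io import BytesIO, StringIO

import boto3
import pandas as pd
import requests

BUCKET = 'chn-ghost-buses-public'
CTA_GTFS_URL = 'https://www.transitchicago.com/downloads/sch_data/google_transit.zip'

s3 = boto3.client('s3')

def download_job(date):
    """First action: download the raw zip and store it unmodified."""
    key = f'cta_schedule_zipfiles_raw/google_transit_{date}.zip'
    response = requests.get(CTA_GTFS_URL)
    response.raise_for_status()
    s3.upload_fileobj(BytesIO(response.content), BUCKET, key)
    return key

def process_job(raw_key, date):
    """Second action: build a route trip count file from the previously stored zip."""
    buffer = BytesIO()
    s3.download_fileobj(BUCKET, raw_key, buffer)
    buffer.seek(0)
    # Hypothetical processing: count scheduled trips per route from the GTFS trips.txt.
    with zipfile.ZipFile(buffer) as z, z.open('trips.txt') as f:
        trips = pd.read_csv(f)
    trip_counts = trips.groupby('route_id').size().rename('trip_count').reset_index()

    csv_buffer = StringIO()
    trip_counts.to_csv(csv_buffer, index=False)
    s3.put_object(
        Bucket=BUCKET,
        Key=f'schedule_summaries/daily_job/route_trip_counts_{date}.csv',
        Body=csv_buffer.getvalue(),
    )
```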
This reverts commit 3817614.
…e-downloads Automate schedule downloads
Description
Create a GitHub Action to download the schedule data from the CTA and save it to S3. The workflow will be run every day at 5:30pm UTC or on a push to the automate-schedule-downloads branch. The push event trigger can be removed once the PR has been approved.
Resolves #18, working toward #50.
Type of change
How has this been tested?
Locally