feat: refactor to use cheerio instead of Puppeteer #149

Open
wants to merge 11 commits into
base: master

francojreyes
Contributor

@francojreyes francojreyes commented Apr 9, 2024

The timetable scraper is notoriously slow and memory hungry. A large part of this is because it uses Puppeteer to power the scraper. This PR aims to fix this by using cheerio instead (we did the same for Freerooms and managed to get our scraper down from 40s to 3s).

Motivation:
Puppeteer is most useful when content on the page is dynamically loaded by scripts on the page (e.g. data fetching). This is because it works by spinning up Chromium browser tabs to simulate actual page loading in a browser runtime, but of course this uses a lot of memory.

In the case of the timetable site, there is no dynamic content, so there is no need to use a browser. All we need is the raw HTML returned by the GET request. This is what cheerio works with: it parses the HTML document into a structure that can be queried with selectors, and it uses a lot less memory and time to do so.
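As a rough illustration of the approach described above (not the actual code in this PR), a minimal sketch might look like the following. The helper names `fetchHtml` and `extractRows` and the `tr`/`td` selectors are assumptions made for the example; the real scraper's selectors depend on the timetable site's markup. It also assumes Node 18+ for the global `fetch`.

```typescript
import * as cheerio from "cheerio";

// Hypothetical helper: fetch the raw HTML of a page with a plain GET request.
// No browser is launched; scripts on the page are never executed.
async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`GET ${url} failed with status ${res.status}`);
  }
  return res.text();
}

// Hypothetical helper: parse an HTML string with cheerio and return the
// trimmed text of every <td> cell, grouped by table row.
// Rows without any <td> cells (e.g. header rows) are skipped.
function extractRows(html: string): string[][] {
  const $ = cheerio.load(html);
  const rows: string[][] = [];
  $("tr").each((_, tr) => {
    const cells = $(tr)
      .find("td")
      .map((_, td) => $(td).text().trim())
      .get();
    if (cells.length > 0) {
      rows.push(cells);
    }
  });
  return rows;
}
```

Keeping the fetch and the parse as separate functions means the parsing logic can be exercised against saved HTML without touching the network.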

Results:

  • On my machine, the scraper takes 3.5 minutes to run - the cheerio refactor takes 35 seconds
  • I haven't measured CPU/memory usage, but it should be lower since no browser is launched
  • I diff'd the response of https://timetable.csesoc.app/internal/dump against the output of the refactored scraper - the only difference is "and" is replaced with "&" sometimes
  • Occasionally a request fails at random with "ENOTFOUND" or a 404 for pages that definitely exist - I don't know why, it's probably a rate limit, but it seems to work if you just rerun it
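Since rerunning seems to get past the intermittent failures, one possible mitigation is to retry failed requests automatically. The sketch below is an assumption about how that could be done, not something this PR implements. `withRetry` and its parameters are hypothetical names, and the cause of the failures (rate limiting or otherwise) is unconfirmed.

```typescript
// Hypothetical helper: run an async operation, retrying on any thrown error
// with exponential backoff (baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, ...).
// After maxAttempts failed attempts, the last error is rethrown.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait before the next attempt, doubling the delay each time.
        const delayMs = baseDelayMs * 2 ** (attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

A page fetch could then be wrapped as `withRetry(() => fetchPage(url))`, where `fetchPage` stands for whatever function performs the GET request. If the failures really are caused by a rate limit, limiting the number of concurrent requests may help more than retrying.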
