feat: refactor to use cheerio instead of Puppeteer #149
The timetable scraper is notoriously slow and memory-hungry, in large part because it uses Puppeteer to power the scraper. This PR aims to fix this by using cheerio instead (we did the same for Freerooms and managed to get our scraper down from 40s to 3s).
Motivation:
Puppeteer is most useful when content on the page is loaded dynamically by scripts on the page (e.g. data fetching). This is because it works by spinning up Chromium browser tabs to simulate actual page loading in a browser runtime, which of course uses a lot of memory.
In the case of the timetable site, there is no dynamic content, so there is no need to use a browser. All we need is the raw HTML returned by the GET request. This is what cheerio works with: it parses and indexes the HTML document, and it uses a lot less memory and time to do so.
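To illustrate the approach, here is a minimal sketch of fetching a page with a plain GET request and querying it with cheerio. It is not the code in this PR: the function names `parseRows` and `scrapePage` and the `table tr` / `td` selectors are hypothetical placeholders, not the timetable site's real markup, and it assumes a runtime with a global `fetch` (e.g. Node 18+).

```typescript
import * as cheerio from "cheerio";

// Hypothetical helper: collects the trimmed text of every <td>, row by row.
// The selectors are placeholders, not the timetable site's actual structure.
export function parseRows(html: string): string[][] {
  // cheerio parses the raw HTML string; no browser or page scripts are run.
  const $ = cheerio.load(html);
  const rows: string[][] = [];
  $("table tr").each((_, tr) => {
    const cells = $(tr)
      .find("td")
      .map((_, td) => $(td).text().trim())
      .get();
    // Rows without <td> cells (e.g. header rows) are skipped.
    if (cells.length > 0) rows.push(cells);
  });
  return rows;
}

// Hypothetical entry point: a single GET request, then parse the response body.
export async function scrapePage(url: string): Promise<string[][]> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`GET ${url} failed with status ${res.status}`);
  return parseRows(await res.text());
}
```

Keeping the parsing in a function that takes an HTML string means it can be exercised on saved pages without any network access.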
Results:
diff'd the response of https://timetable.csesoc.app/internal/dump against the output of the refactored scraper. The only difference is that "and" is sometimes replaced with "&".