
Crawling #140

Open · psrpinto opened this issue Dec 5, 2024 · 3 comments

psrpinto commented Dec 5, 2024

Crawling is the operation that automatically imports all subjects of a given type, from a "site definition" that has previously been created by the user.

psrpinto added this to the MVP milestone Dec 5, 2024

psrpinto commented Dec 5, 2024

@ashfame has started working on this at #131


psrpinto commented Dec 5, 2024

I think we should look at Crawling as two separate stages:

  1. Discovery: Discovery of URLs
  2. Ingestion: Ingestion of chunks of URLs obtained in stage 1

I can think of a variety of methods to discover URLs:

  • Manual input by the user
  • Parsing a sitemap
  • Following the site's navigation
  • etc.

The output of the first stage would be a list of URLs:

https://example.com/blog/foo
https://example.com/blog/bar
https://example.com/blog/baz

... which would be the input of the second stage.

This two-stage solution would make it possible to re-run ingestion without re-running discovery.
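
A minimal sketch of what these two stages could look like in TypeScript, assuming sitemap-based discovery and a hypothetical importSubject() helper; the function names and chunking strategy are illustrative assumptions, not actual code from #131:

```ts
// Stage 1 (Discovery): produce a flat list of URLs.
// Sitemap parsing is just one of the discovery methods listed above.
async function discoverFromSitemap(sitemapUrl: string): Promise<string[]> {
  const res = await fetch(sitemapUrl);
  const xml = new DOMParser().parseFromString(await res.text(), "text/xml");
  return Array.from(xml.querySelectorAll("url > loc"))
    .map((loc) => loc.textContent?.trim() ?? "")
    .filter(Boolean);
}

// Stage 2 (Ingestion): consume the discovered URLs in chunks, so ingestion
// can be re-run without re-running discovery.
async function ingest(urls: string[], chunkSize = 10): Promise<void> {
  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    await Promise.all(chunk.map(importSubject));
  }
}

// Hypothetical single-URL importer; the real import logic is out of scope here.
async function importSubject(url: string): Promise<void> {
  const html = await (await fetch(url)).text();
  // ...parse with DOMParser and hand off to the WP backend...
}
```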


ashfame commented Dec 6, 2024

Crawling is the operation that automatically imports all subjects of a given type, from a "site definition" that has previously been created by the user.

I would suggest a couple of changes to how we are defining it.

  1. What we call "blueprint" in code and "site definition" here is actually the definition/template/mould for a specific subject type, so perhaps we should finalize its name as something like SubjectDef, SubjectTemplate, SubjectMould, or SubjectFootprint?

  2. Crawling is not tied to a subject, since we don't have a means of selectively crawling pages that belong to a specific subject type. Additionally, we would crawl only once and then figure out which SubjectDef a page matches. Hence, we should define crawling as just the act of fetching pages for processing, with the side effect of discovering more URLs to fetch. Processing is where all the logic resides (see the sketch after this list).
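
As a rough illustration of that split, here is a hedged sketch of the "processing" step, assuming a hypothetical SubjectDef shape with a CSS selector; neither the interface nor the matching strategy comes from the actual codebase:

```ts
// Hypothetical SubjectDef shape: a name plus a CSS selector that identifies
// pages of that subject type. Purely an assumption for illustration.
interface SubjectDef {
  name: string;
  matchSelector: string;
}

// Processing: crawling only fetched the raw HTML; here we figure out, after
// the fact, which SubjectDef (if any) the page matches.
function matchSubjectDef(html: string, defs: SubjectDef[]): SubjectDef | null {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return defs.find((def) => doc.querySelector(def.matchSelector) !== null) ?? null;
}
```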

Some more notes on crawling:

The starting point (first URL) would always be the main/home page of the website. Additionally, we can have the user provide URLs, which are added to the crawling queue and can lead to the discovery of more URLs. This can happen right on the screen where crawling has begun and we show progress/stats.


It would be simpler to have the user define all SubjectDefs before crawling begins. For future use cases where a new SubjectDef is defined after crawling, we can either reuse the findings of the previous crawl (this would require us to store raw HTML irrespective of whether a given HTML page was liberated, i.e. whether we found a matching SubjectDef for it) or crawl again (which requires making crawling idempotent).

The mental model I was working with in #131:

Crawling happens on the frontend, with WP as its storage mechanism.

  • HTML is best parsed in the browser using DOMParser
  • Two API endpoints are defined:
    • Get next URL to crawl
    • Queue URLs (also marks a successful crawl for the page)
  • Rate limiting
    • 1 req/s by default
    • The user can increase or decrease it via the UI
    • Handle 429s to slow down automatically (a rough version of this loop is sketched below)
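
A hedged sketch of that loop, assuming two hypothetical WP REST routes (/wp-json/crawler/v1/next and /wp-json/crawler/v1/queue); the route names, payload shapes, and link extraction are assumptions, not the endpoints actually defined in #131:

```ts
const API = "/wp-json/crawler/v1"; // hypothetical REST namespace
let delayMs = 1000; // 1 req/s by default; the UI could adjust this

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlLoop(): Promise<void> {
  while (true) {
    // Endpoint 1: ask WP for the next URL to crawl; stop when the queue is empty.
    const next = await fetch(`${API}/next`).then((r) => r.json());
    if (!next?.url) break;

    const res = await fetch(next.url);
    if (res.status === 429) {
      delayMs *= 2; // back off automatically when the site rate-limits us
      await sleep(delayMs);
      continue;
    }

    // Parse in the browser and collect same-origin links for the queue.
    const doc = new DOMParser().parseFromString(await res.text(), "text/html");
    const origin = new URL(next.url).origin;
    const links = Array.from(doc.querySelectorAll("a[href]"))
      .map((a) => new URL(a.getAttribute("href")!, next.url).href)
      .filter((href) => new URL(href).origin === origin);

    // Endpoint 2: queue discovered URLs; this also marks next.url as crawled.
    await fetch(`${API}/queue`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ crawled: next.url, discovered: links }),
    });

    await sleep(delayMs);
  }
}
```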

And since this is most likely not going to be blazingly fast, and will actually be pretty slow to begin with, considering both the rate limiting and our ingestion speed on the WP side, I think we have an opportunity to show a realtime preview of the pages being handled. We don't have to do this, but I personally would like to see it in the UI.
