
Crawling #140

Open · psrpinto opened this issue Dec 5, 2024 · 3 comments

psrpinto commented Dec 5, 2024

Crawling is the operation that automatically imports all subjects of a given type, from a "site definition" that has previously been created by the user.

psrpinto added this to the MVP milestone Dec 5, 2024

psrpinto commented Dec 5, 2024

@ashfame has started working on this at #131


psrpinto commented Dec 5, 2024

I think we should look at Crawling as two separate stages:

  1. Discovery: Discovery of URLs
  2. Ingestion: Ingestion of chunks of URLs obtained in stage 1

I can think of a variety of methods to discover URLs:

  • Manual input by the user
  • Parsing a sitemap
  • Following the site's navigation
  • etc.

The output of the first stage would be a list of URLs:

https://example.com/blog/foo
https://example.com/blog/bar
https://example.com/blog/baz

... which would be the input of the second stage.

This two-stage solution would make it possible to re-run ingestion without re-running discovery.
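
A minimal sketch of what these two stages could look like in TypeScript, assuming sitemap-based discovery and a hypothetical importSubject() helper; the function names and chunking strategy are illustrative assumptions, not actual code from #131:

```ts
// Stage 1 (Discovery): produce a flat list of URLs.
// Sitemap parsing is just one of the discovery methods listed above.
async function discoverFromSitemap(sitemapUrl: string): Promise<string[]> {
  const res = await fetch(sitemapUrl);
  const xml = new DOMParser().parseFromString(await res.text(), "text/xml");
  return Array.from(xml.querySelectorAll("url > loc"))
    .map((loc) => loc.textContent?.trim() ?? "")
    .filter(Boolean);
}

// Stage 2 (Ingestion): consume the discovered URLs in chunks, so ingestion
// can be re-run without re-running discovery.
async function ingest(urls: string[], chunkSize = 10): Promise<void> {
  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    await Promise.all(chunk.map(importSubject));
  }
}

// Hypothetical single-URL importer; the real import logic is out of scope here.
async function importSubject(url: string): Promise<void> {
  const html = await (await fetch(url)).text();
  // ...parse with DOMParser and hand off to the WP backend...
}
```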


ashfame commented Dec 6, 2024

Crawling is the operation that automatically imports all subjects of a given type, from a "site definition" that has previously been created by the user.

I would suggest a couple of changes to how we are defining it.

  1. What we call "blueprint" in code and "site definition" here is actually the definition/template/mould for a specific subject type, so perhaps we should finalize its name as something like SubjectDef, SubjectTemplate, SubjectMould, or SubjectFootprint?

  2. Crawling is not tied to a subject, since we don't have a means of selectively crawling pages that belong to a specific subject type. Additionally, we would crawl only once and then figure out which SubjectDef a page matches. Hence, we should define crawling as just the act of fetching pages for processing, with the side effect of discovering more URLs to fetch. Processing is where all the logic resides (see the sketch after this list).
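
As a rough illustration of that split, here is a hedged sketch of the "processing" step, assuming a hypothetical SubjectDef shape with a CSS selector; neither the interface nor the matching strategy comes from the actual codebase:

```ts
// Hypothetical SubjectDef shape: a name plus a CSS selector that identifies
// pages of that subject type. Purely an assumption for illustration.
interface SubjectDef {
  name: string;
  matchSelector: string;
}

// Processing: crawling only fetched the raw HTML; here we figure out, after
// the fact, which SubjectDef (if any) the page matches.
function matchSubjectDef(html: string, defs: SubjectDef[]): SubjectDef | null {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return defs.find((def) => doc.querySelector(def.matchSelector) !== null) ?? null;
}
```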

Some more notes on crawling:

The starting point (first URL) would always be the main/home page of the website. Additionally, we can have the user provide URLs, which are added to the crawling queue and can lead to the discovery of more URLs. This can happen right on the screen where crawling has begun and we show progress/stats.


It would be simpler to have the user define all SubjectDefs before crawling begins. For future use cases where a new SubjectDef is defined after crawling, we can either reuse the findings of the previous crawl (this would require us to store raw HTML irrespective of whether a given HTML page was liberated, i.e. whether we found a matching SubjectDef for it) or crawl again (which requires making crawling idempotent).

The mental model I was working with in #131:

Crawling happens on the frontend, with WP as its storage mechanism.

  • HTML is best parsed in the browser using DOMParser
  • Two API endpoints are defined:
    • Get next URL to crawl
    • Queue URLs (also marks a successful crawl for the page)
  • Rate limiting
    • 1 req/s by default
    • The user can increase or decrease it via the UI
    • Handle 429s to slow down automatically (a rough version of this loop is sketched below)
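
A hedged sketch of that loop, assuming two hypothetical WP REST routes (/wp-json/crawler/v1/next and /wp-json/crawler/v1/queue); the route names, payload shapes, and link extraction are assumptions, not the endpoints actually defined in #131:

```ts
const API = "/wp-json/crawler/v1"; // hypothetical REST namespace
let delayMs = 1000; // 1 req/s by default; the UI could adjust this

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlLoop(): Promise<void> {
  while (true) {
    // Endpoint 1: ask WP for the next URL to crawl; stop when the queue is empty.
    const next = await fetch(`${API}/next`).then((r) => r.json());
    if (!next?.url) break;

    const res = await fetch(next.url);
    if (res.status === 429) {
      delayMs *= 2; // back off automatically when the site rate-limits us
      await sleep(delayMs);
      continue;
    }

    // Parse in the browser and collect same-origin links for the queue.
    const doc = new DOMParser().parseFromString(await res.text(), "text/html");
    const origin = new URL(next.url).origin;
    const links = Array.from(doc.querySelectorAll("a[href]"))
      .map((a) => new URL(a.getAttribute("href")!, next.url).href)
      .filter((href) => new URL(href).origin === origin);

    // Endpoint 2: queue discovered URLs; this also marks next.url as crawled.
    await fetch(`${API}/queue`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ crawled: next.url, discovered: links }),
    });

    await sleep(delayMs);
  }
}
```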

And since this is most likely not going to be blazingly fast, and will actually be pretty slow to begin with, considering both the rate limiting and our ingestion speed on the WP side, I think we have an opportunity to show a realtime preview of the pages being handled. We don't have to do this, but I personally would like to see it in the UI.
