Crawling #140
I think we should look at Crawling as two separate stages:
I can think of a variety of methods to discover URLs:
The output of the first stage would be a list of URLs:
... which would be the input of the second stage. This two-stage solution would make it possible to re-run ingestion without re-running discovery.
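The two-stage split described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation; the function names (`discoverUrls`, `ingest`) and the link-extraction approach are assumptions made for the example.

```typescript
// Stage 1: discovery. For illustration we only extract links from a
// fetched HTML string; a real crawler would also fetch pages, follow
// links recursively, and respect rate limits.
function discoverUrls(html: string, baseUrl: string): string[] {
  const urls = new Set<string>();
  const hrefPattern = /href="([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    // Resolve relative links against the base URL; the Set deduplicates.
    urls.add(new URL(match[1], baseUrl).toString());
  }
  return [...urls];
}

// Stage 2: ingestion takes the URL list as its input, so it can be
// re-run later without repeating discovery.
async function ingest(
  urls: string[],
  handle: (url: string) => Promise<void>
): Promise<void> {
  for (const url of urls) {
    await handle(url);
  }
}

const urls = discoverUrls(
  '<a href="/about">About</a> <a href="/about">Duplicate</a>',
  "https://example.com"
);
console.log(urls); // deduplicated absolute URLs
```

The key design point is that the boundary between the stages is just a list of URLs, so either side can be swapped out or re-run independently.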
I would suggest a couple of changes to how we are defining it.
Some more notes on crawling:

The starting point (first URL) would always be the main/home page of the website. Additionally, we can have the user provide URLs which are added to the crawling queue and can lead to the discovery of more URLs. This can happen right on the screen where crawling has begun and we show progress/stats.

It would be simpler to have the user define all SubjectDefs before crawling begins. For future use cases where a new SubjectDef is defined after crawling, we can either reuse the findings of the previous crawl (this would require us to store raw HTML irrespective of whether a certain HTML page was liberated, i.e. whether we found a matching SubjectDef for it) or crawl again (which requires making crawling idempotent in nature).

Mental model I was working with in #131: crawling happens on the frontend side, with WP as its storage mechanism.
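The queue behaviour described above (seeded with the home page, open to user-provided URLs, and idempotent so a re-crawl or re-enqueue is harmless) could be sketched like this. The `CrawlQueue` class and its method names are illustrative assumptions, not part of the codebase.

```typescript
// A crawl queue seeded with the home page. A "seen" set makes
// enqueueing idempotent: re-adding a known URL is a no-op, which is
// one way to make crawling safe to run again.
class CrawlQueue {
  private queue: string[] = [];
  private seen = new Set<string>();

  constructor(homeUrl: string) {
    // The starting point is always the site's main/home page.
    this.enqueue(homeUrl);
  }

  // User-provided URLs and discovered URLs enter through the same path.
  // Returns false if the URL was already known.
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  next(): string | undefined {
    return this.queue.shift();
  }

  // Useful for the progress/stats screen mentioned above.
  get pending(): number {
    return this.queue.length;
  }
}

const q = new CrawlQueue("https://example.com/");
q.enqueue("https://example.com/user-added");
q.enqueue("https://example.com/"); // duplicate, ignored
console.log(q.pending); // 2
```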
And since this is most likely not going to be blazingly fast (actually pretty slow to begin with, considering both the rate limiting and our ingestion speed on the WP side), I think we have an opportunity to show a realtime preview of pages as they are handled. We don't have to do this, but I personally would like to see it in the UI.
Crawling is the operation that automatically imports all subjects of a given type from a "site definition" that has previously been created by the user.