Hi!

I'm trying to index all pages matching the mask https://github.com/red/red/issues/\d+. Expected to be 4002 pages. How hard can it be?
What I've tried:
1. Running the advanced crawler with a depth limit of 99 (it doesn't allow more), hoping that it will jump from page to page until it has loaded them all.
   Start page: https://github.com/red/red/issues?page=1&q=is%3Aissue
   Filter: https://github.com/red/REP/issues(\?page=\d+&q=is%3Aissue|/\d+)? to exclude the irrelevant pages (a quick sanity check of such filter patterns is sketched right after this list).
   In the end it indexes an arbitrary number of pages, 500-2000.

2. Splitting the same crawl across three start pages:
   Start-1: https://github.com/red/red/issues?page=1&q=is%3Aissue
   Start-2: https://github.com/red/red/issues?page=75&q=is%3Aissue
   Start-3: https://github.com/red/red/issues?page=150&q=is%3Aissue
   Same result, maybe in the range of 1000-2000 pages. The problem seems to be that GitHub sometimes fails to return a page and the whole crawl sequence breaks; re-loading the failed pages obviously won't let me crawl them again, so whole listing pages of issues are lost.

3. Listing all links of the form https://github.com/red/red/issues/\d+ from 1 to 5535 and running the advanced crawler on that link list with depth=0 (how such a list can be generated, and how the redirects can be spot-checked, is sketched below). This must be bulletproof, right? Wrong... The result is only around 100-200 pages. Most of the pages show a "link, detected from context" error in the Index Browser (whatever that means!?), even though these are correct issue web pages. Some (<30%) of the links redirect to the /pulls and /discussions subpaths, yet they still get indexed despite my efforts to avoid that by setting the filter to "subpath" or to the mask https://github.com/red/red/issues/\d+. The filter just gets ignored.
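For reference, this is the kind of quick check that can be run outside of YaCy to see which URLs a given filter pattern actually accepts. It is a minimal sketch: whether YaCy applies the filter as a full match over the whole URL is an assumption, and the sample URLs are made up. It also makes mismatches visible, e.g. the page filter as quoted above points at red/REP while the crawl targets red/red.

```python
# Minimal sanity check (outside of YaCy) of the two patterns quoted above, kept
# verbatim from the report. Assumption: the crawler applies them as a full match
# over the whole URL; the sample URLs are illustrations only.
import re

patterns = {
    "page filter": r"https://github.com/red/REP/issues(\?page=\d+&q=is%3Aissue|/\d+)?",
    "issue mask": r"https://github.com/red/red/issues/\d+",
}

samples = [
    "https://github.com/red/red/issues?page=1&q=is%3Aissue",  # listing page
    "https://github.com/red/red/issues/123",                  # issue page
    "https://github.com/red/red/pull/123",                    # a redirect target
]

for name, pattern in patterns.items():
    for url in samples:
        verdict = "accept" if re.fullmatch(pattern, url) else "reject"
        print(f"{name}: {verdict}  {url}")
```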
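And a sketch of how a flat link list like the one in attempt 3 can be produced, together with a quick spot check of where GitHub actually sends a few of those URLs. The file name, the sample issue numbers, and the use of the requests library are illustrative choices, not the exact setup used for the crawl.

```python
# Sketch: build the flat link list for the depth=0 crawl, then spot-check a few
# of the URLs. The file name, sample numbers, and the "requests" dependency are
# assumptions for illustration.
import requests

BASE = "https://github.com/red/red/issues/{}"

# Write one candidate URL per line, issue numbers 1..5535.
with open("red-issue-links.txt", "w") as f:
    for n in range(1, 5536):
        f.write(BASE.format(n) + "\n")

# Spot-check a few arbitrary numbers: the report says some of these URLs redirect
# to the /pulls and /discussions subpaths; following the redirect shows the final
# location each number resolves to.
for n in (100, 500, 1000):
    r = requests.head(BASE.format(n), allow_redirects=True, timeout=10)
    print(n, r.status_code, r.url)
```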
Typical log example: indexing-log-github.pdf
Another issue I notice is that whenever I start a new crawl, the previously crawled pages (in the same subpath) start to gradually disappear from the index, despite all settings being set to "don't delete anything".
I'd appreciate fixes for these issues, or instructions on what I can do to work around them.
Using a Docker install, default settings, Robinson mode.
yacy_v1.940_202407241507_d181b9e89