Hi!

I'm trying to index all pages matching the mask https://github.com/red/red/issues/\d+. Expected to be 4002 pages. How hard can it be?
What I've tried:
1. Running the advanced crawler with a depth limit of 99 (it doesn't allow more), hoping that it will jump from page to page until it has loaded them all.
   Start page: https://github.com/red/red/issues?page=1&q=is%3Aissue
   Filter: https://github.com/red/REP/issues(\?page=\d+&q=is%3Aissue|/\d+)? to exclude the irrelevant pages (a quick sanity check of such filter patterns is sketched right after this list).
   In the end it indexes an arbitrary number of pages, 500-2000.

2. Splitting the same crawl across three start pages:
   Start-1: https://github.com/red/red/issues?page=1&q=is%3Aissue
   Start-2: https://github.com/red/red/issues?page=75&q=is%3Aissue
   Start-3: https://github.com/red/red/issues?page=150&q=is%3Aissue
   Same result, maybe in the range of 1000-2000 pages. The problem seems to be that GitHub sometimes fails to return a page and the whole crawl sequence breaks; re-loading the failed pages obviously won't let me crawl them again, so whole listing pages of issues are lost.

3. Listing all links of the form https://github.com/red/red/issues/\d+ from 1 to 5535 and running the advanced crawler on that link list with depth=0 (how such a list can be generated, and how the redirects can be spot-checked, is sketched below). This must be bulletproof, right? Wrong... The result is only around 100-200 pages. Most of the pages show a "link, detected from context" error in the Index Browser (whatever that means!?), even though these are correct issue web pages. Some (<30%) of the links redirect to the /pulls and /discussions subpaths, yet they still get indexed despite my efforts to avoid that by setting the filter to "subpath" or to the mask https://github.com/red/red/issues/\d+. The filter just gets ignored.
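For reference, this is the kind of quick check that can be run outside of YaCy to see which URLs a given filter pattern actually accepts. It is a minimal sketch: whether YaCy applies the filter as a full match over the whole URL is an assumption, and the sample URLs are made up. It also makes mismatches visible, e.g. the page filter as quoted above points at red/REP while the crawl targets red/red.

```python
# Minimal sanity check (outside of YaCy) of the two patterns quoted above, kept
# verbatim from the report. Assumption: the crawler applies them as a full match
# over the whole URL; the sample URLs are illustrations only.
import re

patterns = {
    "page filter": r"https://github.com/red/REP/issues(\?page=\d+&q=is%3Aissue|/\d+)?",
    "issue mask": r"https://github.com/red/red/issues/\d+",
}

samples = [
    "https://github.com/red/red/issues?page=1&q=is%3Aissue",  # listing page
    "https://github.com/red/red/issues/123",                  # issue page
    "https://github.com/red/red/pull/123",                    # a redirect target
]

for name, pattern in patterns.items():
    for url in samples:
        verdict = "accept" if re.fullmatch(pattern, url) else "reject"
        print(f"{name}: {verdict}  {url}")
```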
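And a sketch of how a flat link list like the one in attempt 3 can be produced, together with a quick spot check of where GitHub actually sends a few of those URLs. The file name, the sample issue numbers, and the use of the requests library are illustrative choices, not the exact setup used for the crawl.

```python
# Sketch: build the flat link list for the depth=0 crawl, then spot-check a few
# of the URLs. The file name, sample numbers, and the "requests" dependency are
# assumptions for illustration.
import requests

BASE = "https://github.com/red/red/issues/{}"

# Write one candidate URL per line, issue numbers 1..5535.
with open("red-issue-links.txt", "w") as f:
    for n in range(1, 5536):
        f.write(BASE.format(n) + "\n")

# Spot-check a few arbitrary numbers: the report says some of these URLs redirect
# to the /pulls and /discussions subpaths; following the redirect shows the final
# location each number resolves to.
for n in (100, 500, 1000):
    r = requests.head(BASE.format(n), allow_redirects=True, timeout=10)
    print(n, r.status_code, r.url)
```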
Typical log example: indexing-log-github.pdf
Another issue I notice is that whenever I start a new crawl, the previously crawled pages (in the same subpath) start to gradually disappear from the index, despite all settings being set to "don't delete anything".
I'd appreciate fixes for these issues, or instructions on what I can do to work around them.
Using a Docker install, default settings, Robinson mode.
yacy_v1.940_202407241507_d181b9e89