
Implement recursion #1603

Draft · wants to merge 8 commits into master

Conversation

@gwennlbh gwennlbh commented Jan 4, 2025

Closes #78

Current usage:

cargo install --git https://github.com/gwennlbh/lychee --branch recursion
lychee -R https://example.org --recursed-domains example.org 

Example on a small site (literally a friend's portfolio site, because it's what I pulled off the top of my head, haha):

cargo install --git https://github.com/gwennlbh/lychee --branch recursion
lychee -R https://portfolio.whidix.dev --recursed-domains portfolio.whidix.dev

The idea of the implementation is the following:

  1. We keep the two mpsc channels, and we just send more requests to the requests channel after getting new links from responses
  2. We let the already-working cache deal with cycle avoidance
  3. Responses also carry the URIs found in them, from which new requests are created (subsequent_uris in most of the code)
  4. An Arc<AtomicUsize> is used to keep track of the number of remaining requests to check (I guess this is similar to the Semaphore approach?):
  • In the response channel receiver loop:
    • If the response contains subsequent URIs: increment the counter by the number of URIs
    • Decrement the counter by 1
    • If the counter reaches 0, break out of the .recv() while loop
  • In the initial fill of the requests channel (done while iterating over the inputs stream; this was previously the function that closed the requests channel, and it is still named accordingly), we increment the counter for every request sent
  5. The .recv() loop cannot break right at the start: even if the counter is zero, no request has been processed yet, so we haven't reached the check in the loop body. A rough sketch of this counting scheme follows this list.
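For illustration, here is a minimal, self-contained sketch of that counting scheme (assumed names and simplified types, not the PR's actual code):

```rust
// Sketch only: two channels plus a shared counter of in-flight requests.
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::mpsc;

struct Response {
    uri: String,
    subsequent_uris: Vec<String>, // URIs discovered in the response body
}

#[tokio::main]
async fn main() {
    let (request_tx, mut request_rx) = mpsc::channel::<String>(64);
    let (response_tx, mut response_rx) = mpsc::channel::<Response>(64);
    let remaining = Arc::new(AtomicUsize::new(0));

    // Initial fill: one request per input, incrementing the counter for each send.
    for uri in ["https://example.org/".to_string()] {
        remaining.fetch_add(1, Ordering::SeqCst);
        request_tx.send(uri).await.unwrap();
    }

    // Checker task: turns requests into responses (a real checker would do HTTP here).
    let checker_tx = response_tx.clone();
    tokio::spawn(async move {
        while let Some(uri) = request_rx.recv().await {
            let subsequent_uris = if uri == "https://example.org/" {
                vec!["https://example.org/about".to_string()]
            } else {
                vec![]
            };
            checker_tx.send(Response { uri, subsequent_uris }).await.unwrap();
        }
    });

    // Response receiver loop: refill the requests channel and track the counter.
    while let Some(response) = response_rx.recv().await {
        // Increment by the number of newly discovered URIs...
        remaining.fetch_add(response.subsequent_uris.len(), Ordering::SeqCst);
        for uri in response.subsequent_uris {
            request_tx.send(uri).await.unwrap();
        }
        println!("checked {}", response.uri);
        // ...then decrement for the response we just handled; zero means done.
        if remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            break;
        }
    }
}
```

The point is that the counter only reaches zero after the last in-flight response has been handled, so the receiver loop can terminate without the requests channel having to be closed up front.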

What's WIP:

  • Domain checking (for now the flag has to be specified, and it's a strict check only, it doesn't allow subdomains)
  • Max recursion depth flag
  • Max requests per second to avoid rate limiting? I saw GitHub-specific handling, but imho we should have a more general solution alongside the GitHub handling
  • Renaming some functions, cleaning up debug print statements
  • Using something other than Vec<Uri> for the subsequent URIs? This gets passed around a lot, so it's probably inefficient
  • Trying not to break the (public?) API of lychee-lib too much? The current implementation adds a new method to the Handler trait...
  • Adding ✨ tests ✨

@gwennlbh gwennlbh (Author) commented Jan 5, 2025

Seems like the formatting rules from my cargo install differ from CI's? I get nothing when running cargo fmt locally...

```diff
@@ -181,24 +193,56 @@ where
     Ok(())
 }

-/// Reads from the request channel and updates the progress bar status
+/// Reads from the response channel, updates the progress bar status and (if recursing) sends new requests.
 async fn progress_bar_task(
```

@nobkd nobkd Jan 5, 2025


note: recursing -> recurring / recursive

@mre mre (Member) commented Jan 6, 2025

Great progress on implementing recursion! The core implementation using request/response channels with the Arc<AtomicUsize> counter for tracking remaining requests looks solid. Let me address the open questions and design decisions:

  1. Domain handling:

    • For --recursed-domains, let's keep it simple and not support subdomain checking. For a link checker, strict domain matching makes more sense, as it gives users precise control over the scope (a rough sketch of such a check follows after this list).
    • Default behavior for --recursive without --recursed-domains: We should recurse into all input URLs provided as command line arguments. For files/paths as input, we can ignore them for now since recursive handling of file systems would be a separate concern.
  2. Rate limiting: Let's keep this out of scope for this PR. While we'd love to have a general rate limiting solution in the future (using a "host" proxy pattern that can understand rate-limit headers and manage per-server queues), that's better handled as a separate feature. It's orthogonal to implementing recursion and would deserve its own focused PR.

  3. Implementation details:

    • Using Vec<Uri> for subsequent URIs is perfectly fine. Given our usage pattern (collect URLs, process them, move on), it's an efficient choice. The cache already handles deduplication, so we don't need a Set-like structure here. If the data structure turns out to be the bottleneck during profiling, we could think of tinyvec (or similar) as an alternative.
    • Regarding the max-concurrency issues: rather than working around it with high values, we need to fix the underlying issue where sends on the Tokio channel block forever once the channel fills up. This should be addressed at the implementation level rather than by recommending particular concurrency values.
  4. API stability: Breaking changes to lychee-lib's public API (including adding methods to the Handler trait) are acceptable. Since we only have a single Python library depending on an older version and it's not actively maintained, we have flexibility here.

  5. Max recursion depth:

    • Default: 5 levels (this covers most typical site structures)
    • Should accept any non-negative integer (0 disables the max recursion depth entirely)
    • No maximum limit - users can set whatever depth makes sense for their use case
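Regarding point 1, here is a rough sketch of what strict (non-subdomain) host matching could look like, using the url crate; the function name and the shape of the flag value are assumptions, not the PR's actual API:

```rust
use url::Url;

/// Returns true if `candidate` should be recursed into, i.e. its host matches
/// one of the allowed domains exactly (subdomains do not match).
fn should_recurse(candidate: &Url, recursed_domains: &[String]) -> bool {
    match candidate.host_str() {
        Some(host) => recursed_domains.iter().any(|domain| domain == host),
        None => false,
    }
}

fn main() {
    let allowed = vec!["example.org".to_string()];
    let same = Url::parse("https://example.org/about").unwrap();
    let sub = Url::parse("https://blog.example.org/").unwrap();
    assert!(should_recurse(&same, &allowed));
    assert!(!should_recurse(&sub, &allowed)); // strict match: subdomain is rejected
    println!("strict matching behaves as described");
}
```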

I hope that I covered all open points for now. Let me know if I missed anything. 😃

@mre mre (Member) commented Jan 6, 2025

> Seems like the formatting rules from my cargo install differ from CI's? I get nothing when running cargo fmt locally...

Probably. We don't have any specific rustfmt settings. Can you disable your global config to see if that changes things?
Not sure if it works, but you could try

RUSTFMT_CONFIG_PATH=/dev/null cargo fmt

@gwennlbh gwennlbh (Author) commented Jan 7, 2025

Okay, so I figured out why it hung, and as you guessed it was a backpressure problem: I was waiting on the subsequent URI sends before finishing processing a response, so the response-receiver task would never move on to the next response, resulting in a lock-up. I fixed this by putting the requests channel "refill" in a tokio::spawn (rough sketch below). I don't know if spawning that many tasks is a good idea in Tokio, but I guess one of the points of using an async runtime is that it manages these things for us?
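A minimal sketch of that pattern, spawning the "refill" so the receiver loop keeps draining responses even when the requests channel is at capacity (request_tx and the Vec<String> payload are stand-ins, not the PR's actual types):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (request_tx, mut request_rx) = mpsc::channel::<String>(8);
    let (response_tx, mut response_rx) = mpsc::channel::<Vec<String>>(8);

    // Pretend a checker turned one response into a batch of newly found URIs.
    response_tx
        .send(vec!["https://example.org/a".into(), "https://example.org/b".into()])
        .await
        .unwrap();
    drop(response_tx);

    while let Some(subsequent_uris) = response_rx.recv().await {
        // Refill the requests channel in a spawned task so this loop is not
        // blocked if `send` has to wait for channel capacity (the backpressure lock-up).
        let tx = request_tx.clone();
        tokio::spawn(async move {
            for uri in subsequent_uris {
                let _ = tx.send(uri).await;
            }
        });
        // ...continue processing the response (stats, progress bar, ...) here.
    }

    // Drain what the spawned task enqueued, just to show it arrived.
    drop(request_tx);
    while let Some(uri) = request_rx.recv().await {
        println!("queued: {uri}");
    }
}
```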

Unfortunately the problem has not completely gone away: for some reason it locks up on the last URL on en.wikipedia.org with --max-depth=0 (which is functionally the same as not activating recursion¹), but works without --recursive...

There's also the issue of how we should handle a "cached" response in recursive mode: right now it's reported as a cached response, as if it came from .lycheecache, but this isn't really true. I'm tempted to make a separate Arc<Cache> to serve as a dedicated "recursion cache", because you could have "real" cached responses from the file, and the result stats right now are "polluted" with duplicate "Error (cached)" entries.

Because of the parallel nature of the request-to-response task, it seems hard to prevent sending the same request twice to the channel. I tried adding guards basically everywhere (when collecting the list of subsequent URIs to add to the requests channel, before processing a request, ...) and I still seem to get duplicates. It might be a skill issue, though. Or maybe we should use an Arc<Mutex<...>> instead of just an Arc for the cache? I'll have to try that later. (A rough sketch of the kind of guard I have in mind is below.)
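For reference, here is a tiny sketch of that kind of guard: a shared set behind a mutex where the check and the insert happen atomically, so two parallel tasks cannot both decide to send the same URI (SeenUris is an illustrative stand-in, not the existing cache type):

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

#[derive(Clone, Default)]
struct SeenUris(Arc<Mutex<HashSet<String>>>);

impl SeenUris {
    /// Returns true only the first time a URI is seen; check and insert happen
    /// under the same lock, so parallel callers cannot both "win".
    fn first_time(&self, uri: &str) -> bool {
        self.0.lock().unwrap().insert(uri.to_string())
    }
}

fn main() {
    let seen = SeenUris::default();
    assert!(seen.first_time("https://example.org/a"));
    assert!(!seen.first_time("https://example.org/a")); // duplicate send is skipped
    println!("duplicate was rejected");
}
```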

Finally, I'm a bit on the fence about this, but for now I added an .insert method to the Stats struct to prevent adding duplicate entries for the same URL, as a stopgap measure for the problem mentioned above.

I'll get around to working on this again sometime this week (maybe), and I think I'll start by writing tests instead of checking manually, because that's becoming cumbersome, and this way I'll also make progress on the tests themselves.

By the way, I discovered tokio-console and tried to use it to debug the backpressure bug. While it was not extremely helpful, it did let me see whether some tasks were still doing work or whether nothing was going on anymore. So right now I have some dirty git changes related to the tokio-console setup (another dependency and a call to a hook/setup function in main()). Do you want me to commit these (with the tracing enabled for the dev profile only, of course)?

Footnotes

  1. Whoops, sorry, I just re-read your response; I'll adjust it so that inf=0 and root=1, instead of root=0

```rust
match response.text().await {
    Err(_) => (status, vec![]),
    Ok(response_text) => {
        let links: Vec<_> = Collector::new(None, Some(Base::Remote(base_url)))
```

@gwennlbh (Author)

todo: use the collector that's constructed around main, since some flags are passed to it, while we are outright ignoring a bunch of flags there

@mre mre (Member) commented Jan 8, 2025

Thanks for the detailed update @ewen-lbh! Let me address your points:

  1. The tokio::spawn approach for handling subsequent URI sends makes sense. Under the hood, Tokio maintains a work-stealing scheduler that efficiently distributes these tasks across worker threads. When we spawn tasks for URI sends, they'll be queued and executed when a worker thread becomes available, preventing blocking. The runtime handles backpressure and task scheduling automatically.

  2. For the Wikipedia lockup issue with --max-depth=0: I suspect the issue might be that we're still filling in subsequent requests even when depth=0, even though we don't process them. This creates unnecessary overhead and could explain the different behavior from non-recursive mode. Could you add a test case to verify this theory?

  3. Regarding cache handling: If I understand correctly, this is mainly about stats "pollution" rather than making unnecessary requests? In that case, we can probably defer this for now, since it's mostly a reporting issue. We'll likely address this naturally when we implement per-host stats (tracked in #1605, "Add Per-Host Rate Limiting and Caching"). If we do want to add a cache, we could integrate it with the channel improvements discussed in #1593 ("The cache is ineffective with the default concurrency, for links in a website's theme").

  4. For the duplicate requests: Using an Arc<Mutex<Cache>> could help, but let's first try to identify where exactly the race condition occurs. Could you add some debug logging around the cache checks and request sending? I suspect we might have a small window where parallel tasks see the cache state before it's updated.

  5. The Stats::insert method as a stopgap is reasonable for now, though we should mark it as a temporary solution with a TODO comment.

  6. About tokio-console: Ah, right, I forgot we removed it in #1524 ("fix: Remove tokio console subscriber") due to the tokio_unstable cfg conflicts with the Arch build. Unless we can find a way to cleanly handle the Arch build issues, we should probably hold off on reintroducing it. Let's focus on adding targeted debug logging instead.

I enjoy reading the progress reports. 👍

Successfully merging this pull request may close these issues: Add recursive option