
Implement recursion #1603

Draft · wants to merge 8 commits into master

Conversation

@gwennlbh gwennlbh commented Jan 4, 2025

Closes #78

Current usage:

cargo install --git https://github.com/gwennlbh/lychee --branch recursion
lychee -R https://example.org --recursed-domains example.org 

Example on a small site (literally a friend's portfolio site, because it's what I pulled off the top of my head, haha):

cargo install --git https://github.com/gwennlbh/lychee --branch recursion
lychee -R https://portfolio.whidix.dev --recursed-domains portfolio.whidix.dev

The idea of the implementation is the following:

  1. We keep the two mpsc channels, and we just send more requests to the requests channel after getting new links from responses
  2. We let the already-working cache deal with cycle avoidance
  3. Responses also carry the URIs found in them, from which new requests are created (subsequent_uris in most of the code)
  4. An Arc<AtomicUsize> is used to keep track of the number of remaining requests to check (I guess this is similar to the Semaphore approach?):
  • In the response channel receiver loop:
    • If the response contains subsequent URIs: increment the counter by the number of URIs
    • Decrement the counter by 1
    • If the counter reaches 0, break out of the .recv() while loop
  • In the initial fill of the requests channel (done while iterating over the inputs stream; this was previously the function that closed the requests channel, and it is still named accordingly), we increment the counter for every request sent
  5. The .recv() loop cannot break right at the start: even if the counter is zero, no request has been processed yet, so we haven't reached the check in the loop body. A rough sketch of this counting scheme follows this list.
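For illustration, here is a minimal, self-contained sketch of that counting scheme (assumed names and simplified types, not the PR's actual code):

```rust
// Sketch only: two channels plus a shared counter of in-flight requests.
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use tokio::sync::mpsc;

struct Response {
    uri: String,
    subsequent_uris: Vec<String>, // URIs discovered in the response body
}

#[tokio::main]
async fn main() {
    let (request_tx, mut request_rx) = mpsc::channel::<String>(64);
    let (response_tx, mut response_rx) = mpsc::channel::<Response>(64);
    let remaining = Arc::new(AtomicUsize::new(0));

    // Initial fill: one request per input, incrementing the counter for each send.
    for uri in ["https://example.org/".to_string()] {
        remaining.fetch_add(1, Ordering::SeqCst);
        request_tx.send(uri).await.unwrap();
    }

    // Checker task: turns requests into responses (a real checker would do HTTP here).
    let checker_tx = response_tx.clone();
    tokio::spawn(async move {
        while let Some(uri) = request_rx.recv().await {
            let subsequent_uris = if uri == "https://example.org/" {
                vec!["https://example.org/about".to_string()]
            } else {
                vec![]
            };
            checker_tx.send(Response { uri, subsequent_uris }).await.unwrap();
        }
    });

    // Response receiver loop: refill the requests channel and track the counter.
    while let Some(response) = response_rx.recv().await {
        // Increment by the number of newly discovered URIs...
        remaining.fetch_add(response.subsequent_uris.len(), Ordering::SeqCst);
        for uri in response.subsequent_uris {
            request_tx.send(uri).await.unwrap();
        }
        println!("checked {}", response.uri);
        // ...then decrement for the response we just handled; zero means done.
        if remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            break;
        }
    }
}
```

The point is that the counter only reaches zero after the last in-flight response has been handled, so the receiver loop can terminate without the requests channel having to be closed up front.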

What's WIP:

  • Domain checking (for now the flag has to be specified, and it's a strict check only, it doesn't allow subdomains)
  • Max recursion depth flag
  • Max requests per second to avoid rate limiting? I saw GitHub-specific handling, but imho we should have a more general solution alongside the GitHub handling
  • Renaming some functions, cleaning up debug print statements
  • Using something other than Vec<Uri> for the subsequent URIs? This gets passed around a lot, so it's probably inefficient
  • Trying not to break the (public?) API of lychee-lib too much? The current implementation adds a new method to the Handler trait...
  • Adding ✨ tests ✨

@gwennlbh gwennlbh (Author) commented Jan 5, 2025

Seems like the formatting rules from my cargo install differ from CI's? I get nothing when running cargo fmt locally...

```diff
@@ -181,24 +193,56 @@ where
     Ok(())
 }

-/// Reads from the request channel and updates the progress bar status
+/// Reads from the response channel, updates the progress bar status and (if recursing) sends new requests.
 async fn progress_bar_task(
```

@nobkd nobkd Jan 5, 2025


note: recursing -> recurring / recursive

@mre mre (Member) commented Jan 6, 2025

Great progress on implementing recursion! The core implementation using request/response channels with the Arc<AtomicUsize> counter for tracking remaining requests looks solid. Let me address the open questions and design decisions:

  1. Domain handling:

    • For --recursed-domains, let's keep it simple and not support subdomain checking. For a link checker, strict domain matching makes more sense, as it gives users precise control over the scope (a rough sketch of such a check follows after this list).
    • Default behavior for --recursive without --recursed-domains: We should recurse into all input URLs provided as command line arguments. For files/paths as input, we can ignore them for now since recursive handling of file systems would be a separate concern.
  2. Rate limiting: Let's keep this out of scope for this PR. While we'd love to have a general rate limiting solution in the future (using a "host" proxy pattern that can understand rate-limit headers and manage per-server queues), that's better handled as a separate feature. It's orthogonal to implementing recursion and would deserve its own focused PR.

  3. Implementation details:

    • Using Vec<Uri> for subsequent URIs is perfectly fine. Given our usage pattern (collect URLs, process them, move on), it's an efficient choice. The cache already handles deduplication, so we don't need a Set-like structure here. If the data structure turns out to be the bottleneck during profiling, we could think of tinyvec (or similar) as an alternative.
    • Regarding the max-concurrency issues: rather than working around it with high values, we need to fix the underlying issue where sends on the Tokio channel block forever once the channel fills up. This should be addressed at the implementation level rather than by recommending particular concurrency values.
  4. API stability: Breaking changes to lychee-lib's public API (including adding methods to the Handler trait) are acceptable. Since we only have a single Python library depending on an older version and it's not actively maintained, we have flexibility here.

  5. Max recursion depth:

    • Default: 5 levels (this covers most typical site structures)
    • Should accept any non-negative integer (0 disables the max recursion depth entirely)
    • No maximum limit - users can set whatever depth makes sense for their use case
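Regarding point 1, here is a rough sketch of what strict (non-subdomain) host matching could look like, using the url crate; the function name and the shape of the flag value are assumptions, not the PR's actual API:

```rust
use url::Url;

/// Returns true if `candidate` should be recursed into, i.e. its host matches
/// one of the allowed domains exactly (subdomains do not match).
fn should_recurse(candidate: &Url, recursed_domains: &[String]) -> bool {
    match candidate.host_str() {
        Some(host) => recursed_domains.iter().any(|domain| domain == host),
        None => false,
    }
}

fn main() {
    let allowed = vec!["example.org".to_string()];
    let same = Url::parse("https://example.org/about").unwrap();
    let sub = Url::parse("https://blog.example.org/").unwrap();
    assert!(should_recurse(&same, &allowed));
    assert!(!should_recurse(&sub, &allowed)); // strict match: subdomain is rejected
    println!("strict matching behaves as described");
}
```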

I hope that I covered all open points for now. Let me know if I missed anything. 😃

@mre mre (Member) commented Jan 6, 2025

> Seems like the formatting rules from my cargo install differ from CI's? I get nothing when running cargo fmt locally...

Probably. We don't have any specific rustfmt settings. Can you disable your global config to see if that changes things?
Not sure if it works, but you could try

RUSTFMT_CONFIG_PATH=/dev/null cargo fmt

@gwennlbh gwennlbh (Author) commented Jan 7, 2025

Okay, so I figured out why it hung, and as you guessed it was a backpressure problem: I was waiting on the subsequent URI sends before finishing processing a response, so the response-receiver task would never move on to the next response, resulting in a lock-up. I fixed this by putting the requests channel "refill" in a tokio::spawn (rough sketch below). I don't know if spawning that many tasks is a good idea in Tokio, but I guess one of the points of using an async runtime is that it manages these things for us?
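A minimal sketch of that pattern, spawning the "refill" so the receiver loop keeps draining responses even when the requests channel is at capacity (request_tx and the Vec<String> payload are stand-ins, not the PR's actual types):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (request_tx, mut request_rx) = mpsc::channel::<String>(8);
    let (response_tx, mut response_rx) = mpsc::channel::<Vec<String>>(8);

    // Pretend a checker turned one response into a batch of newly found URIs.
    response_tx
        .send(vec!["https://example.org/a".into(), "https://example.org/b".into()])
        .await
        .unwrap();
    drop(response_tx);

    while let Some(subsequent_uris) = response_rx.recv().await {
        // Refill the requests channel in a spawned task so this loop is not
        // blocked if `send` has to wait for channel capacity (the backpressure lock-up).
        let tx = request_tx.clone();
        tokio::spawn(async move {
            for uri in subsequent_uris {
                let _ = tx.send(uri).await;
            }
        });
        // ...continue processing the response (stats, progress bar, ...) here.
    }

    // Drain what the spawned task enqueued, just to show it arrived.
    drop(request_tx);
    while let Some(uri) = request_rx.recv().await {
        println!("queued: {uri}");
    }
}
```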

Unfortunately the problem has not completely gone away: for some reason it locks up on the last URL on en.wikipedia.org with --max-depth=0 (which is functionally the same as not activating recursion¹), but works without --recursive...

There's also the issue of how we should handle a "cached" response in recursive mode: right now it's reported as a cached response, as if it came from .lycheecache, but this isn't really true. I'm tempted to make a separate Arc<Cache> to serve as a dedicated "recursion cache", because you could have "real" cached responses from the file, and the result stats right now are "polluted" with duplicate "Error (cached)" entries.

Because of the parallel nature of the request-to-response task, it seems hard to prevent sending the same request twice to the channel. I tried adding guards basically everywhere (when collecting the list of subsequent URIs to add to the requests channel, before processing a request, ...) and I still seem to get duplicates. It might be a skill issue, though. Or maybe we should use an Arc<Mutex<...>> instead of just an Arc for the cache? I'll have to try that later. (A rough sketch of the kind of guard I have in mind is below.)
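For reference, here is a tiny sketch of that kind of guard: a shared set behind a mutex where the check and the insert happen atomically, so two parallel tasks cannot both decide to send the same URI (SeenUris is an illustrative stand-in, not the existing cache type):

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

#[derive(Clone, Default)]
struct SeenUris(Arc<Mutex<HashSet<String>>>);

impl SeenUris {
    /// Returns true only the first time a URI is seen; check and insert happen
    /// under the same lock, so parallel callers cannot both "win".
    fn first_time(&self, uri: &str) -> bool {
        self.0.lock().unwrap().insert(uri.to_string())
    }
}

fn main() {
    let seen = SeenUris::default();
    assert!(seen.first_time("https://example.org/a"));
    assert!(!seen.first_time("https://example.org/a")); // duplicate send is skipped
    println!("duplicate was rejected");
}
```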

Finally, I'm a bit on the fence about this, but for now I added an .insert method to the Stats struct to prevent adding duplicate entries for the same URL, as a stopgap measure for the problem mentioned above.

I'll get around to working on this again sometime this week (maybe), and I think I'll start by writing tests instead of checking manually, because that's becoming cumbersome, and this way I'll also make progress on the tests themselves.

By the way, I discovered tokio-console and tried to use it to debug the backpressure bug. While it was not extremely helpful, it did let me see whether some tasks were still doing work or whether nothing was going on anymore. So right now I have some dirty git changes related to the tokio-console setup (another dependency and a call to a hook/setup function in main()). Do you want me to commit these (with the tracing enabled for the dev profile only, of course)?

Footnotes

  1. Whoops, sorry, I just re-read your response; I'll adjust it so that inf=0 and root=1, instead of root=0

```rust
match response.text().await {
    Err(_) => (status, vec![]),
    Ok(response_text) => {
        let links: Vec<_> = Collector::new(None, Some(Base::Remote(base_url)))
```

@gwennlbh (Author)

todo: use the collector that's constructed around main, since some flags are passed to it, while we are outright ignoring a bunch of flags there

@mre mre (Member) commented Jan 8, 2025

Thanks for the detailed update @ewen-lbh! Let me address your points:

  1. The tokio::spawn approach for handling subsequent URI sends makes sense. Under the hood, Tokio maintains a work-stealing scheduler that efficiently distributes these tasks across worker threads. When we spawn tasks for URI sends, they'll be queued and executed when a worker thread becomes available, preventing blocking. The runtime handles backpressure and task scheduling automatically.

  2. For the Wikipedia lockup issue with --max-depth=0: I suspect the issue might be that we're still filling in subsequent requests even when depth=0, even though we don't process them. This creates unnecessary overhead and could explain the different behavior from non-recursive mode. Could you add a test case to verify this theory?

  3. Regarding cache handling: If I understand correctly, this is mainly about stats "pollution" rather than making unnecessary requests? In that case, we can probably defer this for now, since it's mostly a reporting issue. We'll likely address this naturally when we implement per-host stats (tracked in #1605, "Add Per-Host Rate Limiting and Caching"). If we do want to add a cache, we could integrate it with the channel improvements discussed in #1593 ("The cache is ineffective with the default concurrency, for links in a website's theme").

  4. For the duplicate requests: Using an Arc<Mutex<Cache>> could help, but let's first try to identify where exactly the race condition occurs. Could you add some debug logging around the cache checks and request sending? I suspect we might have a small window where parallel tasks see the cache state before it's updated.

  5. The Stats::insert method as a stopgap is reasonable for now, though we should mark it as a temporary solution with a TODO comment.

  6. About tokio-console: Ah, right, I forgot we removed it in #1524 ("fix: Remove tokio console subscriber") due to the tokio_unstable cfg conflicts with the Arch build. Unless we can find a way to cleanly handle the Arch build issues, we should probably hold off on reintroducing it. Let's focus on adding targeted debug logging instead.

I enjoy reading the progress reports. 👍

Successfully merging this pull request may close these issues: Add recursive option