-
-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement recursion #1603
base: master
Are you sure you want to change the base?
Implement recursion #1603
Conversation
seems like formatting rules from my cargo install differs from ci's? i get nothing when running cargo fmt locally... |
@@ -181,24 +193,56 @@ where | |||
Ok(()) | |||
} | |||
|
|||
/// Reads from the request channel and updates the progress bar status | |||
async fn progress_bar_task( | |||
/// Reads from the response channel, updates the progress bar status and (if recursing) sends new requests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: recursing
-> recurring
/ recursive
Great progress on implementing recursion! The core implementation using request/response channels with the
I hope that I covered all open points for now. Let me know if I missed anything. 😃 |
Probably. We don't have any specific rustfmt settings. Can you disable your global config to see if that changes things? RUSTFMT_CONFIG_PATH=/dev/null cargo fmt |
okay, so i figured out why it hang, and it was as you guessed a backpressure problem, I was waiting on the subsequent URI sends before finishing to process a response, so the response-receiver task would not move on to process the response, resulting in a lock-up. I fixed this by putting the requests channel "refill" in a tokio::spawn. I don't know if having that much tasks spawned is a good idea in Tokio, but I guess one of the points of using a async runtime is that it manages these things for us? unfortunately the problem has not completely gone away, for some reason it locks up on the last URL on en.wikipedia.org with --max-depth=0 (which is functionnaly the same as not activating recursion1), but works without --recursive... theres also the issue of how we should handle a "cached" response in recursive mode: right now it's considered a cached response as if it came from .lycheecache, but this isn't really true. I'm tempted to make a separate Because of the parallel nature of the request-to-response task, it seems to me that sending the same request twice to the channel is hard to prevent. i tried adding guards basically everywhere (when getting the list of subsequent uris to add to the requests channel, before processing a request, ...) and i still seem to get duplicates. it might be a skill issue though. or maybe we should use an ArcMutex instead of just an Arc for the cache? i'll have to try that later. Finally, I'm a bit on the fence on this, but for now I added a .insert method to the Stats struct to prevent adding duplicate entries for the same URL, as a stopgap measure for the problem mentionned above. I'll get around to work on this again sometime this week (maybe), and I think i'll start testing by writing tests instead of checking manually because it's starting to become cumbersome, and this way i'll progress on the writing of tests too. By the way, I discovered tokio-console, and tried to use it to debug the backpressure bug. while it was not extremely helpful, it did help me see whether I still had some tasks doing stuff or if there was just nothing going on anymore. So right now I have some dirty git changes related to tokio-console setup (another dep and a call to a hook/setup function in main()). Do you want me to commit these (of course, by enabling the tracing for the dev profile only) Footnotes
|
match response.text().await { | ||
Err(_) => (status, vec![]), | ||
Ok(response_text) => { | ||
let links: Vec<_> = Collector::new(None, Some(Base::Remote(base_url))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: use the collector that's constructed around main
, since some flags are passed to it, while we are ouright ignoring a bunc of flags there
Thanks for the detailed update @ewen-lbh! Let me address your points:
I enjoy reading the progress reports. 👍 |
Closes #78
Current usage:
Example on a small site: (literally a friend's portfolio site because it's what i pulled off the top of my head haha)
The idea of the implementation is the following:
subsequent_uris
in most of the code)Arc<AtomicUsize>
is used to keep track of the number of remaining requests to check: (i guess this is similar to the Semaphore approach?)len(URIs)
What's WIP:
Vec<Uri>
for the subsequent uris? this gets passed around a lot, it's probably inefficient