executeJs can "hang" due to a crashed chrome renderer #255
Comments
Hi Andy, I think there's a single chrome renderer per incognito window that gets created. Hero by default will share 10 hero instances across a single chrome instance, but that's configurable with the maxHeroes... variables. You could experiment to see if those help at all. I don't know how to get those PIDs.
I agree we should add a timeout publicly to executeJs. Should be easy enough to fork and create your own version that does so. ExecuteJs is in many ways meant to just be a template to use as your starting point.
I suspect you're crashing because outerHTML at a top level requires chrome to re-render the entire page, and to do so, it has to lock the event loop and redraw. It's called a reflow.
Fwiw, you're measuring something that is already done in hero. Hero is already recording all the individual dom changes and tracking them in the DomStateChanges table so that it can rebuild your page. You might find that creating a fork of Hero where you make those statistics available to your front-end more efficiently solves your problem. Would be open to a PR that does that too.
Hope that helps |
When you speak of maxHeroes, I need to tell you that I am doing this:
... which is an exported function of my own that I use to create (first time) a ConnectionToHeroCore. I think you call this "running on one server (full stack deployment)".
That's why I was wondering: do I even need to have native chrome installed on my ec2 instance? This is what I am doing when I create the AMI:
Maybe this "real" chrome is not needed/used? |
This is the feature I mean: https://ulixee.org/docs/hero/overview/configuration#core-start The chrome you're installing won't be used by Hero, so kind of pointless unless you're pointing at it with the env vars, in which case, it will occasionally not correctly mask the browser variables |
When I looked at the link you gave me, it allows options to the HeroCore constructor for both
Meanwhile, you can see that what I tried to do was modeled after the full stack deployment: and there I set neither of those options but set maxConcurrency for new ConnectionToHeroCore().
The strange thing is that peeking under the covers, whenever I created a new Hero(), I always
Not sure if that is more or less advisable than sharing a single chrome across multiple concurrent urls,
Speaking of the crashpad-handler, I verified that it is NOT hooked up on my mac, but IS hooked up
Obviously, it would be nicer if you could make sure that the crashpad is not enabled, but regardless,
On the crashed chromes, here are some observations, but first let me give you some Wait functions that I tried:
With real pages, especially media sites, like bbc.com and theguardian.com, my guess is that they are injecting
My overall error rate is about 1 per 7, meaning that with the waitForFrame() function that you see above, |
Hi Blake,
I updated this issue with some more discoveries about the trigger for the hang
due to a chrome crash caused by executeJs() being called (too soon) for an iframe,
in the hopes that it is a bug that you can actually fix or workaround.
Let me know if you would like my localhost examples of non-url based child frames.
Andy
1.html.txt Just put these files into some folder of a localhost web server, and strip the .txt extension, |
I have written my own waitForFrameLoad (also for the main page) which works in a unique way,
which I came up with because some test pages that used iframes in a variety of ways other
than through urls were failing in the normal
frame.waitForLoad().
This is a function I am using to inject into the dom with executeJs, on the assumption that
if it works, then the frame is nearly ready and so is the main page.
Then, in a loop I inject this javascript snippet with the executeJs plugin:
would have been passed to
frame.waitForLoad()
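The original snippet is not shown here, so the following is only a rough sketch of what such a polling loop could look like. `pollUntilReady`, `probe`, `intervalMs` and `maxAttempts` are hypothetical names; `probe` stands in for whatever call injects the snippet into a frame (for example a wrapper around frame.executeJs) and reports whether it ran.

```typescript
// Hypothetical helper: repeatedly runs `probe` until it reports readiness
// or the attempt budget is used up. Errors thrown by `probe` are NOT
// swallowed, so a failing injection surfaces to the caller.
async function pollUntilReady(
  probe: () => Promise<boolean>,
  opts: { intervalMs: number; maxAttempts: number },
): Promise<boolean> {
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    if (await probe()) {
      return true; // the injected snippet ran, so the frame is assumed ready
    }
    // Wait before the next injection so the frame is not hammered in a tight loop.
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
  return false; // gave up: the caller decides whether to skip this frame
}
```

Bounding the number of attempts per frame keeps the total number of injections predictable when a page has 50-100 frames.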
Here's the bug: under load, and when using several concurrent Hero scrapers on an ec2 instance
with plenty of memory and cpu, it seems that my very aggressive use of executeJs plugin somehow
triggers the ulixee chrome renderer to crash/disconnect, more so than normal. Another aspect of
this is that I am going after ALL iframes up to a depth of 2 (relative to the main frame), and that
when this behavior manifests, it appears that there are sometimes 50-100 frameIds that I have
been aggressively doing this executeJs trick with. Anyways, I wrote some nodejs code to use
the shell command
ps -p {pid_list}
to somewhat infer that indeed, the chrome "renderer" process was crashing.
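A minimal sketch of that kind of check, assuming the PIDs of interest have already been collected somehow (the thread does not establish how to obtain them from Hero). `findDeadPids`, `parsePids` and `runPs` are hypothetical names; the only external command used is `ps -p <list> -o pid=`, and an exit status of 1 is treated as "none of the listed PIDs exist".

```typescript
import { execFileSync } from 'child_process';

type PsRunner = (pids: number[]) => string;

// Runs `ps -p pid1,pid2,... -o pid=` and returns its stdout.
// `-o pid=` prints only the PID column, without a header line.
const runPs: PsRunner = (pids) => {
  try {
    return execFileSync('ps', ['-p', pids.join(','), '-o', 'pid='], { encoding: 'utf8' });
  } catch (err) {
    const e = err as { status?: number; stdout?: string };
    // Assumption: status 1 means no listed PID matched a running process.
    if (e.status === 1) return typeof e.stdout === 'string' ? e.stdout : '';
    throw err; // anything else (e.g. ps not installed) is a real error
  }
};

// Extracts the PIDs that ps reported, one per output line.
function parsePids(psOutput: string): Set<number> {
  const alive = new Set<number>();
  for (const line of psOutput.split('\n')) {
    const pid = Number.parseInt(line.trim(), 10);
    if (!Number.isNaN(pid)) alive.add(pid);
  }
  return alive;
}

// Returns the PIDs from `pids` that ps no longer lists.
function findDeadPids(pids: number[], run: PsRunner = runPs): number[] {
  if (pids.length === 0) return [];
  const alive = parsePids(run(pids));
  return pids.filter((pid) => !alive.has(pid));
}
```

A PID that disappears between two checks only suggests that the process exited; it does not say why.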
But the reason that I am reporting this as a bug that might be fixable is that frame.executeJs()
does not offer a timeout argument, and clearly my injection does very little, though perhaps
sooner than Hero is used to (!), so there should be no reason for this executeJs to "hang".
Again, the reason that it is hanging is that either the chrome "renderer" has already crashed,
or the act of doing this so many times for the same Hero instance causes the crash.
My solution was to use a "race" promise to throw an error right away, which I treat as follows:
rather than restarting my server, I hope that if I close Hero, your code will detect that the
chrome has crashed/disconnected, and things will recover. I immediately retry the scrape
and do not go after iframes so aggressively.
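A minimal sketch of such a "race" wrapper, not the exact code used here. `withTimeout`, `label` and the error name `ExecuteJsTimeout` are hypothetical; the wrapped promise would be the pending frame.executeJs() call.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying call.
async function withTimeout<T>(work: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} did not settle within ${ms} ms`);
      err.name = 'ExecuteJsTimeout'; // hypothetical name, used to tell timeouts apart
      reject(err);
    }, ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid a stray timer once the race is decided
  }
}

// Hypothetical usage, assuming `frame` and `snippet` exist in the caller:
//   await withTimeout(frame.executeJs(snippet), 5000, 'executeJs');
```

Because the original call is left pending, the caller still needs to close the Hero instance after a timeout, as described above.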
So these might be the possible bugs:
crash within or due to executeJs
NOTE: before I ramped up the load and discovered this problem, I did see this
exception being thrown at exactly 60 secs, but rarely: a ulixee TimeoutError
whose message is 'DevtoolsApiMessage did not respond after 60 seconds.'
Maybe that's what I am seeing as a hang, but 60 seconds seems to be quite
a long time to wait for this information.
By the way, at one point, I was hoping that for a given Hero instance with many many
iframes, I would be allowed to use my executeJs trick in parallel/independently.
I guess that there may be nothing at all to do in your codebase, but I wanted to report
this anyways, and learn more about what is going on under the covers: for example,
is there a way for me to reliably get the "pid" for the ulixee chrome renderer?
Also, what is this chrome renderer? Is there a 1:1 relationship between it and
a Hero instance? If this chrome renderer is NOT chrome, but is another ulixee process,
then I would think that it should NEVER crash/disconnect, especially in a way such
as this: that I can easily force to happen within about 15 scrapes, especially if I
am going after media sites that are ad-heavy (and therefore with lots of those
unnecessary iframes).
At a minimum, it would be nice if Hero would provide a way for me to get the "pid"
of this crashable renderer (assuming I am reasoning correctly about this issue).
(BTW: I have ensured that the latest chrome itself is being installed on ec2-instances.
Is this being used at all? Because I'm not seeing it listed as a direct or indirect
child process of my server which embeds the hero code. Does the ulixee
chrome renderer app replace chrome in some way?)
The reason that I am seeing this as a Hero bug for now, is that the rate at which
this happens is at about the threshold where I cannot go to a larger ec2 instance
size for fear that the retry rate (due to the sporadic crashed chrome renderer) would
mean that I was wasting the extra processing power.