executeJs can "hang" due to a crashed chrome renderer #255

Open

andynuss opened this issue Mar 22, 2024 · 6 comments

Comments

@andynuss

andynuss commented Mar 22, 2024

I have written my own waitForFrameLoad (also for the main page) which works in a unique way.
I came up with it because some test pages that used iframes in a variety of ways other
than through urls were failing in the normal frame.waitForLoad().

This is a function I am using to inject into the page with executeJs, on the assumption that
if it works, then the frame is nearly ready and so is the main page.

interface RawDom {
  dom?: { html: string; location: string };
  error?: string;
}

function rawDomCapture(): RawDom {
  const result: RawDom = {};
  try {
    const win: any = window;
    const doc = win.document;
    // outerHTML is undefined until the document element exists
    const html: string = doc.documentElement?.outerHTML;
    if (typeof html === 'string') {
      let location: string = win.location.href;
      if (location == null) location = '';
      result.dom = { html, location };
    }
  } catch (e) {
    result.error = '' + e.stack;
  }
  return result;
}

Then, in a loop, I inject this javascript snippet with the executeJs plugin (a rough sketch of the loop follows this list):

  • the first time immediately
  • if undefined is returned, by trial and error, I learned that the frame/page is definitely not ready
    • in this case, wait 2500 millis, and repeat
  • if there is an error, something bad has already happened, like a detected "disconnect" (becomes a retry error, sometimes)
  • if successful, then if I got the outerHtml, "diff" it with the previous version to ensure the frame/page has "stabilized"
    • obviously, if this is the 1st successful dom, then I have to do another iteration
    • if I have to try again because the diff "percent" was too big, then wait 4000 millis
  • in all cases, if I do get at least one successful dom, then wait no longer than the maximum timeoutMs that
    would have been passed to frame.waitForLoad()
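
To make that concrete, here is a minimal sketch of the loop, assuming the RawDom/rawDomCapture
definitions above; sleep, diffPercent, and STABLE_THRESHOLD are stand-ins for my own helpers,
and frame is a Hero FrameEnvironment with the ExecuteJsPlugin applied:

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// trivial stand-in for my real diff: percent change in capture length
function diffPercent(a: string, b: string): number {
  return (Math.abs(a.length - b.length) / Math.max(a.length, 1)) * 100;
}
const STABLE_THRESHOLD = 1; // percent (illustrative)

async function waitForStableDom(
  frame: any, // a Hero FrameEnvironment with the ExecuteJsPlugin applied
  timeoutMs: number,
): Promise<RawDom | undefined> {
  const deadline = Date.now() + timeoutMs; // never wait longer than timeoutMs
  let previousHtml: string | undefined;
  while (Date.now() < deadline) {
    const result = (await frame.executeJs(rawDomCapture)) as RawDom | undefined;
    if (result === undefined) {
      await sleep(2500); // frame/page definitely not ready: wait and repeat
      continue;
    }
    if (result.error) throw new Error(result.error); // e.g. a detected "disconnect"
    const html = result.dom?.html;
    if (previousHtml !== undefined && html !== undefined
      && diffPercent(previousHtml, html) < STABLE_THRESHOLD) {
      return result; // two captures close enough: the frame/page has "stabilized"
    }
    previousHtml = html; // 1st successful dom, or diff "percent" too big
    await sleep(4000);
  }
  return undefined; // hit the overall timeout without stabilizing
}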

Here's the bug: under load, and when using several concurrent Hero scrapers on an ec2 instance
with plenty of memory and cpu, it seems that my very aggressive use of the executeJs plugin somehow
triggers the ulixee chrome renderer to crash/disconnect, more often than normal. Another aspect of
this is that I am going after ALL iframes up to a depth of 2 (relative to the main frame), and that
when this behavior manifests, there are sometimes 50-100 frameIds that I have been aggressively
doing this executeJs trick with. Anyway, I wrote some nodejs code to use the shell command
ps -p {pid_list} to infer that the chrome "renderer" process was indeed crashing.
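
For reference, the check was along these lines (a rough sketch: ps -o pid= prints just the pids
that still exist; the pid list itself comes from watching my server's child processes):

import { execFileSync } from 'node:child_process';

// returns the subset of pids that are still alive, according to ps(1)
function alivePids(pids: number[]): Set<number> {
  try {
    const out = execFileSync('ps', ['-o', 'pid=', '-p', pids.join(',')], {
      encoding: 'utf8',
    });
    return new Set(
      out.split('\n').map((line) => parseInt(line, 10)).filter(Number.isFinite),
    );
  } catch {
    return new Set(); // ps exits non-zero when none of the pids exist
  }
}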

But the reason that I am reporting this as a bug that might be fixable is that frame.executeJs()
does not offer a timeout argument, and clearly my injection does very little, though perhaps
sooner than Hero is used to (!), so there should be no reason for this executeJs to "hang".

Again, the reason that it is hanging is that either the chrome "renderer" has already crashed,
or the act of doing this so many times for the same Hero instance causes the crash.

My solution was to use a "race" promise to throw an error right away, which I treat as follows:
rather than restarting my server, I hope that if I close Hero, your code will detect that the
chrome has crashed/disconnected, and things will recover. I immediately retry the scrape
and do not go after iframes so aggressively.
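
The wrapper is roughly this (executeJsWithTimeout and ExecuteJsTimeoutError are my own names,
not Hero APIs; frame is typed loosely to avoid the plugin typings):

class ExecuteJsTimeoutError extends Error {}

async function executeJsWithTimeout<T>(
  frame: any, // a Hero FrameEnvironment with the ExecuteJsPlugin applied
  fn: () => T,
  timeoutMs: number,
): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      reject(new ExecuteJsTimeoutError(`executeJs exceeded ${timeoutMs} ms`));
    }, timeoutMs);
  });
  try {
    // if the renderer has already crashed, this throws instead of hanging
    return await Promise.race([frame.executeJs(fn), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}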

So these might be the possible bugs:

  1. tell me that devtools just can't handle the crazy thing I am doing, and is much more likely to
    crash within or due to executeJs
  2. maybe Hero is better equipped to detect the agent crash than I am

NOTE: before I ramped up the load and discovered this problem, I did see this
exception being thrown at exactly 60 secs, but rarely: a ulixee TimeoutError
whose message is 'DevtoolsApiMessage did not respond after 60 seconds.'

Maybe that's what I am seeing as a hang, but 60 seconds seems to be quite
a long time to wait for this information.

By the way, at one point, I was hoping that for a given Hero instance with many many
iframes, I would be allowed to use my executeJs trick in parallel/independently.


I guess that there may be nothing at all to do in your codebase, but I wanted to report
this anyway, and learn more about what is going on under the covers: for example,
is there a way for me to reliably get the "pid" for the ulixee chrome renderer?
Also, what is this chrome renderer? Is there a 1:1 relationship between it and
a Hero instance? If this chrome renderer is NOT chrome, but is another ulixee process,
then I would think that it should NEVER crash/disconnect, especially in a way such
as this: one that I can easily force to happen within about 15 scrapes, especially if I
am going after media sites that are ad-heavy (and therefore with lots of those
unnecessary iframes).

At a minimum, it would be nice if Hero would provide a way for me to get the "pid"
of this crashable renderer (assuming I am reasoning correctly about this issue).

(BTW: I have ensured that the latest chrome itself is being installed on ec2 instances.
Is it being used at all? Because I'm not seeing it listed as a direct or indirect
child process of my server which embeds the hero code. Does the ulixee
chrome renderer app replace chrome in some way?)

The reason that I am treating this as a Hero bug for now is that the rate at which
this happens is at about the threshold where I cannot go to a larger ec2 instance
size, for fear that the retry rate (due to the sporadic crashed chrome renderer) would
mean that I was wasting the extra processing power.

@blakebyrnes
Contributor

Hi Andy, I think there's a single chrome renderer per incognito window that gets created. Hero by default will share 10 hero instances across a single chrome instance, but that's configurable with the maxHeroes... variables. You could experiment to see if those help at all. I don't know how to get those PIDs.

I agree we should add a timeout publicly to executeJs. Should be easy enough to fork and create your own version that does so. ExecuteJs is in many ways meant to just be a template to use as your starting point.

I suspect you're crashing because outerHTML at a top level requires chrome to re-render the entire page, and to do so, it has to lock the event loop and redraw. It's called a reflow. Fwiw, you're measuring something that is already done in hero. Hero is already recording all the individual dom changes and tracking them in the DomStateChanges table so that it can rebuild your page. You might find that creating a fork of Hero where you make those statistics available to your front-end more efficiently solves your problem. Would be open to a PR that does that too.

Hope that helps

@andynuss
Author

andynuss commented Mar 25, 2024

When you speak of maxHeroes, I need to tell you that I am doing this:

import { ConnectionToHeroCore } from '@ulixee/hero';
import HeroCore from '@ulixee/hero-core';
import { TransportBridge } from '@ulixee/net';

// module-level cache (declarations added for context)
let connection: ConnectionToHeroCore | undefined;
let core: HeroCore | undefined;
const maxAgents = 12; // as configured on this machine

export async function getCreateConnection(): Promise<ConnectionToHeroCore> {
  if (connection !== undefined) return connection;
  const bridge = new TransportBridge();
  const maxConcurrency = maxAgents;
  const connectionToCore = new ConnectionToHeroCore(bridge.transportToCore, { maxConcurrency });
  const heroCore = new HeroCore();
  heroCore.addConnection(bridge.transportToClient);
  connection = connectionToCore;
  core = heroCore;
  return connectionToCore;
}

... which is an exported function of my own that I use to create (the first time) a ConnectionToHeroCore
and cache it in a variable, and use it each time I want to create a Hero instance. So my maxAgents
is configured to 12 on this machine, and I strive to keep 10 concurrent scrapes happening. Each
Hero is what I call an "agent" (in the old terminology), and it does appear that there is one
chrome "renderer" PID per Hero created this way.

I think you call this "running on one server (full stack deployment)".

  1. So, maybe I shouldn't be using a separate Hero instance for each concurrent scrape session?
  2. Would you like me to give you the full path of the 10+ chrome "renderers"? Are they your code or Chrome's? I ask because their path is in ulixee's node_modules.
  3. It also appears that these "renderers" are supposed to be terminated when a Hero instance is closed, but sometimes they stay around "forever". More research to do on this.
  4. As for the "hang", it seems to be related not to a hang inside the renderer during executeJs, but to the fact that the "renderer" for this scrape has disappeared, I assume crashed; so the question becomes whose code is it, and what was the trigger?

That's why I was wondering: do I even need to have native chrome installed on my ec2 instance?

This is what I am doing when I create the AMI:

cat <<EOF | sudo tee /etc/yum.repos.d/google-chrome.repo > /dev/null
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/\$basearch
enabled=1
gpgcheck=1
gpgkey=https://dl-ssl.google.com/linux/linux_signing_key.pub
EOF

sudo yum install -y google-chrome-stable xorg-x11-server-Xvfb

Maybe this "real" chrome is not needed/used?

@blakebyrnes
Contributor

This is the feature I mean: https://ulixee.org/docs/hero/overview/configuration#core-start

The chrome you're installing won't be used by Hero, so it's kind of pointless unless you're pointing at it with the env vars, in which case it will occasionally not correctly mask the browser variables.

@andynuss
Author

When I looked at the link you gave me, it allows passing options to the HeroCore constructor for both
maxConcurrentClientCount and maxConcurrentClientsPerBrowser.
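
For what it's worth, as I read that page, they would be passed like this (values are
illustrative, not a recommendation):

import HeroCore from '@ulixee/hero-core';

const heroCore = new HeroCore({
  maxConcurrentClientCount: 12, // total concurrent Hero clients
  maxConcurrentClientsPerBrowser: 1, // would give one Hero per chrome
});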

Meanwhile, you can see that what I tried to do was modeled after the full stack deployment:
https://ulixee.org/docs/hero/advanced-concepts/deployment

and there I set neither of those options, but set maxConcurrency for new ConnectionToHeroCore().

The strange thing is that, peeking under the covers, whenever I created a new Hero(), I always
got a new chrome renderer.

Not sure if that is more or less advisable than sharing a single chrome across multiple concurrent urls,
that is, from an emulator standpoint; but from a crash standpoint, given that my ec2 instance has a 16 GB heap,
it made sense not to let heap issues and crash-inducing issues infect the other scraping sessions.


Speaking of the crashpad-handler, I verified that it is NOT hooked up on my mac, but IS hooked up
on the ec2 instance (linux), so I decided to grab the crashpad-handler pid from the launch command
of the first chrome renderer I see, and kill it immediately.

Obviously, it would be nicer if you could make sure that crashpad is not enabled, but regardless,
this is not related to the crashed chrome renderers that I am seeing.


On the crashed chromes, here are some observations, but first let me give you the Wait functions that I tried:

function LogIfNotTimeout(frameNode: FrameNode, err: any): void {
  if (err instanceof TimeoutError) return;
  const { id: frameId, isMain } = frameNode;
  const whichFrame = isMain ? 'main frame' : `frameId ${frameId}`;
  console.log(`unexpected error in waitForLoad for ${whichFrame}`, err);
}

async function WaitForDomLoad(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<boolean> {
  try {
    await frameNode.frame.waitForLoad('DomContentLoaded', { timeoutMs });
    return true;
  } catch (e) {
    LogIfNotTimeout(frameNode, e);
    return false;
  }
}

async function WaitForJavascriptReady(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<boolean> {
  // FIXME: this only works for url-based iframes.  Need to file an
  // issue for Blake for srcdoc, javascript protocol, etc
  try {
    await frameNode.frame.waitForLoad('JavascriptReady', { timeoutMs });
    return true;
  } catch (e) {
    LogIfNotTimeout(frameNode, e);
    return false;
  }
}

async function WaitForDomContentLoaded(
  frameNode: FrameNode,
  timeoutMs: number,
): Promise<void> {
  const { frame } = frameNode;
  let timeout: NodeJS.Timeout | undefined;
  const cancelPromise: Promise<void> = new Promise((resolve) => {
    timeout = setTimeout(() => {
      timeout = undefined;
      resolve();
    }, timeoutMs);
  });
  await Promise.race([cancelPromise, frame.isDomContentLoaded]);
  if (timeout) clearTimeout(timeout);
}

async function WaitForPause(
  frameNode: FrameNode,
  initialPauseMillis: number,
): Promise<void> {
  const { discoveredAt } = frameNode;
  const elapsed = Date.now() - discoveredAt;
  const sleepMs = initialPauseMillis - elapsed;
  if (sleepMs > 0) await sleep(sleepMs);
}

export async function waitForFrame(
  varAgent: VarAgent,
  scraperArg: ScraperArg,
  scraperState: ScraperTrace,
  frameNode: FrameNode,
  waitLoadMillis?: number,
): Promise<boolean> {
  const { isMain } = frameNode;
  if (isMain) {
    // for the main frame, wait for DomContentLoaded for a bit
    await WaitForDomLoad(frameNode, 15000);
  } else {
    await WaitForPause(frameNode, 7500);
  }
  if (waitLoadMillis == null) waitLoadMillis = isMain ? 60000 : 30000;
  // NOTE: WaitDomStable() is where I immediately start calling executeJs().
  const isStable = await WaitDomStable(varAgent, scraperArg, scraperState, frameNode, waitLoadMillis);
  return isStable;
}

  1. if my frameNode.isMain is true (derived from your FrameEnvironment object), I never get a crash
    so long as I do call WaitForDomLoad() before I call WaitDomStable().

  2. but for frames (not main), the analog, which I believe is WaitForJavascriptReady(), does significantly
    worse than WaitForPause(), and even worse than not doing any "pre-waiting" before calling the
    risky WaitDomStable().

  3. this seems to imply that WaitDomStable() (which I described before as immediately calling executeJs()
    to grab the outerHtml and reach "stability"), when called "too soon", actually triggers the crash.

  4. but this seems to relate to why I abandoned use of WaitForJavascriptReady(): with unit testing
    of all sorts of iframes that were not url-based, I found that WaitForJavascriptReady() ALWAYS crashed chrome
    when the page's iframes were non-url-based. (see below)

  5. likewise, for my non-url-based iframe test-suite (on my localhost test server), WaitDomStable() also
    always crashed the chrome renderer (executeJs() called "too soon"). This is why I settled on WaitForPause()
    as being the best I could do.

  6. here's a url that ALWAYS fails for me in WaitDomStable() for one of the iframes (seems to be the same one)
    even after WaitForPause(): https://www.haproxy.com/blog/haproxy-is-not-affected-by-the-http-2-rapid-reset-attack-cve-2023-44487

With real pages, especially media sites like bbc.com and theguardian.com, my guess is that they are injecting
both ad-frames and "noise" frames that typically do not use direct urls, because this technique prevents
adblock from doing anything about it. So these are the types of iframes I am speaking of that, based
on my localhost testing, I am sure can trigger the crash (either in executeJs() or frame.waitForLoad('JavascriptReady', { timeoutMs })):

  • srcdoc iframes
  • an iframe where the src is not a url but a javascript: protocol containing the frame's ENTIRE code
  • an iframe that is essentially an about:blank frame where the main page's onload event injects javascript
    • this may include not only an iframe that is initially about:blank, but one without a src attribute at all, etc.

My overall error rate is about 1 in 7, meaning that with the waitForFrame() function that you see above,
and given that I usually try to scrape child frame doms (except for obvious domains like github.com),
about 1 out of 7 pages causes the crash inside waitForFrame(). Many of these will work if attempted
a 2nd or 3rd time, even still going after the iframes. But since speed is important, I chose to
not get any iframe doms via executeJs() on the 2nd attempt.
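
Roughly, the retry policy looks like this (scrapePage is a stand-in for my actual scrape entry point):

declare function scrapePage(
  url: string,
  opts: { includeIframes: boolean },
): Promise<void>;

async function scrapeWithRetry(url: string, maxAttempts = 3): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // only go after iframe doms via executeJs() on the very first attempt
    const includeIframes = attempt === 1;
    try {
      await scrapePage(url, { includeIframes });
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err; // out of retries
    }
  }
}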

@andynuss
Author

andynuss commented Mar 28, 2024 via email

@andynuss
Author

1.html.txt
about.html.txt
emptysrc.html.txt
emptysrc2.html.txt
javascript.html.txt
nosrc.html.txt
srcdoc.html.txt

Just put these files into some folder of a localhost web server, and strip the .txt extension,
and then try to scrape the outerHtml of all available frames via executeJs() (for all but the first).

blakebyrnes added a commit to ulixee/unblocked that referenced this issue Mar 30, 2024
soundofspace pushed a commit that referenced this issue Sep 13, 2024