Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pulsar-next] Figure out file-watcher approach that avoids crashes #1104

Open
5 tasks done
savetheclocktower opened this issue Sep 23, 2024 · 12 comments
Open
5 tasks done
Assignees
Labels
bug Something isn't working pulsar-next Related to the version of Pulsar that runs on the latest Electron

Comments

@savetheclocktower
Copy link
Contributor

savetheclocktower commented Sep 23, 2024

Thanks in advance for your bug report!

  • Have you reproduced issue in safe mode?
  • Have you used the debugging guide to try to resolve the issue?
  • Have you checked our FAQs to make sure your question isn't answered there?
  • Have you checked to make sure your issue does not already exist?
  • Have you checked you are on the latest release of Pulsar?

What happened?

As far as I can tell, nsfw is the most widely used library for listening to filesystem events. Atom has implemented a couple of their own , and though it's tempting to try to adapt one of those to the exact needs of modern Electron, it's not much fun.

The latest version of nsfw is almost perfect, but it still manages to crash Pulsar on window reload about half the time. After some research, this seems to happen because:

  • We set up a NativeWatcher in src/path-watcher.js on startup
  • When we reload the window, we call dispose and call stop on the inner nsfw watcher, which is an async operation that returns a promise; because of reasons explained below, this often starts a different watcher
  • The async operation of the start method outlives the web page that initiated it; when it resolves, it does something that causes a crash, presumably because it's trying to access something that no longer exists

I can think of a few ways to handle this:

  • Get good enough at C++ debugging to discover exactly why it's crashing; I'm confident nsfw’s maintainer would accept a patch.
  • Delay the unloading of the current page until the stop method resolves; I've hacked together a proof of concept and it actually works, but it's a bit impractical in the general case.
  • Find or write a magical library that can handle all our file-watching use cases with a flexible API. (I kid a bit, but we've failed to unify everything under a single file-watcher API in the past precisely because we have several different APIs to support.)
  • Don't start new listeners while the AtomEnvironment instance is being destroyed. (This is what we've implemented for now.)

Pulsar version

pulsar-next

Which OS does this happen on?

🍎 macOS

OS details

Sonoma 14.7

Which CPU architecture are you running this on?

Apple M-series

What steps are needed to reproduce this?

On PulsarNext…

  • Reload the window
  • It'll probably crash
  • If it doesn't, try a few more times

Additional Information:

No response

@savetheclocktower savetheclocktower added bug Something isn't working pulsar-next Related to the version of Pulsar that runs on the latest Electron labels Sep 23, 2024
@savetheclocktower savetheclocktower self-assigned this Sep 23, 2024
@savetheclocktower
Copy link
Contributor Author

Delay the unloading of the current page until the stop method resolves; I've hacked together a proof of concept and it actually works, but it's a bit impractical in the general case.

To elaborate on this:

I had a theory that nsfw wouldn't crash the renderer if it could guarantee disposal before the page unloaded. Since calling stop on a watcher is an async operation, this isn't elegant, since the whole page-unload mechanism (with the beforeunload event) predates promises and async in general.

Still, I tried this:

function unloadDeferralListener(event) {
  event.preventDefault();
  console.log('Waiting for unload…');
  NativeWatcher.waitForStop().then(() => {
    console.log('…done.');
    window.removeEventListener('beforeunload', unloadDeferralListener);
    location.reload();
  });
}
class NativeWatcher {
  static promises= new Set();
  static initialized = false;

  static async waitForStop() {
    let { promises } = NativeWatcher;
    if (promises.size === 0) {
      return Promise.resolve();
    }
    console.log('Waiting for', promises.size, 'promises');
    await Promise.allSettled(promises);
  }

  static initialize() {
    if (this.initialized) return;
    addEventListener('beforeunload', unloadDeferralListener);
    this.initialized = true;
  }

  // ...
}

The static NativeWatcher.initialize method is needed because addEventListener isn't yet available upon initial evaluation of this file.

In the stop method, I proceed as before, except that I add any pending stop operation to NativeWatcher.promises just after creation, and remove it from the same set when the promise is fulfilled.

This is hacky, but it works. It's hacky because it'd probably fall apart as a strategy if more than one thing on the page assumed it could use this technique.

But it's good as a best-alternative. If we fail to solve this in a more elegant fashion, I could zhuzh this up by creating a more formal internal API around it so that it could safely be used by more than one thing.

@savetheclocktower
Copy link
Contributor Author

Good news! My horrible idea from the previous comment won't be needed.

The specific crash I've been seeing is the fault of our path-watching strategy:

  • Suppose we want to watch /foo/bar and /foo/baz.
  • The watchers are requested in that order. At first, we watch /foo/bar directly, but once we see that /foo/baz wants to be watched as well, we combine them into one watcher that watches /foo.
  • If the /foo/bar watcher were stopped, we'd react by creating a new watcher for /foo/baz specifically. This is one of several ways in which disposing of one watcher can trigger the creation of another.

It took me a while to process the fact that the crash seemed to be happening during a call to watcher.start, not watcher.stop. I figured that was because the new page was initializing and running into some resources that were still locked or unfinalized or something from the previous run, but eventually I realized that those calls to watcher.start were happening on the original page while it was being unloaded.

Why? Because we were disposing of watchers en masse during page unload. None of those watchers realized that the point was to disable all watchers, so they tried to start new ones in a terminating renderer environment.

This is fixable simply by setting a flag at the top of PathWatcher::dispose that watchers can check to know whether they should reattach. If the whole PathWatcher is being disposed, there's no point in creating new watchers. A more thorough fix would envision that watchers tend to be disposed in batches, meaning we should consider not creating a new watcher before we know if it'll even be needed — but that's a lot harder.


So this is the right answer for now; setting that flag seems to consistently prevent renderer process crashes. Still, it's not great to know that there's a way for the renderer process to crash, even if we're using entirely thread-safe native libraries!

In fact, it happens because Node has no other way of signaling that something's gone wrong at that point:

By default, throwing an exception on a terminating environment (eg. worker threads) will cause a fatal exception, terminating the Node process. This is to provide feedback to the user of the runtime error, as it is impossible to pass the error to JavaScript when the environment is terminating. The NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS preprocessor directive can be defined to bypass this behavior, such that the Node process will not terminate.

The ideal situation would be for nsfw to be able to recognize when it's being asked to watch files in a terminating environment and silently fail. The next-to-ideal situation would be for nsfw to define the NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS directive in its binding.gyp.

I've filed an issue about this with nsfw, but if they don't decide to do either of these things, then one option for us would be to maintain a “fork“ of nsfw whose only divergence is defining that directive in binding.gyp. In my tests, that directive was enough to prevent the crashes even with no other changes.

It's not a high-priority thing, but it might be something we start thinking about if we find that a community package is using atom.watchPath sloppily.

@savetheclocktower
Copy link
Contributor Author

Good news! nsfw accepted my PR adding NODE_API_SWALLOW_UNTHROWABLE_EXCEPTIONS. I'll keep this open until they put out a new release with this change.

@savetheclocktower
Copy link
Contributor Author

savetheclocktower commented Oct 12, 2024

There's a second half of this that I failed to mention.

The old node-pathwatcher library is something that we'd love to eliminate our dependency on. In an ideal world we'd use nsfw for everything, but there's a catch: Atom/Pulsar have long exported the File and Directory classes provided by node-pathwatcher as-is.

Those classes are thin, pleasant abstractions around file reading/writing and directory iteration, but they also accept callbacks like onDidChange and onDidRename. That's where the trouble arises. You'd think it would be easy to adapt those classes to use nsfw, but it isn't; nsfw goes async in inconvenient places, whereas node-pathwatcher doesn't.

It's not a big deal to ask people to wait briefly when setting up file monitoring, but the issue is that there isn't an easy way to add asynchronicity to an existing API that lots of unmaintained packages rely on.

So there are a few options:

  1. Move File and Directory to use nsfw as elegantly as possible, accepting that it'll be a slight renegotiation of the API contract and hoping the impact is minimal.
  2. Modernize node-pathwatcher to be not only context-aware (as has been done by @mauricioszabo already in a branch) but context-safe (i.e., able to be used in several contexts at once without crashing).
  3. Find another library of any sort to build around. This isn't easy; we'd need something that both (a) has a synchronous watching API and (b) detects file renames (which are more heuristic than exact science).

I've spent more than a week on option 2 and made quite a bit of progress despite it being a C++-heavy exercise. It's a huge headache to rewrite the native bindings for a library that largely eschewed even what was provided to it by nan — preferring, bizarrely, to work directly with libuv and v8 wherever possible. (Who knows — maybe that actually made sense in 2014!)

The hardest part has been Windows support. I can get it to pass locally in almost all cases, but a few edge cases elude me, and I get different results locally than I do in CI.


So the best option might actually be to wrap another library. I just discovered efsw — a filesystem watcher written in C++. There's an NPM library that wraps it, but it uses nan and hasn't been updated in four years, so I'd most likely be writing another wrapper around it. But that's much less daunting than what I've been doing: rewriting an ad-hoc C++ filesystem-event-detection library that was last meaningfully changed a decade ago.

Amazingly, this might be the simplest option of them all; fingers crossed.

@mauricioszabo
Copy link
Contributor

@savetheclocktower a question about the problem you had with combining paths into a single watcher: was that in NSFW, right?

I'm asking because, for me, it would be amazing if we had one path watching solution, not two as we have today (like PathWatcher and NSFW). So if we could, somehow, make efsw available to Node, maybe we could eliminate NSFW completely, or at least keep the option of using it or not?

Some advantages that I see on using the same pathwatcher everywhere:

  1. We could reuse the same watching mechanism - basically, currently we can watch the same file multiple times, right? If we keep it centralized, maybe it'll be lighter on system resources?
  2. We have only one place to look for bugs, if they appear
  3. We might be able to use this "combine watchers" technique everywhere, again making it lighter for system resources and avoiding the whole "can't watch path" error that sometimes occur

Am I on the right track, or is this not possible for reasons you found out while investigating this issue?

@savetheclocktower
Copy link
Contributor Author

savetheclocktower commented Oct 12, 2024

Am I on the right track, or is this not possible for reasons you found out while investigating this issue?

The one issue I had with nsfw is the thing that they accepted a PR for, so I'm good with it. The behavior that caused that bug was on our side and involved us creating new watchers during environment teardown.

It's by far the most popular filesystem watcher in Node that meets our needs, so I feel much more confident rolling with nsfw than anything I could write, even something that wraps its own robust C++ library.

The upside of efsw is that it can shim this one weird API that we need in this one place; any more and we end up maintaing our own full-fledged watcher library and that's the last thing I want to do. The Atom folks went through like three iterations of that idea themselves and I would love to learn from their travails.

@savetheclocktower
Copy link
Contributor Author

OK, my best effort so far is this branch. It defers to efsw for the vast majority of file-change-detection logic; that library uses inotify on Linux, FSEvents (instead of the kqueue we'd been using) on macOS, and whatever API is used on Windows for this sort of thing.

There are some subtle differences in what efsw expects — for instance, it works much better when it's given real, un-symlinked paths on disk, and it only watches directories rather than files — but those are smoothed over rather easily either in the native code or in the JavaScript code that wraps around the native bindings.

I've got CI that tests the library in a multi-context environment (worker threads) in addition to the existing suite of specs, and everything is green right now. The only concession I really had to make in the specs was to introduce a small delay after each test; without it the tests ran together too much (since most tests target the same watched file/directory) and caused flakiness.

Using it in an Electron render process is still slightly painful. pathwatcher has always behaved such that, if you're watching even a single path on disk, it maintains an open handle and your script won't exit. This is typical for Node; for instance, it's also why your script won't exit if there's a pending task scheduled via setTimeout.

But it means you've got to be careful about unwatching everything in the case of a window reload! The equivalent scenario in a web browser — a not-yet-fulfilled setTimeout — doesn't prevent a page from reloading or closing, and that's probably the right move in the general case. The pain I suffered here is not 100% clear to me, but my best guess is that an open handle will simply “leak” into the next page reload process; it would explain why I was able to run Window: Reload once, but not a second time. (The second attempt caused it to hang — stuck forever waiting on a handle that won't ever close.)

The fix is pretty simple, as it turns out: be sure to call PathWatcher.closeAllWatchers as part of the window teardown process, ideally as early as possible. It worked when I attached it directly as a beforeunload listener, for instance. So far I've had success putting it at the top of unloadEditorWindow. If this turns out not to be enough in some cases, we could even make closeAllWatchers an async operation that allows time for all pending file-change callbacks to fire; sleeping for 50ms would probably be enough. And we could put it into AtomEnvironment::reload and only call this.applicationDelegate.reload() once we had awaited closeAllWatchers.

You might ask “how did pathwatcher ever work if it didn't do this, then?” — and I'm not sure I have an answer for you. Opening a project implicitly creates a bunch of file-watching operations; closing or reloading the same window will clean up nearly all of those operations… except for the ones that relate to buffers that have not yet been closed. It's legit to leave some buffer windows open, since we want to preserve the state of the project when it's reopened later.

I tried to add some logging to the existing stable pathwatcher in an effort to understand this better — surely it's doing its own cleanup somehow, right? I couldn't figure it out. One of my theories is that Electron’s behavior has changed after embracing Chromium’s process reuse model, and that perhaps even what pathwatcher has long been doing wouldn’t have worked anyway even if it could get around the whole context-aware thing.

In the meantime, we'll have to make that change in our cleanup hygiene (an early call to closeAllWatchers) once we move to this newer version of pathwatcher. For now I'm just looking forward to not having to touch C++ for a while.

@savetheclocktower
Copy link
Contributor Author

Everything was going great with the patchwatcher refactor… and then I realized that efsw is quite profligate when it comes to FSEvents. fseventsd on macOS apparently has a limit on concurrent subscribers, one that I ran up against today.

When trying to figure out why the tests suddenly weren't passing, I opened up Console.app and found a new error message I hadn't seen before:

error	17:33:24.577894-0700	fseventsd	too many clients in system (limit 1024)

This isn't a limit that I could see us hitting under normal circumstances… except that each individual request to watch a path counts as a single “client” in efsw. There's no good reason for this! Even a single FSEvents client can watch multiple paths, and (I believe) can change the set of paths it watches at any point after creation! It's just wasteful.

Right now, I think efsw is using one watcher per path that we ask it to watch. The good news is that, with a refactor, I bet I can get this down to 1 watcher per window.

Another possibility would be some sort of automatic failover to kqueue — which is the strategy that pathwatcher previously used, and which efsw also supports (more or less). I'll wade back into it and see if I can figure this out.

@savetheclocktower
Copy link
Contributor Author

The problem with kqueue and efsw is that it takes a full half-second after a watcher is added before it starts reporting events. This is not a problem we seemed to have with our earlier pathwatcher implementation that used kqueue.

Since this is only a problem on macOS, I've got a nuclear option in my back pocket: we could sidestep efsw entirely on macOS and replace it with my earlier solution from the n-api branch. This would be annoying for a few reasons (I really don't understand kqueue at all!) but would be closoest to the weird, reliable code we've had in pathwatcher for years.

The other option is one that I'm pretty bullish on: there's a weird piece of code in Pulsar whose purpose is to minimize the number of native file-watchers needed for our built-in file-watcher API that lives at src/path-watcher.js (which we've already converted to use nsfw. The idea is that, since native file watchers can watch recursively, it's better to consolidate watchers together (where possible) instead of spawning one for each path you want to watch. A watcher at ~/Code/foo/bar/baz and one at ~/Code/foo/bar/thud could be replaced by one watcher that listened at ~/Code/foo/bar. The wrapper code around the watchers would be smart enough to ignore any events that happen outside the path that each watcher cares about.

There's a huge opportunity here because are watchers are needlessly redundant. On a simple project I observed 91 watchers during a test — albeit a test in dev mode where a lot of packages‘ asset directories were being monitored for changes via dev-live-reload. In extreme circumstances I bet we could get that as low as two: one watcher for the root of the project, plus one watcher for ~/.pulsar (to detect changes to config.cson and styles.less, as well as any changes that happen inside ~/.pulsar/dev/packages).

I've copied the NativeWatcherRegistry class over to pathwatcher and am hacking on it now. It feels more complicated than it needs to be and it's more async by default than I want, but I think I can get it to work properly once I understand it better.

@savetheclocktower
Copy link
Contributor Author

savetheclocktower commented Oct 25, 2024

Preliminary results are positive. The 91 native watchers I noticed in a test window a few days ago are, with this new approach, already down to 13.

This can surely be streamlined further; I've just got to figure out how to do it. There's lots of redundant watching of all entries inside ~/.pulsar/dev/packages/x, ~/.pulsar/dev/packages/y, etc., that isn’t automatically upgraded to a single watcher on ~/.pulsar/dev/packages. But this is good enough to ship for now.

One of the side-effects of this approach in efsw is that there seem to be lots more “weird” filesystem events (on macOS, at least) than there were before, probably because efsw is now watching recursively instead of one level deep. Often a test will create a file, wait a brief period of time, set up a watcher on that file… and immediately get the file's creation event, even though it predates the watcher! Even if we make sure the file exists before we proceed!

Luckily, weird things like that can be filtered out pretty easily: in the C++, record a timestamp when you start watching for a given path, and then you'll be able to consult that timestamp later. If the file's own creation time predates the time you started watching, ignore its create event! (You can do the same with last-modified times on files when they fire change events.) This filtering seems only to be needed for FSEvents (which is reputationally chatty and a bit ditzy), so I don't have to bother with the headaches of writing Windows-equivalent code for this.

Anyway, this should be good enough in the near term. I'm sure there are some quirks that make it different from old pathwatcher, but we'll discover those as we test.

@savetheclocktower
Copy link
Contributor Author

This has been a time sink!

Updating pathwatcher was a tall task from the start. It largely broke out of the nan abstraction and worked with libuv directly. And its approaches to filesystem watching are a bit out-of-fashion and are worth updating (where possible) while we’re visiting this code.

The efsw approach would be absolutely perfect if one of the following two things weren’t true:

  1. efsw’s kqueue implementation (one option for file-watching on macOS) takes 500 milliseconds before it starts detecting filesystem events, even though current pathwatcher uses kqueue and doesn’t seem to suffer from this.
  2. efsw’s implementation of FSEvents (the modern, preferred option for file-watching on macOS) creates one stream for each watched folder, which makes it very easy to run afoul of the system’s enforced watcher limit. efsw does not currently detect that a stream may fail to start for this reason, so it’ll silently fail.

This is frustrating because efsw is otherwise bulletproof! My initial attempt to refactor pathwatcher worked great on macOS, but was flaky on Linux and Windows. My current attempt has the opposite problem: it’s been reliable on Linux and Windows for a while, but the last week of work has been spent fixing issues with the macOS implementation.


I’ve proceeded with the work described above and introduced new strategies for consolidating native watchers:

  • If you add a watcher on /foo/bar/baz, and we see that one is active at /foo/bar/thud, we’ll create a new native watcher at /foo/bar and make both watchers share it under the hood.
  • If you have watchers at /foo and at /foo/bar/baz, they’ll share a native watcher at /foo, no matter the order in which they were created.
  • If, in the above scenario, you remove the watcher at /foo, the native watcher will switch from /foo to /foo/bar/baz so that it can be more efficient.

These strategies reduce the number of native watchers we use, but they vastly complicate the codebase and present hard trade-offs that force us to use imperfect heuristics.

Note

Some definitions might be helpful here:

  • A native watcher is a watcher of a single path in efsw. This may or may not correlate to the single use of a function provided by the OS, depending on how efsw implements it. We wrap it on the JavaScript side with an instance of NativeWatcher.
  • A watcher is a single PathWatcher instance in JavaScript. It watches a single file or folder and is backed by a single instance of NativeWatcher that may or may not be watching the exact path it cares about. It needs to do some amount of filtering in many cases by ignoring irrelevant events reported by the native watcher. As a result, one NativeWatcher is able to provide events to many PathWatchers.

Suppose you have two watchers: one at /foo/a/b/c/d/e, and one at /foo/r/s/t/u/v. Those two watchers can share a native watcher declared at their common ancestor /foo, but that could be costly. Whenever a watcher and a native watcher aren’t watching the same path (or when a native watcher is recursive, since watchers aren’t recursive themselves), we put more of a burden on the watcher by telling it about a bunch of events that may or may not apply to its watcher. It does the job of filtering out the events that don’t match its exact path.

Sharing native watchers increases this burden; that’s the main trade-off. For this reason, there’s a heuristic in place that would prevent the scenario in the above paragraph; the shared ancestor directory would be rejected as too distant. Without a heuristic like this, all watchers on the same volume would eventually be consolidated into a single native watcher at the root of the filesystem, and that’s silly.

And, since file-watching APIs vary greatly in design across platforms, it’s ill-advised to use one fixed strategy across all platforms. If native watchers are to be reused at all, then they all must be initialized as recursive in efsw; but Linux’s inotify API doesn’t do recursion! efsw emulates it by recursively calling inotify_add_watch at each level below the watched path. Our motive for native watcher reuse was to reduce the burden on the OS, but on Linux the cure is worse than the disease. For this reason, we'd want to configure the registry not to share native watchers at all on Linux.

Windows is a bit different — every watcher created via ReadDirectoryChangesW can be configured as recursive or not. But without any specific evidence that suggests we should reuse watchers on Windows, it’s much easier for us to disable them entirely and avoid the headaches of reuse.


The unit tests are passing and I haven’t had to renegotiate pathwatcher’s API contract, but I’m not satisfied. This feels like a Rube Goldberg machine and there’s got to be a better way.

For one thing: efsw is the cause of some of our pain here. a single stream created by FSEventStreamCreate lets you monitor any number of arbitrary paths on disk, but each one creaed by efsw is in charge of monitoring only one path. I’ve written a lot of JavaScript in an effort to curtail this wastefulness, but this fix really needs to happen in efsw. I’ve filed a bug. Architecturally it’d be a challenging refactor for efsw, and I’m certainly not the best one for the job, but we’ll see what happens.

In the meantime, there are other options. All of them probably involve reverting to my solution from about a week ago before I brought the registry into the picture, since that was the most reliable solution for non-macOS platforms.

The options for macOS are as follows:

  • Eschew efsw altogether on macOS and replace it with something that’s as similar as possible to the kqueue-backed implementation that pathwatcher uses currently. Intuitively, I feel that FSEvents is the better solution, but it’s hard to square that with the idea that pathwatcher has used kqueue from the beginning and hasn’t caused us problems.
  • Eschew efsw altogether on macOS and replace it with something that uses the FSEvents API directly in the manner that I envision: emphasis on fewer watchers that each watch multiple paths.

There’s one other thing I’d love to fix: right now, every single filesystem event processed in the C++ bindings involves invoking a callback defined in JavaScript. It’s a no-brainer to batch events here (where possible) to reduce the cost of going across that bridge. nsfw does this, and the watcher API built into Pulsar envisions it as well. This is more complex on the C++ side, but it’d be irresponsible of me to kick this can down the road instead of investigating it while I’ve got the hood up.

@savetheclocktower
Copy link
Contributor Author

I went with option 2…

Eschew efsw altogether on macOS and replace it with something that uses the FSEvents API directly in the manner that I envision: emphasis on fewer watchers that each watch multiple paths.

…and it's gone beautifully. I'll try this out in Pulsar tomorrow and hopefully bring an end to this saga.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pulsar-next Related to the version of Pulsar that runs on the latest Electron
Projects
None yet
Development

No branches or pull requests

2 participants