Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fseventsd “too many clients in system” error #185

Open
savetheclocktower opened this issue Oct 27, 2024 · 3 comments
Open

fseventsd “too many clients in system” error #185

savetheclocktower opened this issue Oct 27, 2024 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@savetheclocktower
Copy link

savetheclocktower commented Oct 27, 2024

I ran into a situation where efsw was silently failing to notice filesystem events on my Mac and I couldn't figure out why. After lots of troubleshooting, I spotted this line in Console.app:

error	17:33:24.577894-0700	fseventsd	too many clients in system (limit 1024)

I can find very, very little information about this, but it does seem to be true that fseventsd enforces a hard system-wide limit on clients. Some experiments seemed to confirm that every individual directory added via addWatch counts as a “client” for these purposes — there's a 1:1 correlation between calls to addWatch and calls to FSEventStreamCreate.

The error wouldn't actually happen until you called FSEventStreamStartefsw doesn't check its return value, but it can return false for opaque reasons:

It ought to always succeed, but in the event it does not then your code should fall back to performing recursive scans of the directories of interest as appropriate.

My use case is a Node module that wraps efsw and exposes a file-watcher API in JavaScript. It's replacing an older library that did manual filesystem watching. In a single session, as many as 100 directories might be watched, and multiple sessions can run at once. (It's used by an Electron code editor that can have any number of windows open.) So it's extremely plausible for us to run up against this limit of 1024 clients on our own.

Our immediate workaround for this problem is to employ strategies to encourage watcher reuse and make it possible for several “wrapped” watchers to use the same underlying “native” (efsw) watcher:

  • a wrapped watcher for /foo/bar/baz/ can reuse an native watcher on /foo/bar if the latter already exists
  • in the reverse scenario — a new wrapped watcher on /foo/bar when the /foo/bar/baz watcher already exists — creating a new native watcher at /foo/bar, pointing both wrapped watchers at it, and destroying the /foo/bar/baz native watcher
  • when a native watcher already exists at /foo/bar/zort, adding a wrapped watcher at /foo/bar/baz triggers a new native watcher at /foo/bar that can supply events for both paths and destroys the /foo/bar/zort native watcher

This works well enough, but it increases the complexity of the implementation quite a bit. Our watchers don't need to be recursive, but this reuse strategy means that each watcher must be initialized as a recursive watcher (in case it is reused later). Since the reuse logic lives in the JavaScript, it's also the JavaScript code's job to receive all filesystem events and decide which ones match up with active watchers.

It's also possible to reuse too much — after all, we could create one watcher at the volume root and have all of our wrapped watchers use it, but we'd be “drinking from the firehose” and asking our wrapped watchers to spend lots of time processing events that they will eventually ignore (because they happen in directories that aren't being watched).


The quickest fix here, I think, would be to check for a false return value from FSEventStreamStart and handle it by falling back to a generic watcher (or the kqueue watcher, perhaps). But that would be a pretty shallow fix.


The more thorough way to fix this would be to reduce the number of FSEventStreamRefs created. The documentation for FSEventStreamCreate suggests that there is no built-in limit to the number of paths that can be watched by a single stream; but research suggests that you'd have to restart the stream every time you change the list of watched paths.

I don't have the C++ experience to implement this, but imagine:

  • Each efsw::FileWatcher aims to manage two different FSEventStreamRefs: one for paths that need recursion and one for paths that do not need recursion
  • The first time watch is called (or, if it’s called before any paths have been added, the first time addWatch is called), one or both of theseFSEventStreamRefs is created
  • Subsequent calls to addWatch work as follows:
    • The existing stream (recursive or non-recursive as needed) is stopped
    • The new path is added to an internal array
    • A new FSEventStreamRef is created with the new list of paths and is started
    • I think it would be possible to do this in FSEvents in such a way that no filesystem events are lost during the handover; but if I'm wrong, then the swap could instead be staggered, with the new stream starting before the old one is stopped
  • The hard part here is issuing efsw::WatchIDs and figuring out how to match up filesystem events to the watchers that initiated them; easier for the non-recursive watchers (each event belongs to exactly one efsw::WatchID) than the recursive watchers (each event can belong to multiple efsw::WatchIDs)
  • Calls to removeWatch would behave similarly to calls to addWatch — create a new stream without the removed path, then swap it in.
@savetheclocktower savetheclocktower changed the title FSEvents “too many clients in system” error fseventsd “too many clients in system” error Oct 27, 2024
@SpartanJ SpartanJ self-assigned this Oct 27, 2024
@SpartanJ
Copy link
Owner

Hi @savetheclocktower!
Thanks for reporting this.
This is a tricky problem, I honestly wasn't aware of this limitation so I never thought about it while implementing it (also I implemented this a long time ago).

Initially I would want to apply the quick fix, the fallback method to KQueue / generic watcher. But I'm reading in your issue in the pulsar repository that you have a problem with an 500 ms delay, that's literally a sleep in the implementation. KQueue implementation is really old (before FSEvents existence) and I never used it much given that it's a very limiting API, documentation comment:

Kqueue implementation is limited by the maximum number of file descriptors allowed per process by the OS. In the case of reaching the file descriptors limit ( in BSD around 18000 and in OS X around 10240 ), it will fallback to the generic file watcher.

So it was extremely easy to reach the limit, so it's an inappropriate API for the job.

The sleep you saw in the Kqueue implementation is the frequency we refresh the kqueue events due to the fact that kevent call is a blocking call and a kqueue is created for each watch (I guess similar problem than the one you're mentioning about FSEventStreamCreate). I never cared much about kqueue because it's a really bad API at least for my intended usage of efsw (recursive file watchers).

I think it would be possible to do this in FSEvents in such a way that no filesystem events are lost during the handover; but if I'm wrong, then the swap could instead be staggered, with the new stream starting before the old one is stopped

This is my major worry for the proposed new implementation. I don't think implementing the changes would be much work but my main problem is that I don't have unit-tests to verify that I didn't break anything, and I know that there are several projects using the library currently which makes me very conservative with changes. efsw needs unit-tests, but I never took the time myself (too many side projects). But, if you're already testing it with some pulsar unit-tests may be you can help me giving me access to those tests so I can verify against that (+ my own testing). Also we can work on a different branch until we are confident enough that changes are stable.

I must mention that the library is designed with recursive watchers in mind, and does not expect to watch the same directory more than once, there's currently a bug in Linux implementation regarding this topic, because basically for inotify implementation I need to add a watcher for each watched directory (as you mention in your issue thread).

I strongly recommend using recursive watcher for your use case, and I must say I don't completely understand how you reach this limit, and I must clarify that I have plenty experience with editors given that I'm working on my own editor which uses efsw for the same use case. I would expect pulsar to open a watcher for each folder open by the user and the config directory and occasionally open a non-recursive watcher for the folders containing files opened externally from the project folder. At least that's what I do, so I very rarely create more than 10-20 watchers. Performance penalty for using recursive watcher and discarding events should be really low, unless you're managing that in the main/render thread given the limitations of node (worker threads are not an option?).

I can see that you ditched efsw for macOS implementation, but anyway I'm open to work on a solution.

Regards

@SpartanJ SpartanJ added the bug Something isn't working label Oct 28, 2024
@savetheclocktower
Copy link
Author

I can see that you ditched efsw for macOS implementation, but anyway I'm open to work on a solution.

The good news is that this new approach passes all our unit tests and is API-compatible with efsw::FileWatcher, so maybe it's something that could be added as an alternative approach. Here's the code — I'll clean it up a bit and reduce redundancy, but it's already working better than I'd hoped. I am not a C++ guru, so feel free to use it as a starting point for something more elegant.

This is my major worry for the proposed new implementation. I don't think implementing the changes would be much work but my main problem is that I don't have unit-tests to verify that I didn't break anything, and I know that there are several projects using the library currently which makes me very conservative with changes. efsw needs unit-tests, but I never took the time myself (too many side projects). But, if you're already testing it with some pulsar unit-tests may be you can help me giving me access to those tests so I can verify against that (+ my own testing). Also we can work on a different branch until we are confident enough that changes are stable.

Our pathwatcher tests are here, though I wouldn't call them exhaustive. Maybe they can at least give you examples of things to test. I don't have a lot of experience with C++ testing, but I know another Pulsar project uses Catch2 successfully.

I strongly recommend using recursive watcher for your use case, and I must say I don't completely understand how you reach this limit, and I must clarify that I have plenty experience with editors given that I'm working on my own editor which uses ` for the same use case.

The pathwatcher library I'm working in is a legacy library that does a lot of things inefficiently, and was probably written at a time when recursive watchers were less feasible than they are now. For instance, it's obvious that there should be one recursive watcher at each project folder’s root, yet that's not how Atom (Pulsar’s predecessor) implemented it back in 2014. We have a plan to move more of our file-watching to modern libraries — we already use nsfw in other areas of the codebase — but occasionally need something with a synchronous API for backward compatibility. I was surprised to discover how many places we still use it.

When we deprecate pathwatcher further, we will replace it with recursive watching wherever it makes sense to do so.

I would expect pulsar to open a watcher for each folder open by the user and the config directory and occasionally open a non-recursive watcher for the folders containing files opened externally from the project folder. At least that's what I do, so I very rarely create more than 10-20 watchers.

Because I'm constantly hacking on Pulsar, I am a worst-case scenario; so many of my Pulsar packages are loaded from my filesystem in “dev mode,” and in that mode Pulsar watches certain directories in each package for CSS changes so that they can be instantly applied. That's a situation that cries out for something smart enough to realize that it's watching a bunch of directories that, via symlink, share a very close ancestor in the filesystem, and to listen there instead!

Performance penalty for using recursive watcher and discarding events should be really low, unless you're managing that in the main/render thread given the limitations of node (worker threads are not an option?).

Exactly — Node is historically single-threaded and the entire ecosystem is optimized for that. Workers are far more likely to be implemented as separate processes, even in environments where worker threads are an option (since Node ~12, I think.) And even if we could do it in a separate process/thread, it'd be an awkward architecture, and quite inferior to one in which the filtering is done in the native code.

It's a bit less painful with the sort of approach that nsfw uses, since it uses batching and calls into JavaScript with an array of filesystem events. The batching limits how often the C++ will call into JS and makes it a predictable cost.

But I'm reading in pulsar-edit/pulsar#1104 in the pulsar repository that you have a problem with an 500 ms delay, that's literally a sleep in the implementation.

I noticed that — but I didn't seem to have much luck with reducing that value. It was a few days ago and I was trying lots of things, but it didn't seem to cause kqueue to catch the events that it had been missing. Luckily, the new FSEvents approach is working so well that I don't think I'll have to revisit it.

@SpartanJ
Copy link
Owner

The good news is that this new approach passes all our unit tests and is API-compatible with efsw::FileWatcher, so maybe it's something that could be added as an alternative approach. Here's the code — I'll clean it up a bit and reduce redundancy, but it's already working better than I'd hoped. I am not a C++ guru, so feel free to use it as a starting point for something more elegant.

Excellent! Your implementation looks correct, I'm glad it worked. I'll implement it and do some testing when I have some time. Thanks for taking the time to investigate the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants