preview: Use dumb-init as PID 1 #2982

Merged
merged 1 commit into from
Apr 15, 2024

Conversation

Contributor

@agunnerson-elastic agunnerson-elastic commented Apr 9, 2024

build_docs.pl runs a number of git commands in its normal course of operation. After a certain point, these git commands implicitly run git gc in the background. By design, the original git process forks off the gc child process and forgets about it. The child process gets reparented to PID 1, which is either build_docs.pl or a shell. Neither of those handles SIGCHLD for processes it is not aware of, so the exited gc processes are never reaped and the number of zombie processes grows indefinitely.
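
To make the zombie mechanism concrete, here is a minimal sketch in Python (not part of this PR, Linux only because it reads `/proc`). The helper names `proc_state` and `demo_zombie` are made up for illustration. A child that has exited but has not been waited on shows state `Z` until its parent calls `waitpid`:

```python
import os
import time


def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split after the last ')'. The state is the next field.
    return data.rsplit(")", 1)[1].split()[0]


def demo_zombie():
    """Fork a child that exits at once, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without cleanup

    # Parent: deliberately do not wait yet. Poll until the kernel reports
    # the exited child as a zombie ('Z'), giving up after five seconds.
    deadline = time.monotonic() + 5
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # reaping removes the process table entry
    reaped = not os.path.exists(f"/proc/{pid}")
    return state, reaped
```

In the container, nothing performs the final `waitpid` step for the orphaned gc processes, which is why they pile up.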

With a large number of zombie processes (80k observed in elastic-apps-web), metricbeat starts consuming all available CPU resources, severely degrading the performance of the cluster. As a side effect, network throughput drops to around 50 Mbps and I/O wait rises to around 30%.

dumb-init was picked because it's dumb and doesn't try to do anything fancy. There should be nothing that interferes with build_docs.pl's operation. Signals are forwarded to dumb-init's child. dumb-init will also detach itself from the controlling TTY and attach its child process instead.
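
For readers unfamiliar with what a minimal init does, the core loop can be sketched roughly as follows. This is an illustrative Python sketch, not dumb-init's actual implementation (dumb-init is written in C). The function name `run_as_init` and the set of forwarded signals are assumptions, the session and TTY handling mentioned above is omitted, and it needs Python 3.9+ for `os.waitstatus_to_exitcode`:

```python
import os
import signal


def run_as_init(argv):
    """Run argv as a child, forward signals to it, and reap every child."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # command could not be executed

    def forward(signum, _frame):
        # Pass the signal on to the main child; ignore it if already gone.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass

    # Assumption: forward a small, fixed set of signals for illustration.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, forward)

    while True:
        try:
            # Blocks until any child exits. When running as PID 1 this
            # includes orphans that were reparented to us.
            pid, status = os.wait()
        except ChildProcessError:
            return 0  # no children left at all
        if pid == child:
            code = os.waitstatus_to_exitcode(status)
            # Shell convention: report death by signal N as 128 + N.
            return 128 - code if code < 0 else code
        # Any other pid was an orphan; os.wait() has already reaped it.
```

Calling `run_as_init(["sh", "-c", "exit 3"])` returns 3. When such a loop runs as PID 1, orphaned processes like the background git gc are reaped as soon as they exit instead of staying around as zombies.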

Ticket: https://elasticco.atlassian.net/browse/SRVCS-1367

(2024-04-12 update: This has been successfully tested in the elastic-apps-web cluster.)

Signed-off-by: Andrew Gunnerson <[email protected]>

github-actions bot commented Apr 9, 2024

A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

Member

@gtback gtback left a comment

Great investigation and thorough explanation, @agunnerson-elastic! I'm not familiar with dumb-init (I've often seen tini used for this purpose), but it seems to do exactly what we need.

cc: @nik9000 in case you know of anything related to docs-preview that might be relevant that I'm not aware of.

Member

gtback commented Apr 9, 2024

It also looks like my review isn't enough here, since I'm not an owner of this repo anymore 🙃


@maneta maneta left a comment

LGTM thank you for this!!

Member

@bmorelli25 bmorelli25 left a comment

Very neat. Thank you!

@agunnerson-elastic agunnerson-elastic merged commit 308feab into master Apr 15, 2024
5 checks passed
@agunnerson-elastic agunnerson-elastic deleted the SRVCS-1367 branch April 15, 2024 15:55
4 participants