Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send SIGUSR1 to dnsmasq periodically #644

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

DamianSawicki
Copy link
Collaborator

This adds a feature flag --logInterval for periodic triggering dnsmasq to log its statistics by sending SIGUSR1 to it.

The new motivation for adding this feature comes from dnsmasq changes in v2.90: the new flag --max-tcp-connectionsand the related statistics of TCP connection utilisation added to the log output triggered by SIGUSR1.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: DamianSawicki
Once this PR has been reviewed and has the lgtm label, please assign mrhohn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 6, 2024
@DamianSawicki
Copy link
Collaborator Author

/test all

@DamianSawicki
Copy link
Collaborator Author

The failed test fails identically at HEAD (issue #646).

@DamianSawicki DamianSawicki marked this pull request as ready for review October 6, 2024 18:33
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 6, 2024
@DamianSawicki
Copy link
Collaborator Author

/cc @aojea

@aojea
Copy link
Member

aojea commented Oct 6, 2024

I think you have several changes in the same commit that makes it hard to review.

In addition, what is the motivation to make this configurable, I don't think we should allow people to use this feature if is not for debugging, and if that is the case, you can perfectly do it with a simple bash script

@DamianSawicki
Copy link
Collaborator Author

Thank you very much for a quick reaction, Antonio!

I think you have several changes in the same commit that makes it hard to review.

Yeah, I replace nanny.X() with if err := nanny.X(); err != nil { log } for X in {Kill, Start}, which I'm happy to move to a separate commit, but everything else belongs together: there is a functionality change with the corresponding test, and the new test case requires expanding the testing fixture.

I don't think we should allow people to use this feature

You've mentioned running a bash script. Do I understand correctly that users can also run such a script sending SIGUSR1 to a dnsmasq instance? If so, why don't you want to allow them to use this feature?

In addition, what is the motivation to make this configurable

We can hardcode some interval, but I thought it would be safer to have things configurable, even just for turning the feature on/off.

The general motivation is to improve dnsmasq observability and not to require users to run custom scripts (or reach out to their cloud providers) to gain some insights about their kubedns load. Does it make sense to you?

@aojea
Copy link
Member

aojea commented Oct 6, 2024

I can understand if you expose these metrics with a metrics/prometheus endpoints, that allows you to set alerts and integrate with common monitoring stacks ... but you still needs to evaluate the performance implications on sending a periodic signal to the process

@bowei
Copy link
Member

bowei commented Oct 7, 2024

Does dnsmasq handle syscall interruption gracefully? e.g. on receipt of signal, the syscall is canceled with EINTR.

@DamianSawicki
Copy link
Collaborator Author

@bowei

Does dnsmasq handle syscall interruption gracefully? e.g. on receipt of signal, the syscall is canceled with EINTR.

Yes, dnsmasq mentions specific actions on SIGUSR1, SIGUSR2, andSIGHUP in its linux man page and it is a very mature piece of software, so I cannot imagine it handling signals incorrectly. Just to be 100% sure, I've just checked the codebase, and there are many of instances of errno == EINTR and errno != EINTR throughout.

@aojea

you still needs to evaluate the performance implications on sending a periodic signal to the process

The statistics logged by dnsmasq on SIGUSR1 contain not just the current utilisation, but also the maximal utilisation since the last SIGUSR1, so I was thinking about using ~5 min intervals in practice (1 min at least, while 1 h would still be better than nothing). The function dump_cache handling SIGUSR1 is cheap (unless dnsmasq runs in debug mode when it does a full cache dump). With such infrequent and minimal additional load and correct handling of EINTR, performance implications should be negligible.

I can understand if you expose these metrics with a metrics/prometheus endpoints, that allows you to set alerts and integrate with common monitoring stacks

For what you describe, the first step is still the current PR, which makes dnsmasq print statistics to logs, which we can later parse and expose with metrics/prometheus endpoints or set up log-based metrics. Do you think the current PR is only sensible if extended with these follow-up steps?

@aojea
Copy link
Member

aojea commented Oct 7, 2024

I don't know, it really feels awkward running a process in production that you are sigkilling periodically ... but I don't know if this pattern is also used elsewhere

@DamianSawicki
Copy link
Collaborator Author

DamianSawicki commented Oct 7, 2024

that you are sigkilling periodically

For anyone reading this without much context, Antonio meant sending a signal periodically. The signal SUGUSR1 does not kill the process, it's just a way of communicating with it. Dnsmasq is a rather old piece of software, and that's one way of communication that it supports, but I agree that it may feel awkward today.

In any case, PRs to this repo are very rare, which means the resources are limited (two PRs with vulnerability fixes have been waiting for attention since July). The present PR (which actually was on my to-do list for months) is a rather low-hanging fruit to improve kubedns observability which is among priorities to help with supportability. Let's not rush and see what others have to say.

@aojea
Copy link
Member

aojea commented Oct 7, 2024

problems on typing fast, thanks for clarifying, I meant what Damian said 😅

@pwdtr
Copy link

pwdtr commented Oct 7, 2024

+1, Good work, Damian!

@bowei
Copy link
Member

bowei commented Oct 25, 2024

If we can do a test and include the results in the commit message to demonstrate that sending the signal is benign:

  • Send constant stream of dnsperf requests to dnsmasq
  • Enable the periodic interval of SIGUSR1

This adds a feature flag --logInterval for periodic triggering dnsmasq
to log its statistics by sending SIGUSR1 to it.

The new motivation for adding this feature comes from the recently
added dnsmasq flag --max-tcp-connections and the related statistics of
TCP connection utilisation being added to the log output triggered
by SIGUSR1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants