Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missed probes metrics #1941

Merged
merged 8 commits into from
Aug 6, 2024
Merged

Add missed probes metrics #1941

merged 8 commits into from
Aug 6, 2024

Conversation

olsajiri
Copy link
Contributor

@olsajiri olsajiri commented Jan 8, 2024

wip needs cilium/ebpf#1295

@olsajiri olsajiri added the release-note/minor This PR introduces a minor user-visible change label Jan 8, 2024
@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 2 times, most recently from f3305dc to 97ca33b Compare March 28, 2024 13:21
Copy link

netlify bot commented Mar 28, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit 02900fe
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/66b0d848057b810008ebb48b
😎 Deploy Preview https://deploy-preview-1941--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 2 times, most recently from 1c504b7 to d270fd3 Compare March 28, 2024 14:47
@lambdanis lambdanis added the area/metrics Related to prometheus metrics label Mar 28, 2024
pkg/metrics/probemetrics/probemetrics.go Outdated Show resolved Hide resolved
MissedProbes = metrics.NewBPFCounter(prometheus.NewDesc(
prometheus.BuildFQName(consts.MetricsNamespace, "", "missed_probes_total"),
"The total number of Tetragon probe missed per policy,probe.",
[]string{"policy", "attach"}, nil,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the attach label exactly?

I'm asking because tetragon_policy_events_total metric has a hook label with values like kprobe:__sys_setfsgid. Is attach essentially same or only similar? I'm thinking of standardizing some labels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so attach is kernel function name for single kprobe and kprobe_multi (3 functions) for kprobe multi
similar for uprobes.. I'm not sure it's good idea to standardize this, we might want to change in future ;-)

pkg/metrics/probemetrics/probemetrics.go Outdated Show resolved Hide resolved
}

var (
MissedProbes = metrics.NewBPFCounter(prometheus.NewDesc(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To have this metric documented in the metrics reference, we need to create a separate dummy collector that reports this metrics with example labels. See 861489a for an example, and consts.go for example labels.

This is not ideal tbh. I'm working on a better metrics library for Tetragon use cases, so hopefully soon it will be more straightforward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will add that

@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 4 times, most recently from 97b720d to 62c22cf Compare April 9, 2024 13:23
@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 3 times, most recently from 62db9fa to cc05052 Compare May 11, 2024 19:04
@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 2 times, most recently from b24768d to a87a9e2 Compare May 28, 2024 11:04
@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 5 times, most recently from 59b3588 to 6966716 Compare June 19, 2024 10:17
Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :)

@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 2 times, most recently from 744570d to 22437f4 Compare July 11, 2024 19:24
@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 2 times, most recently from b161f68 to 9ef973e Compare July 31, 2024 09:39
@olsajiri olsajiri marked this pull request as ready for review July 31, 2024 10:15
@olsajiri olsajiri requested a review from a team as a code owner July 31, 2024 10:15
@olsajiri olsajiri requested a review from kkourt July 31, 2024 10:15
Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM.

One question what exactly is this missed metric? :)
Is it when a link exists without a program? Or is it something else?

@olsajiri
Copy link
Contributor Author

Thanks, LGTM.

One question what exactly is this missed metric? :) Is it when a link exists without a program? Or is it something else?

so it's incremented any time we don't execute the bpf program due to recursion in kernel,
like when you have 2 kprobe programs trying to run on top of each other on the same cpu

there was a case +- year ago where we were loosing exit events for this reason due to wrong
hook we used.. now if that happens again we will see that in the metrics

@olsajiri
Copy link
Contributor Author

@lambdanis could you please recheck, I changed few things on metrics side because the tetragon base changed

Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  1. Ideally we should use MustNewCustomCounter and NewCustomCollector from pkg/metrics here. See how other collectors were defined in a05fbc7 or 7309c9b. If it's a trouble for some reason, I'm ok with merging as-is, but then we should have a follow-up item to refactor it.
  2. The new metrics are not documented. Using helpers from 1. should provide an easy way to include them in the generated reference.
  3. I'm not sure kprobe_multi (3 functions) is a useful label value. Seeing such metric, how do I find out which functions were missed?

@olsajiri olsajiri force-pushed the pr/olsajiri/missed branch 3 times, most recently from 80c5f4d to 42d1674 Compare August 2, 2024 08:21
@olsajiri
Copy link
Contributor Author

olsajiri commented Aug 2, 2024

A few comments:

1. Ideally we should use `MustNewCustomCounter` and `NewCustomCollector` from `pkg/metrics` here. See how other collectors were defined in [a05fbc7](https://github.com/cilium/tetragon/commit/a05fbc71ba49225c045a6354737724810409e4c7) or [7309c9b](https://github.com/cilium/tetragon/commit/7309c9ba22a2b07bae182e5ea8b10aa214be13fd). If it's a trouble for some reason, I'm ok with merging as-is, but then we should have a follow-up item to refactor it.

ok, used that in new version

2. The new metrics are not documented. Using helpers from 1. should provide an easy way to include them in the generated reference.

ok

3. I'm not sure `kprobe_multi (3 functions)` is a useful label value. Seeing such metric, how do I find out which functions were missed?

we don't have any more insight than that.. we get the counter per link/program that is attached to multiple functions,
and the missed counter spans over all attached functions.. we discussed some time ago to add a feature that would break down the counter to specific functions, but there was no need so far

@lambdanis
Copy link
Contributor

ok, used that in new version

I see you use MustNewCustomCounter but the collector is defined as a separate type, not using NewCustomCollector. It would be nice to have the new metrics documented (in docs generated by make metrics-docs), and using NewCustomCollector should make it easy (you need to pass a collectForDocs function reporting fake metrics).

At the moment we keep unloaded programs in AllPrograms.

Making AllPrograms local and locked by mutex. Adding new helpers
progsAdd and progsCleanup to add and cleanup programs (on sensor
unload) respectively.

Signed-off-by: Jiri Olsa <[email protected]>
This just matters for log output, changing from:
  time="2024-03-28T13:08:34Z" level=info msg="Loading registered BPF probe" Attach="3 functions" Program=...

into:
  time="2024-03-28T13:08:34Z" level=info msg="Loading registered BPF probe" Attach="kprobe_multi (3 functions)" Program=...

It has more info and will help in following changes where the Attach
string will serve as the metric label.

Signed-off-by: Jiri Olsa <[email protected]>
Storing link in program so we can access link's stats
and store them as metrics.

Signed-off-by: Jiri Olsa <[email protected]>
Storing ebpf.Program instance in program.Program to allow
ebpf related stats retrieval in following changes.

Signed-off-by: Jiri Olsa <[email protected]>
Adding support to set policy name for program so we can use it
as a metric label in following changes.

We are going to support missed stats retrieval for kprobes in
following changes so at ATM we set policy for kprobe programs
which are in base sensor and generic kprobe sensor.

Signed-off-by: Jiri Olsa <[email protected]>
Adding HasMissedStatsKprobe/Multi functions to detect support
of kprobe missed stats.

Checking this by searching BTF for needed structs, because AFAICS
there's no 'functional' test possible.

Signed-off-by: Jiri Olsa <[email protected]>
Adding metrics for missed runs on program and link level to kprobemetrics
package and logic to store and collect missed stats.

The missed stats are supported for all programs and kprobe/kprobe-multi links.
They are stored per 'attach name' and 'policy name'.

For programs (not just kprobes):

  tetragon_missed_prog_probes_total{attach="__x64_sys_linkat",policy="sys-linkat-passwd"} 68
  tetragon_missed_prog_probes_total{attach="acct_process",policy="__base__"} 60
  tetragon_missed_prog_probes_total{attach="sched/sched_process_exec",policy="__base__"} 64
  tetragon_missed_prog_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 66
  tetragon_missed_prog_probes_total{attach="wake_up_new_task",policy="__base__"} 62

For kprobe and kprobe-multi links:

  tetragon_missed_link_probes_total{attach="__x64_sys_linkat",policy="sys-linkat-passwd"} 45
  tetragon_missed_link_probes_total{attach="acct_process",policy="__base__"} 39
  tetragon_missed_link_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 43
  tetragon_missed_link_probes_total{attach="wake_up_new_task",policy="__base__"} 41

  tetragon_missed_prog_probes_total{attach="acct_process",policy="__base__"} 40
  tetragon_missed_prog_probes_total{attach="kprobe_multi (1 functions)",policy="sys-linkat-passwd"} 48
  tetragon_missed_prog_probes_total{attach="sched/sched_process_exec",policy="__base__"} 44
  tetragon_missed_prog_probes_total{attach="security_bprm_committing_creds",policy="__base__"} 46
  tetragon_missed_prog_probes_total{attach="wake_up_new_task",policy="__base__"} 42

Note changing the healthMetrics group to be created as not constrained,
so it can carry new metrics. It will be addressed in future by adding
debug metrics group.

Signed-off-by: Jiri Olsa <[email protected]>
Adding test for tetragon_missed_prog_probes_total metric.

At the moment I'm able to trigged missed event only for kprobe_multi
and for prog recursion miss. We will need to do special setup to hit
other missed counts, adding that on my todo list.

Signed-off-by: Jiri Olsa <[email protected]>
Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good :)

@olsajiri olsajiri merged commit 5ed9f84 into main Aug 6, 2024
47 checks passed
@olsajiri olsajiri deleted the pr/olsajiri/missed branch August 6, 2024 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants