[TRACKING] Create a more coherent stats model for libs and consumer #1463

incertum · 2023-10-31T21:06:26Z

As stated in the most recent libs stats refactor/extension PR #1433: #1433 (comment), I am creating this issue to track future refactors that, instead of adding more metrics that we urgently need, aim to create a new, more coherent stats model (software interface/API) across scap, lisinsp, and consumers such as Falco.

CC @jasondellaluce

incertum · 2023-11-14T00:36:41Z

@Andreagit97 answering here re #1433 (comment)

If the final goal is to expose only a Prometheus endpoint using something like https://github.com/jupp0r/prometheus-cpp, are we doing the right thing by exposing a vector of struct scap_stats_v2? At this point we could directly create Prometheus structs in sinsp and expose them to the consumers, so they can add them to the registry and expose them.

The Falco client shall continue to support (1) a simple file sink and (2) the internal rule (JSON formatted) and (3) a prometheus exporter was requested as additional option, but it's not yet on the concrete roadmap, maybe Falco 0.38?

As infrastructures / environments are very complex and vary among adopters I believe each of these option is valid and needed. Personally I have experienced that Prometheus only will not solve all of your problems when looking at monitoring very large and diverse environments holistically and the "bring your own metrics" over the internal Falco rule can shine in terms of simplicity and ease of integration into the existing alert output sink that you have to setup anyways if you use Falco (especially non Kubernetes environments).

That being said if the Prometheus struct will have advantages why not? Furthermore, I think the current scap stats v2 schema is not bad and we would be able to convert to Prometheus easily.

@Andreagit97 since you brought this up I was actually already thinking about How To for Falco. Let me attempt to share some more thoughts. On the one hand we have already decided that we want the Prometheus exporter, on the other hand we have also experienced firsthand that such items are a significant LOE and may be only needed or relevant for some time until the next trend arises (see for example the old k8s client or mesos etc) -- we would not know beforehand.

And maybe instead of a full client we can create a solution that pushes some JSON we already have up (I have to first explore options and learn what's possible).

In summary, I think we should discuss our output sinks in general before rushing into decisions as we have to maintain them as well (more maintenance overhead than our current internal metrics btw) and all we want is some custom metrics to check if Falco goes wonky, it's not our tools primary mission to emit metrics. Maybe @leogr you would have additional thoughts?

Let's also checkout some related projects:

OpenTelemetry should be on our radar as well and lastly for example the jaeger project changed approaches frequently wrt to the output sinks.

incertum · 2023-11-14T01:08:26Z

There was an additional comment by @Andreagit97 #1433 (review)

First the pain of scap <-> sinsp separation that doesn't really apply to the stats / metrics anyways becomes very evident. Perhaps significant renaming can help for sure.

And in addition yes that get... method will go away in the future in favor of a better approach, likely a lean stats CPP class that integrates with scap more elegantly. First however we need to agree on a design and everyone is so busy right now. That's actually why we agreed on this intermediary patch in the Oct 2023 core maintainers meeting as we will benefit from finally having these metrics in prod around the libsinsp state rather sooner than later.

Therefore, to reiterate the patch in #1433 is an intermediary step only that basically extends the buffer that was initially only created for the sinsp resource utilization metrics.

Andreagit97 · 2023-11-16T19:02:34Z

Uhm got it, they are all valid points, thank you for pointing them out!

on the custom rule for metrics, I agree with you, I like it and it seems pretty valuable also for debug purposes
on the possibility of writing a file with metrics, I have more doubts, I would like to understand if it can be really used in production and how it would be handy. That said, the cost of maintaining shouldn't be so high, so not a big deal, i would like to have some metrics on its adoption between users...
regarding Prometheus, I agree that it could require some work but I see great value in having it, i think it is one of the most widespread solutions to collect metrics in the cloud, but also here, it would be interesting to understand what the community thinks about it. And yes, I agree that this is a future step not for Falco 0.37.0

In general, I think that we need to involve more the community in these choices, the ones listed above are just my opinions but maybe they are completely wrong... one idea could be to post somewhere a question like "How would you like to collect Falco metrics? A prometheus exporter? a custom rule? or whatever" WDYT?

And in addition yes that get... method will go away in the future in favor of a better approach, likely a lean stats CPP class that integrates with scap more elegantly. First, however, we need to agree on a design and everyone is so busy right now. That's actually why we agreed on this intermediary patch in the Oct 2023 core maintainers meeting as we will benefit from finally having these metrics in prod around the libsinsp state rather sooner than later.

I'm in favor of an intermediary patch to expose new metrics what I was just saying in the PR was that IMO we need to start to have a little bit clearer what we want to expose from libs and in which form, we have many metrics in many different places and it starts to become difficult to not introducing bugs or to expose stable APIs for libs users. So the idea was, it's ok to have other metrics but let's try to find a glue between all the things we want to expose. I will try to play a little bit with the actual methods we have, to see if we can find a way to collect them all under a unique place. The aim is still to have it in 0.14.0 I will try to propose some possible approaches in the next few days. If you are on PTO I can try to lead the workflow, if you are fine with that of course. Let's see what we can do :)

incertum · 2023-11-17T02:49:09Z

[1] I always encourage collaboration with SREs to gain insights into their day-to-day approaches. For instance, writing to (log rotated) files and subsequently monitoring them for log transport options is a well-established practice in infrastructure management.

[2] Prometheus: There may be more options to support a Prometheus exporter than implementing a full Prometheus client or adding a new dependency to the project. For instance, Prometheus also supports push, not just pull, and /metrics could be exposed via atomically writing to a file with a POM-compatible string per line containing the metrics. If this approach is suitable for Falco, it would be fast, easy, and low-maintenance. Given that these potential approaches are based on some informal discussions, I would recommend conducting initial experiments.

[3]

I understand that for example scap has undergone numerous refactoring efforts this year. However, transparent plans outlining these changes were not shared or discussed beforehand. I'm curious about the rationale behind treating metrics differently in this refactoring.

Proposing that we move forward with the current PR, deploy and test it in production, and gather real-world data to inform our decision-making process. This data will help us ensure that the project's best interests are met.

I'm eager to collaborate with the team in December and beyond to develop a more cohesive stats model that aligns with our overall objectives. I recognize that the current PR involves significant cleanup of previous metrics code, which is a valuable contribution. Deferring the implementation of a comprehensive stats model until we have gathered more data will allow us to make more informed decisions and avoid overburdening this PR.

poiana · 2024-02-26T15:49:19Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum · 2024-02-26T17:54:10Z

/remove-lifecycle stale

incertum · 2024-03-13T00:32:07Z

More ideas for possible follow up PRs by @FedeDP:

refactor!(libsinsp): coherent metrics interface metrics_collector class + text-based Prometheus exposition format support #1652 (comment)
refactor!(libsinsp): coherent metrics interface metrics_collector class + text-based Prometheus exposition format support #1652 (comment)

@FedeDP to address above in a follow up PR to #1652

FedeDP · 2024-03-13T15:47:11Z

Tackled second point here: #1745

incertum · 2024-05-14T19:42:38Z

/milestone 0.15.0

poiana · 2024-05-14T19:42:40Z

@incertum: The provided milestone is not valid for this repository. Milestones in this repository: [0.17.0, TBD, next-driver]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 0.15.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

FedeDP · 2024-05-15T06:51:52Z

/milestone 0.17.0

incertum added the kind/feature New feature or request label Oct 31, 2023

incertum mentioned this issue Oct 31, 2023

cleanup(libsinsp): consolidation and extension of libsinsp stats / metrics sinsp_stats_v2 #1433

Merged

Andreagit97 added this to the TBD milestone Nov 20, 2023

incertum mentioned this issue Nov 27, 2023

update(userspace/falco): add libsinsp state metrics option falcosecurity/falco#2883

Merged

Andreagit97 modified the milestones: TBD, 0.14.0 Nov 28, 2023

incertum mentioned this issue Jan 27, 2024

refactor!(libsinsp): coherent metrics interface metrics_collector class + text-based Prometheus exposition format support #1652

Merged

leogr mentioned this issue Feb 19, 2024

Brainstorming: integrate Prometheus /metrics handler in Falco falcosecurity/falco#1772

Closed

poiana added the lifecycle/stale label Feb 26, 2024

poiana removed the lifecycle/stale label Feb 26, 2024

incertum mentioned this issue Mar 6, 2024

feat(metrics): sync libs metrics collector refactor falcosecurity/falco#3129

Merged

incertum closed this as completed May 14, 2024

poiana modified the milestones: TBD, 0.17.0 May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TRACKING] Create a more coherent stats model for libs and consumer #1463

[TRACKING] Create a more coherent stats model for libs and consumer #1463

incertum commented Oct 31, 2023

incertum commented Nov 14, 2023

incertum commented Nov 14, 2023

Andreagit97 commented Nov 16, 2023 •

edited

Loading

incertum commented Nov 17, 2023

poiana commented Feb 26, 2024

incertum commented Feb 26, 2024

incertum commented Mar 13, 2024 •

edited

Loading

FedeDP commented Mar 13, 2024

incertum commented May 14, 2024

poiana commented May 14, 2024

FedeDP commented May 15, 2024

[TRACKING] Create a more coherent stats model for libs and consumer #1463

[TRACKING] Create a more coherent stats model for libs and consumer #1463

Comments

incertum commented Oct 31, 2023

incertum commented Nov 14, 2023

incertum commented Nov 14, 2023

Andreagit97 commented Nov 16, 2023 • edited Loading

incertum commented Nov 17, 2023

poiana commented Feb 26, 2024

incertum commented Feb 26, 2024

incertum commented Mar 13, 2024 • edited Loading

FedeDP commented Mar 13, 2024

incertum commented May 14, 2024

poiana commented May 14, 2024

FedeDP commented May 15, 2024

Andreagit97 commented Nov 16, 2023 •

edited

Loading

incertum commented Mar 13, 2024 •

edited

Loading