Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the capturing of environment variables #327

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

RononDex
Copy link

Use case behind this PR

This is part of my bachelor thesis at the University of Applied Sciences North-Western Switzerland (FHNW).

We are exploring the usage of the opentelemetry-ebpf-profiler for HPC environments (specifically for the ESA Euclid mission). Since there are many different jobs running in parallel on HPC clusters, admins need a way to differentiate profiling data between missions and pipelines to figure out who or what is causing a congestion of the system.

For this purpose Environment variables are passed to executables in HPC environments (for example by SLURM or custom pipeline processing management systems). Being able to group and filter by some selected environment variables is therefor crucial for HPC clusters and developers that are running their code on said clusters.

Implementation

The environment variables are read from /proc/<pid>. I noticed that there is a way to extend captured profiling data with custom metadata using the SampleAttrProducer.ExtraSampleAttrs. However, since the ProcessManager already captures PID specific data from /proc/<pid> I thought it might be a good idea to capture the environment variable directly in there.

Currently the list of captured environment variables is static at compile time and defined in the ProcessManager

Todo's / questions

  • How does this implementation hold up with the existing architecture / philosophy of opentelemetry-ebpf-profiler?
  • What would be the best way to make the list of captured environment variables configurable (using a configuration file for example?)
  • How can my code be improved? (I am new to the language Go)
  • Currently only the otlp reporter was adjusted to include the env vars as attributes on a Sample level. Other reporters would need to be adjusted too.

Store Environment variables in ProcessManager and make it available
in all vertical layers

Support capturing env variables
@RononDex RononDex requested review from a team as code owners January 28, 2025 10:53
Copy link

linux-foundation-easycla bot commented Jan 28, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

Comment on lines +42 to +49
type ProcessManagerConfig struct {
extractEnvVars []string
}

var pm_cfg = ProcessManagerConfig{extractEnvVars: []string{
"PIPELINE_PPOID",
"PIPELINE_SOFTWARENAME",
"PIPELINE_JOBID"}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest using a map for faster (O(1)) lookup.
Also, this needs to be configurable, as you already said.
This map of env vars could be part of tracer/Config and be passed to processmanager.New().
Then it needs a configuration option exposed to the user, see cli_flags.go.

The scanning of the environ file should only happen if that map of env variables is not empty.

@rockdaboot
Copy link
Contributor

👍 Having the option of reporting a selected set of env variables per process/sample makes sense to me. It might blow up the size of profiling messages, but IMO that can be discussed outside this PR.

How does this implementation hold up with the existing architecture / philosophy of opentelemetry-ebpf-profiler?

IMO fits into the current implementation (ProcessManager.updatePidInformation().

What would be the best way to make the list of captured environment variables configurable (using a configuration file for example?)

A configuration file is an option, but might be over-engineered. If it just a small set of variable names, it can be configured as a comma-separated list of strings.

How can my code be improved? (I am new to the language Go)

Reviewers will suggest improvements during the following review process.

Currently only the otlp reporter was adjusted to include the env vars as attributes on a Sample level. Other reporters would need to be adjusted too.

Reviewers will ask you for tests and possibly missing parts.

@florianl
Copy link
Contributor

While there is interest to collect and forward more information about a process, I think, the core functionality of the eBPF profiler should stay around unwinding and providing stack traces.

To enrich process information, there are several options. Did you consider

Overall, I think, it should not be the eBPF profilers task to collect as much environmental information as possible to forward it.

@christos68k
Copy link
Member

christos68k commented Jan 28, 2025

While there is interest to collect and forward more information about a process, I think, the core functionality of the eBPF profiler should stay around unwinding and providing stack traces.

To enrich process information, there are several options. Did you consider

It seems that the OpenTelemetry collector supports reading environment variables directly in its configuration file, and that can be leveraged from an existing processor, such as attributesprocessor to insert environment variables as attributes.

So I agree with @florianl that this is probably the way to go, as we're trying to remove extraneous functionality that doesn't strictly depend on profiling internals. In some cases it's important to collect information inside the profiling agent, an example would be process name or executable name, as delayed collection may result in missing that metadata when the process exits.

For environment variables however this should not be a problem. I looked at the PR after writing this and noticed you're doing per-PID collection which will not work with the OTel collector configuration file (as that's per profiling agent process) and is subject to the same concerns (timing, process exit) as the other per-PID bits of metadata we collect. For reliable collection, the approach you're using in this PR would then be the way to go.

To summarize, I think we should discuss this in the next Profiling SIG and decide if per-PID environment variables are something we want to collect as part of the agent. We would then also need to decide how to document this functionality, since it's not part of existing OTel semantic conventions (one option would be to make it part of profiles specification, another to introduce a new semantic convention key for process environment).

splittedVars := strings.Split(string(envVars), "\000")
fmt.Println("EnvVars for PID" + strconv.Itoa(int(pid)))
for _, envVar := range splittedVars {
keyValuePair := strings.Split(envVar, "=")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct as values can themselves contain "=".

Suggested change
keyValuePair := strings.Split(envVar, "=")
keyValuePair := strings.SplitN(envVar, "=", 2)

@RononDex
Copy link
Author

Thanks for all your feedback. Sadly OpenTelemetry Collector did not seem like a viable option. The OpenTelemetry Collector does not have to run on the same machine as this eBPF profiler, making it impossible to get the needed data in the OpenTelemetry collector.

The eBPF profiler seemed the correct layer to implement this feature because of this.

In HPC environments we usually have multiple executable (processes, executing different binaries) spread over multiple machines / pilots for the same Job. Being able to group and filter those to properly understand code paths, hotspots and bottlenecks environment variables on a PID level are necessary.

@christos68k
Copy link
Member

Thanks for all your feedback. Sadly OpenTelemetry Collector did not seem like a viable option. The OpenTelemetry Collector does not have to run on the same machine as this eBPF profiler, making it impossible to get the needed data in the OpenTelemetry collector.

The eBPF profiler seemed the correct layer to implement this feature because of this.

To clarify: The eBPF profiler will run as part of the OTel collector (as a receiver) in the near future and this will be the standard supported configuration. It's not certain at this point in time whether we'll continue to support a standalone profiling agent. However, this doesn't really affect the implementation of collecting per-PID metadata, as a similar approach to the one implemented in this PR will need to be used if reliability is required.

@florianl
Copy link
Contributor

Sadly OpenTelemetry Collector did not seem like a viable option. The OpenTelemetry Collector does not have to run on the same machine as this eBPF profiler, making it impossible to get the needed data in the OpenTelemetry collector.

Can you be more precise on this? I think, in the mid to long run, there will be no independent eBPF profiler binary as it will be part of the OTel collector.

a costum implementation of package reporter. Check out parca-agent and how they implement their individual package reporter to fit their backend.

Can you point out, why using a similar approach, like parca-agent, which uses the functionality of the eBPF profiler but with custom extensions, is not a viable option?

@RononDex
Copy link
Author

Can you be more precise on this? I think, in the mid to long run, there will be no independent eBPF profiler binary as it will be part of the OTel collector.

Ah I did not know that. I thought the ebpf profiler and the Otel collector will be two seperate processes / infrastructure layers.

Can you point out, why using a similar approach, like parca-agent, which uses the functionality of the eBPF profiler but with custom extensions, is not a viable option?

I have tried parca, however it seemed to require a paid license for most functionality and I was unable to get it to work with another backend / frontend (other than parca).

@florianl
Copy link
Contributor

Can you point out, why using a similar approach, like parca-agent, which uses the functionality of the eBPF profiler but with custom extensions, is not a viable option?

I have tried parca, however it seemed to require a paid license for most functionality and I was unable to get it to work with another backend / frontend (other than parca).

I'm not asking to use parca (but also feel free to use it if it fits your need), but more like following the approach of the parca-agent. parca-agent uses eBPF profiler as Go module and builds for their custom needs around it.

Before open sourcing, the agent used to also collect more information about the environment. But this functionality got removed, see #90, as this can be done with other approaches.

@christos68k
Copy link
Member

christos68k commented Jan 28, 2025

Before open sourcing, the agent used to also collect more information about the environment. But this functionality got removed, see #90, as this can be done with other approaches.

We need to distinguish between per-agent-process environment collection and per-trace-event-PID (partial)environment collection. Per-agent-process environment collection can take place in a post-processing step in delayed fashion, e.g. in an OTel collector processor.

Per-trace-event-PID (partial)environment collection is time-sensitive and, if reliability is required, needs to be synchronized in the same manner as the other per-trace-event-PID bits of metadata the agent collects.

EDIT: Added a discussion point to next week's Profiling SIG agenda

@rockdaboot
Copy link
Contributor

In HPC environments we usually have multiple executable (processes, executing different binaries) spread over multiple machines / pilots for the same Job. Being able to group and filter those to properly understand code paths, hotspots and bottlenecks environment variables on a PID level are necessary.

You only need PID level env variables if you run the identical executable as multiple processes on the same machine at the same time. That's how I understood the PR description, but here you tell a slightly different story ("processes, executing different binaries").

Because you can already distinguish between different hosts and between different executables from the profiling data.

Can you clarify? Are you running the same executable as several processes in parallel on a machine?

@florianl
Copy link
Contributor

Per-trace-event-PID (partial)environment collection is time-sensitive and, if reliability is required, needs to be synchronized in the same manner as the other per-trace-event-PID bits of metadata the agent collects.

If there is a need for more and custom time critical collection of PID information, then package processmanager should provide a hook, that is triggered on func (*ProcessManager) SynchronizeProcess(process.Process). In combination with reporter.ExtraSampleAttrProd this would allow catching, updating and reporting such custom information.

@RononDex
Copy link
Author

Yes, often the same executable will be started hundreds if not even thousands of times with different parameters for the same job, this can happen on the same machine or be distributed over several machines. It depends on the how the HPC cluster is setup.

Maybe to provide a use case: The ESA Euclid telescope gathers 600MP photos often containing thousands of galaxies, for each of those galaxies a new process is then kicked off to do calculations (very simply put). This can end up with thousands of processes being started with the same executable.

Also different processing pipelines running at the same time may use the identical executables (for different patches of the sky for example, each patch being a processing job, each again kicking off thousands of new processes).

@fabled
Copy link
Contributor

fabled commented Jan 28, 2025

Do note that in some scenarios environment may contain "secret" tokens and it probably is not good to send full unfiltered environment. I believe some users will be very sensitive on what data is collected and sent out. So it probably would be better to keep this off by default, and enable it by giving a whitelist of the environment variables that will be sent out. That should also limit the amount of data that gets sent out.

@rockdaboot
Copy link
Contributor

Do note that in some scenarios environment may contain "secret" tokens and it probably is not good to send full unfiltered environment. I believe some users will be very sensitive on what data is collected and sent out. So it probably would be better to keep this off by default, and enable it by giving a whitelist of the environment variables that will be sent out. That should also limit the amount of data that gets sent out.

That is what the PR does. It filters for a given set of variable names. If that list is empty, no data would be sent.

This ensures that only 2 substrings are returned (in case the value has
a "=" too)
@brancz
Copy link
Contributor

brancz commented Jan 29, 2025

Happy to take this conversation elsewhere (eg. Parca Discord), but wanted to chime in and say that Parca is in no way proprietary or needs a license for anything (everything under the Parca umbrella is Apache2 license with the exception of some of the eBPF bits which are GPL). Would love to understand what makes it seem like that's not the case and resolve that!

On a related note, Parca-Agent already does support adding a variety of metadata on a per-sample basis. See the metadata labels documentation, happy to help you put together a configuration that suits your needs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants