tetragon: improve how we handle TIDs and GetProcessCopy() #1256

Merged
merged 3 commits into main from pr/tixxdz/use-tid-directly on Sep 20, 2023

Conversation

tixxdz
Member

@tixxdz tixxdz commented Jul 20, 2023

Goes on top of #1254

@tixxdz tixxdz requested a review from a team as a code owner July 20, 2023 08:33
@tixxdz tixxdz requested a review from tpapagian July 20, 2023 08:33
@tixxdz tixxdz marked this pull request as draft July 20, 2023 08:33
@tixxdz tixxdz force-pushed the pr/tixxdz/use-tid-directly branch 2 times, most recently from 367e603 to 0f6a5c5 Compare July 21, 2023 17:22
@tixxdz tixxdz marked this pull request as ready for review July 21, 2023 18:00
@tixxdz tixxdz requested review from jrfastab and olsajiri July 21, 2023 18:00
@jrfastab
Contributor

So the argument here is exit should always be associated with the same TID as the execve?

@tixxdz
Member Author

tixxdz commented Jul 25, 2023

So the argument here is exit should always be associated with the same TID as the execve?

Yes, this is the intention. To sum up: at execve TID == PID, since at our current (late) hook time there is only one single thread. Then:

We improved how we track the last thread (https://github.com/cilium/tetragon/blob/main/bpf/process/bpf_exit.c#L10), so technically we could report TID != PID (TGID) during exit.

But we still report TID==PID: https://github.com/cilium/tetragon/blob/main/bpf/process/bpf_exit.h#L59

So if you see some value in reporting TID != PID at exit, we can do that and drop the last patch of this PR.

TID != PID at exit has some value, but maybe only to expert users, and the tid field is new. On the other hand, if users start to correlate execve and exit events by PID and TID, it is better to have them equal, and that's it (simple).

For the related GetProcessCopy(): the long-term goal is that during a kprobe, if TID != PID, we internally copy only the shared fields in GetProcessCopy() and set all the rest at event time from BPF. This should reduce the data copied during GetProcessCopy(), but may add CPU cycles for decoding the non-shared fields, which we have to do anyway. We already have a partial answer here by masking some fields with the --enable-process-cred and --enable-process-ns flags, so we do not decode or report everything.

So here I will start incrementally isolating fields, making it clear in the code which fields are shared and which are not, and asserting that execve and exit events have equal shared fields; for other events we go case by case. This should improve process cache handling and make it easy for us to remember :-D
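As a minimal sketch of that idea (illustrative names, not Tetragon's actual API; it assumes the protobuf Process carries a Tid field): return the cached entry as-is for the thread leader, and clone it with the per-thread fields overridden for any other thread.

package sketch

import (
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"

	"github.com/cilium/tetragon/api/v1/tetragon"
)

// processForThread is a hypothetical helper: the cached process is reused
// untouched for the thread leader (TID == PID), and only copied, with the
// per-thread fields overridden, when the event comes from another thread.
func processForThread(cached *tetragon.Process, eventTID uint32) *tetragon.Process {
	if cached.GetPid().GetValue() == eventTID {
		// Main thread: the cached entry already has TID == PID.
		return cached
	}
	// Non-main thread: clone the shared fields, then set the per-thread ones.
	p := proto.Clone(cached).(*tetragon.Process)
	p.Tid = wrapperspb.UInt32(eventTID)
	return p
}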

@tixxdz tixxdz requested a review from kkourt July 25, 2023 08:18
@tixxdz
Member Author

tixxdz commented Jul 25, 2023

@kkourt @olsajiri opinion on above?

@jrfastab
Contributor

It's not clear to me how to create "shared" fields until we rip the protobuf structs out of the core code. I think we want to do this regardless, though.

Two review threads on pkg/grpc/exec/exec.go (outdated, resolved)
Contributor

@kkourt kkourt left a comment

So I guess our model is that ProcessInternal is immutable, and if it needs to be modified, we do a copy?

@jrfastab
Contributor

jrfastab commented Jul 27, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.
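As a rough sketch of that access pattern (a hypothetical stand-in type, reduced to the locking concern only): writers mutate the cached entry under its mutex, and anything handed to the JSON/gRPC writers is a copy.

package sketch

import (
	"sync"

	"google.golang.org/protobuf/proto"

	"github.com/cilium/tetragon/api/v1/tetragon"
)

// cachedProcess is a stand-in for ProcessInternal, reduced to the locking
// pattern discussed here: mutations happen under the mutex, and readers get
// a clone so gRPC/JSON writers never see a half-updated *tetragon.Process.
type cachedProcess struct {
	mu      sync.Mutex
	process *tetragon.Process
}

// updatePod mutates the cached entry, e.g. when pod association arrives late.
func (c *cachedProcess) updatePod(pod *tetragon.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.process.Pod = pod
}

// copyForExport returns a private copy that is safe to stream out.
func (c *cachedProcess) copyForExport() *tetragon.Process {
	c.mu.Lock()
	defer c.mu.Unlock()
	return proto.Clone(c.process).(*tetragon.Process)
}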

@jrfastab
Contributor

I merged patch 1 from the other PR while we discuss this.

@kkourt
Contributor

kkourt commented Jul 28, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.

Makes sense. It might be worth reviewing the code and verifying that things work as above. ProcessInternal does have a mu sync.Mutex, so all accesses need to be done with the mutex held.

@tixxdz tixxdz force-pushed the pr/tixxdz/use-tid-directly branch from 49cb54d to 53b9644 Compare July 29, 2023 13:36
@tixxdz
Member Author

tixxdz commented Jul 29, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.
And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.

Makes sense. It might be worth reviewing the code and verifying that things work as above. ProcessInternal does have a mu sync.Mutex, so all accesses need to be done with the mutex held.

@jrfastab @kkourt so that mutex is used to serialize:

  1. Retrying/updating processinternal.process.pod
  2. Getting a copy of processinternal.process to stream it to gRPC
  3. Annotating processinternal.process with namespaces and capabilities

So it is not for ProcessInternal itself but for the embedded protobuf process struct.

We have to do several things here:

  1. Add another simple process struct inside ProcessInternal and use it as the holder for the fields that are not shared between threads, instead of the protobuf one (https://github.com/cilium/tetragon/blob/main/pkg/process/process.go#L42).
  2. Always collect process credentials and namespaces during exec/clone and store them directly in this new simple form (caps as int64, ...) inside the process cache, getting rid of these protobuf capabilities and namespaces (https://github.com/cilium/tetragon/blob/main/pkg/process/process.go#L44) too. When exporting events we choose which fields we export (--enable-process-creds etc.) or provide better field filters.
  • Decoding events to protobuf only when exporting may add CPU overhead, since today we decode once and use the result for execve/exit, kprobes, etc. It may also still do unnecessary GetProcessCopy() on all fields, since we have to update the copies anyway for non-main threads (this is still in progress). So I think the schema below could work:

We may end up with:

// ProcessInternal is the internal representation of a process.
// nolint:revive // This is an acceptable case of "stuttering" since the name "Internal"
// wouldn't make much sense by itself.
type ProcessInternal struct {
	// mainThread is a simple internal representation holding the possibly
	// non-shared, per-thread information.
	mainThread processSimple

	// muproto protects the modifications to the cached proto process.
	muproto sync.Mutex
	// process is the externally visible, decoded protobuf process struct:
	// decode once and serve many.
	process *tetragon.Process

	// garbage collector metadata
	color  int
	refcnt uint32
}
  • The per-thread fields are stored in the mainThread object. When we handle an event (execve/exit, kprobe, tracepoint, etc.) we compare the thread ID: if it is the main thread we use the cached protobuf fields; if not, we determine what differs from the mainThread struct, decode those (non-shared) fields, and use them (a rough sketch of such a per-thread holder follows this comment).

This could make ProcessInternal somewhat immutable, as pointed out by @kkourt, until we start to hook the sensor call sites that may change those fields.
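A rough sketch of what such a simple per-thread holder could look like (field names are purely illustrative, not Tetragon's actual code): raw kernel values stored directly and decoded to the protobuf representation only when an event is exported.

package sketch

// processSimple is a hypothetical plain-Go holder for the per-thread fields
// discussed above: raw values as the kernel reports them, expanded to the
// protobuf representation only at export time.
type processSimple struct {
	pid uint32 // thread group leader id (TGID)
	tid uint32 // id of the thread that triggered the event

	uid uint32 // uid at event time

	// capability sets as raw 64-bit masks, expanded to names on export
	capPermitted   uint64
	capEffective   uint64
	capInheritable uint64

	// namespace inode numbers, expanded only on export
	mntNs  uint32
	pidNs  uint32
	netNs  uint32
	userNs uint32
}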

@tixxdz
Member Author

tixxdz commented Jul 29, 2023

issue here: #1290

@tixxdz
Member Author

tixxdz commented Jul 29, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

@jrfastab what are the scenarios where ProcessInternal can be mutable? Just making sure I didn't miss any. IIRC the matchCapabilitiesChange updates the BPF execve_map but not the userspace process cache, and that's part of kprobes, not a main sensor.

I think if we want to reflect this, we first have to hook the setuid() family by default as a main sensor too, update the credentials in the BPF execve_map, and then reflect this in the user-space process cache as well. We do it step by step, so we don't get tricked by some exploits, etc.

@olsajiri
Contributor

olsajiri commented Aug 1, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)

IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.
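Something along these lines (wording illustrative) could serve as that comment, e.g. next to the exec/exit handling code:

// TID handling as sent from the BPF side:
//
//   fork:               TID == PID (only the thread leader generates the event)
//   exec:               TID == PID (execve is handled on the main thread)
//   kprobe/tracepoint:  TID is the calling thread and may differ from PID;
//                       user space copies the cached process and sets this
//                       TID on the copy before reporting
//   exit:               TID == PID, so the event matches the exec event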

@tixxdz
Member Author

tixxdz commented Aug 1, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)

IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.

Indeed, will do.

@jrfastab
Contributor

jrfastab commented Aug 2, 2023

It's not clear to me what the created issue is about. In general, IMO we rely too heavily on protobufs in the lower layers of the stack, but that is about optimizing performance and the JSON encoder.

The mutex in ProcessInternal is there to serialize writers/readers of ProcessInternal. Somehow the 3 bullets above were lost on me.

@tixxdz
Member Author

tixxdz commented Aug 18, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)
IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

@olsajiri I documented this extensively in the last patch's commit log, please have a look.

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.

I agree that we should remove it at some point, but for now it is useful as it helps debug things and ensures our model is right. We are still not finished handling all threads; please take a look at the last patch's commit log and at this comment: #1256 (comment)

@netlify

netlify bot commented Aug 18, 2023

Deploy Preview for tetragon ready!

🔨 Latest commit: 9c43b9d
🔍 Latest deploy log: https://app.netlify.com/sites/tetragon/deploys/650b65e9e12a760008dc5dc2
😎 Deploy Preview: https://deploy-preview-1256--tetragon.netlify.app

@tixxdz
Member Author

tixxdz commented Aug 18, 2023

It's not clear to me what the created issue is about. In general, IMO we rely too heavily on protobufs in the lower layers of the stack, but that is about optimizing performance and the JSON encoder.

The mutex in ProcessInternal is there to serialize writers/readers of ProcessInternal. Somehow the 3 bullets above were lost on me.

@jrfastab fair enough, I updated the last patch with an extensive commit log about what needs to be fixed, and when we do those changes we will fix it if we have to.

Two review threads on pkg/grpc/exec/exec.go (resolved)
@tixxdz tixxdz added the release-note/minor This PR introduces a minor user-visible change label Sep 11, 2023
@tixxdz
Member Author

tixxdz commented Sep 11, 2023

OK @olsajiri, updated based on all comments, and yes, it turned into more work documenting all the stuff that is still missing, as you suggested. So basically we are not finished with the full per-thread transition, hence some extra debug and warning messages here and there.

Review threads on pkg/grpc/exec/exec.go and pkg/process/process.go (outdated, resolved)
@jrfastab
Contributor

I find this commit message to be confusing,

 We want to correlate the exit and exec events, for this the final TID
of the exit event _must matches_ the PID and TID of exec / clone events
since we handle all these by the main thread.

The tetragonEvent.Process of the exit event is constructed either:

  1. When looking up the process by its PID from user space cache
     then we get the process that was pushed during exec or clone.

  2. Out of order exec event, and this is same as (1).

For all these cases the TID of the exit should be _automatically set_ to
the TID and PID of the exec/clone event. To achieve this we ensure that
clone and exec events have their TID also set to PID.

* clone events must store the TID in the process cache directly and
  we ensure that it is the PID as we generate only one event that
  is the thread leader.

* exec event we ensure and automatically set TID to PID too and insert it
  into the process cache as it is the main thread doing the execve call.

However from bpf side:
We also ensure that exit events on their own must send bpf PID == TID,
if not let's print a warning that is useful to bisect things.

This helps to assert the different variants of tetragon and that
sensors or bpf part did not change on behalf of user space and cache
handling, as again we handle only one main thread in the bpf side and
also in process cache, and we still did not complete all the per-thread
fields transition.

My sort of understanding is something like,

Exit events should have TID==PID and the {TID,PID} in the exit event must match the {TID,PID} pair from the exec event. They must match because it's how we link the exit event to the exec event. If TID!=PID or the Exit{TID,PID} pair does not match an Exec{TID,PID} pair then there is a bug.

This patch adds extra logic to WARN on conditions where TID!=PID to aid debugging and catch this unexpected case. Typically this indicates a bug either in BPF or userspace caching logic. When this condition is encountered we attempt to correct it by setting the TID=PID.
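In effect, the described behaviour amounts to something like the following (a hypothetical helper, not the exact Tetragon code):

package sketch

import (
	"github.com/sirupsen/logrus"

	"github.com/cilium/tetragon/pkg/logger"
)

// normalizeTID warns when a clone/exec/exit event arrives with TID != PID
// and forces TID back to PID, since those events are generated by the
// thread leader and the process cache stores a single (leader) entry.
func normalizeTID(event, binary, execID string, pid, tid uint32) uint32 {
	if tid != pid {
		logger.GetLogger().WithFields(logrus.Fields{
			"event.name":            event,
			"event.process.pid":     pid,
			"event.process.tid":     tid,
			"event.process.binary":  binary,
			"event.process.exec_id": execID,
		}).Warn("process PID and TID mismatch")
		return pid
	}
	return tid
}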

"event.process.binary": binary,
"event.process.exec_id": execID,
"event.parent.exec_id": parentExecID,
}).Warn("ExecveEvent: process PID and TID mismatch")
Contributor

@jrfastab jrfastab Sep 20, 2023

These warnings need to be metric error counters so we can actually find them. No one will notice a rare and random warning in the logs. (Note I wouldn't block this PR on the metric implementation, but I do think we should get it in soon to catch this.)
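For example, a counter along these lines could back those warnings (a sketch using the standard Prometheus Go client; the metric name and label are illustrative, not Tetragon's actual metrics):

package sketch

import "github.com/prometheus/client_golang/prometheus"

// pidTidMismatch counts events whose reported TID did not match the PID,
// so the rare warning also shows up as a queryable error counter.
var pidTidMismatch = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "tetragon",
		Name:      "process_pid_tid_mismatch_total",
		Help:      "Number of events where the reported TID did not match the PID.",
	},
	[]string{"event"},
)

func init() {
	prometheus.MustRegister(pidTidMismatch)
}

// recordMismatch is called next to the existing warning log.
func recordMismatch(event string) {
	pidTidMismatch.WithLabelValues(event).Inc()
}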

Member Author

So all fixed except for adding the metrics; I will do that in a separate PR so I don't mess up this one ;-)

Review threads on bpf/process/bpf_exit.h and bpf/process/bpf_fork.c (outdated, resolved)
Contributor

@jrfastab jrfastab left a comment

My comments aside, this looks good to me. A few small cleanups and I think we can get this merged.

Exit events should have TID == PID; at the same time we want to correlate
the {TID,PID} of the exit event with the {TID,PID} pair from the exec
event. They must match because it's how we link the exit event to the
exec one.

If TID != PID or the Exit{TID,PID} pair does not match the Exec{TID,PID}
pair then this is a bug.

This patch adds extra logic to WARN on conditions where TID != PID to
aid debugging and catch this unexpected case. Typically this indicates
a bug either in BPF or in the userspace caching logic. When this condition
is encountered we attempt to correct it by setting TID = PID on
Clone and Exec events.

Signed-off-by: Djalal Harouni <[email protected]>
Remove the unnecessary GetProcessCopy() and use the previous
reference to the process. That reference, found in the process
cache, already has the TID set from when it was pushed during exec or clone.

Signed-off-by: Djalal Harouni <[email protected]>
Our current rules for how we collect thread IDs are:

From the BPF side we send both TID and PID, where:
- fork      TID == PID
- exec      TID == PID
- [ku]probe or tracepoint  TID could differ, as it is the per-thread ID.
- exit      TID == PID  => this is to match the exec event.

In user space we have one entry, which is the thread leader, collected
either during clone or exec:
- fork      TID == PID   (asserts TID == PID received from the BPF side)
- exec      TID == PID   ditto
- exit      TID == PID  => this is to match the exec event.
- [ku]probe or tracepoint  We make a copy of the process that is the
  thread leader in the process cache, then update its TID with the one
  recorded from the BPF side. The copy is needed so we don't corrupt
  gRPC handling.

Now, this is far from complete; future fixes should include:
- Record the capabilities and namespaces per thread on the BPF side for
  [ku]probes and tracepoints, and ensure we do not overwrite the fields of
  the thread leader that are in the execve_map on the BPF side or in the
  user-space process cache with the fields of another thread, as that
  cache contains only one thread, the leader. [Need to recheck sources]

- Also ensure that [ku]probe and tracepoint events use the per-thread
  capabilities and namespaces fields collected from the BPF side instead
  of the fields of the leader in the process cache, which were collected
  during exec or during match filters and not at the current time.

- Ensure that we always collect the thread leader's fields and cache them
  in our shadow state in the execve_map and the user-space process cache,
  regardless of --enable-process-creds and similar flags.

- When all of these are fixed and the model is clear, maybe by then we can
  remove the extra recording of TIDs from the BPF side during fork/clone,
  exec and exit, as we should by then have asserted our model.

Note that sending a TID that equals the PID on clone, exec and exit from
the BPF side, plus the assertion in user space, helps to catch errors in
other Tetragon variants that use the OSS version as a base with custom
sensors. The downside of this is that we send an extra 4 bytes from BPF,
which is also fine.

Signed-off-by: Djalal Harouni <[email protected]>
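For the second commit above, the change amounts to something like this sketch (hypothetical function, not the actual Tetragon code): since exit events carry TID == PID, the entry cached at exec/clone time already has the right TID, so the exit path can reuse that reference instead of cloning it.

package sketch

import "github.com/cilium/tetragon/api/v1/tetragon"

// exitProcess sketches the exit path after dropping the extra copy: the
// cached leader entry already has TID == PID, and nothing is mutated for
// exit, so the cached reference can be reused as-is.
func exitProcess(cached *tetragon.Process) *tetragon.Process {
	// before: return proto.Clone(cached).(*tetragon.Process)
	return cached
}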
@tixxdz
Member Author

tixxdz commented Sep 20, 2023

I find this commit message to be confusing,

 We want to correlate the exit and exec events, for this the final TID
of the exit event _must matches_ the PID and TID of exec / clone events
since we handle all these by the main thread.

The tetragonEvent.Process of the exit event is constructed either:

  1. When looking up the process by its PID from user space cache
     then we get the process that was pushed during exec or clone.

  2. Out of order exec event, and this is same as (1).

For all these cases the TID of the exit should be _automatically set_ to
the TID and PID of the exec/clone event. To achieve this we ensure that
clone and exec events have their TID also set to PID.

* clone events must store the TID in the process cache directly and
  we ensure that it is the PID as we generate only one event that
  is the thread leader.

* exec event we ensure and automatically set TID to PID too and insert it
  into the process cache as it is the main thread doing the execve call.

However from bpf side:
We also ensure that exit events on their own must send bpf PID == TID,
if not let's print a warning that is useful to bisect things.

This helps to assert the different variants of tetragon and that
sensors or bpf part did not change on behalf of user space and cache
handling, as again we handle only one main thread in the bpf side and
also in process cache, and we still did not complete all the per-thread
fields transition.

My sort of understanding is something like,

Exit events should have TID==PID and the {TID,PID} in the exit event must match the {TID,PID} pair from the exec event. They must match because it's how we link the exit event to the exec event. If TID!=PID or the Exit{TID,PID} pair does not match an Exec{TID,PID} pair then there is a bug.

This patch adds extra logic to WARN on conditions where TID!=PID to aid debugging and catch this unexpected case. Typically this indicates a bug either in BPF or userspace caching logic. When this condition is encountered we attempt to correct it by setting the TID=PID.

Thanks @jrfastab, I changed the commit log to what you suggested, and also inside the code ;-)

@tixxdz
Member Author

tixxdz commented Sep 20, 2023

My comments aside, this looks good to me. A few small cleanups and I think we can get this merged.

@olsajiri I did all your changes, plus the last requests from @jrfastab too! So I will merge this one if it's green ;-)

@tixxdz tixxdz merged commit 9b9ed8b into main Sep 20, 2023
31 checks passed
@tixxdz tixxdz deleted the pr/tixxdz/use-tid-directly branch September 20, 2023 22:49