tetragon: improve how we handle TIDs and GetProcessCopy() #1256

Merged
merged 3 commits into main from pr/tixxdz/use-tid-directly on Sep 20, 2023

Conversation

tixxdz
Member

@tixxdz tixxdz commented Jul 20, 2023

Goes on top of #1254

@tixxdz tixxdz requested a review from a team as a code owner July 20, 2023 08:33
@tixxdz tixxdz requested a review from tpapagian July 20, 2023 08:33
@tixxdz tixxdz marked this pull request as draft July 20, 2023 08:33
@tixxdz tixxdz force-pushed the pr/tixxdz/use-tid-directly branch 2 times, most recently from 367e603 to 0f6a5c5 Compare July 21, 2023 17:22
@tixxdz tixxdz marked this pull request as ready for review July 21, 2023 18:00
@tixxdz tixxdz requested review from jrfastab and olsajiri July 21, 2023 18:00
@jrfastab
Contributor

So the argument here is exit should always be associated with the same TID as the execve?

@tixxdz
Member Author

tixxdz commented Jul 25, 2023

So the argument here is exit should always be associated with the same TID as the execve?

Yes, this is the intention. To sum up: at execve TID == PID, since at our current (late) hook time there is only one single thread. Then:

We improved how we track the last thread (https://github.com/cilium/tetragon/blob/main/bpf/process/bpf_exit.c#L10), so technically we could report TID != PID (TGID) during exit.

But we still report TID==PID: https://github.com/cilium/tetragon/blob/main/bpf/process/bpf_exit.h#L59

So if you see some value in reporting TID != PID at exit, we can do that and drop the last patch of this PR.

TID != PID at exit has some value, but maybe only to expert users, and the tid field is new. On the other hand, if users start to correlate execve and exit events by PID and TID, it is better to have them equal, and that's it (simple).

For the related GetProcessCopy(): the long-term goal is that during a kprobe, if TID != PID, we internally copy only the shared fields in GetProcessCopy() and set all the rest at event time from BPF. This should reduce the data copied during GetProcessCopy(), but may add CPU cycles for decoding the non-shared fields, which we have to do anyway. We already have a partial answer here by masking some fields with the --enable-process-cred and --enable-process-ns flags, so we do not decode or report everything.

So here I will start incrementally isolating fields, making it clear in the code which fields are shared and which are not, and asserting that execve and exit events have equal shared fields; for other events we go case by case. This should improve process cache handling and make it easy for us to remember :-D
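As a minimal sketch of that idea (illustrative names, not Tetragon's actual API; it assumes the protobuf Process carries a Tid field): return the cached entry as-is for the thread leader, and clone it with the per-thread fields overridden for any other thread.

package sketch

import (
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/wrapperspb"

	"github.com/cilium/tetragon/api/v1/tetragon"
)

// processForThread is a hypothetical helper: the cached process is reused
// untouched for the thread leader (TID == PID), and only copied, with the
// per-thread fields overridden, when the event comes from another thread.
func processForThread(cached *tetragon.Process, eventTID uint32) *tetragon.Process {
	if cached.GetPid().GetValue() == eventTID {
		// Main thread: the cached entry already has TID == PID.
		return cached
	}
	// Non-main thread: clone the shared fields, then set the per-thread ones.
	p := proto.Clone(cached).(*tetragon.Process)
	p.Tid = wrapperspb.UInt32(eventTID)
	return p
}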

@tixxdz tixxdz requested a review from kkourt July 25, 2023 08:18
@tixxdz
Member Author

tixxdz commented Jul 25, 2023

@kkourt @olsajiri opinion on above?

@jrfastab
Contributor

It's not clear to me how to create "shared" fields until we rip the protobuf structs out of the core code. I think we want to do this regardless, though.

Two review threads on pkg/grpc/exec/exec.go (outdated, resolved)
Contributor

@kkourt kkourt left a comment

So I guess our model is that ProcessInternal is immutable, and if it needs to be modified, we do a copy?

@jrfastab
Contributor

jrfastab commented Jul 27, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.
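As a rough sketch of that access pattern (a hypothetical stand-in type, reduced to the locking concern only): writers mutate the cached entry under its mutex, and anything handed to the JSON/gRPC writers is a copy.

package sketch

import (
	"sync"

	"google.golang.org/protobuf/proto"

	"github.com/cilium/tetragon/api/v1/tetragon"
)

// cachedProcess is a stand-in for ProcessInternal, reduced to the locking
// pattern discussed here: mutations happen under the mutex, and readers get
// a clone so gRPC/JSON writers never see a half-updated *tetragon.Process.
type cachedProcess struct {
	mu      sync.Mutex
	process *tetragon.Process
}

// updatePod mutates the cached entry, e.g. when pod association arrives late.
func (c *cachedProcess) updatePod(pod *tetragon.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.process.Pod = pod
}

// copyForExport returns a private copy that is safe to stream out.
func (c *cachedProcess) copyForExport() *tetragon.Process {
	c.mu.Lock()
	defer c.mu.Unlock()
	return proto.Clone(c.process).(*tetragon.Process)
}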

@jrfastab
Contributor

I merged patch 1 from the other PR while we discuss this.

@kkourt
Contributor

kkourt commented Jul 28, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.

Makes sense. It might be worth reviewing the code and verifying that things work as above. ProcessInternal does have a mu sync.Mutex, so all accesses need to be done with the mutex held.

@tixxdz tixxdz force-pushed the pr/tixxdz/use-tid-directly branch from 49cb54d to 53b9644 Compare July 29, 2023 13:36
@tixxdz
Member Author

tixxdz commented Jul 29, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.
And (not sure we do this now) ProcessInternal really shouldn't be written/read without some atomic, read-once, or locking primitive, because multiple goroutines might be working on it. If it is a new object not yet pushed into a cache, it is safe, because only a single goroutine has access to it.

Makes sense. It might be worth reviewing the code and verifying that things work as above. ProcessInternal does have a mu sync.Mutex, so all accesses need to be done with the mutex held.

@jrfastab @kkourt so that mutex is used to serialize:

  1. Retrying/updating processinternal.process.pod
  2. Getting a copy of processinternal.process to stream it to gRPC
  3. Annotating processinternal.process with namespaces and capabilities

So it is not for ProcessInternal itself but for the embedded protobuf process struct.

We have to do several things here:

  1. Add another simple process struct inside ProcessInternal and use it as the holder for the fields that are not shared between threads, instead of the protobuf one (https://github.com/cilium/tetragon/blob/main/pkg/process/process.go#L42).
  2. Always collect process credentials and namespaces during exec/clone and store them directly in this new simple form (caps as int64, ...) inside the process cache, getting rid of these protobuf capabilities and namespaces (https://github.com/cilium/tetragon/blob/main/pkg/process/process.go#L44) too. When exporting events we choose which fields we export (--enable-process-creds etc.) or provide better field filters.
  • Decoding events to protobuf only when exporting may add CPU overhead, since today we decode once and use the result for execve/exit, kprobes, etc. It may also still do unnecessary GetProcessCopy() on all fields, since we have to update the copies anyway for non-main threads (this is still in progress). So I think the schema below could work:

We may end up with:

// ProcessInternal is the internal representation of a process.
// nolint:revive // This is an acceptable case of "stuttering" since the name "Internal"
// wouldn't make much sense by itself.
type ProcessInternal struct {
	// mainThread is a simple internal representation holding the possibly
	// non-shared, per-thread information.
	mainThread processSimple

	// muproto protects the modifications to the cached proto process.
	muproto sync.Mutex
	// process is the externally visible, decoded protobuf process struct:
	// decode once and serve many.
	process *tetragon.Process

	// garbage collector metadata
	color  int
	refcnt uint32
}
  • The per-thread fields are stored in the mainThread object. When we handle an event (execve/exit, kprobe, tracepoint, etc.) we compare the thread ID: if it is the main thread we use the cached protobuf fields; if not, we determine what differs from the mainThread struct, decode those (non-shared) fields, and use them (a rough sketch of such a per-thread holder follows this comment).

This could make ProcessInternal somewhat immutable, as pointed out by @kkourt, until we start to hook the sensor call sites that may change those fields.
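A rough sketch of what such a simple per-thread holder could look like (field names are purely illustrative, not Tetragon's actual code): raw kernel values stored directly and decoded to the protobuf representation only when an event is exported.

package sketch

// processSimple is a hypothetical plain-Go holder for the per-thread fields
// discussed above: raw values as the kernel reports them, expanded to the
// protobuf representation only at export time.
type processSimple struct {
	pid uint32 // thread group leader id (TGID)
	tid uint32 // id of the thread that triggered the event

	uid uint32 // uid at event time

	// capability sets as raw 64-bit masks, expanded to names on export
	capPermitted   uint64
	capEffective   uint64
	capInheritable uint64

	// namespace inode numbers, expanded only on export
	mntNs  uint32
	pidNs  uint32
	netNs  uint32
	userNs uint32
}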

@tixxdz
Member Author

tixxdz commented Jul 29, 2023

issue here: #1290

@tixxdz
Member Author

tixxdz commented Jul 29, 2023

@kkourt ProcessInternal can be mutable, but *process cannot be changed while it is being handled by the JSON/gRPC writers, or else we risk corrupting the output.

@jrfastab what are the scenarios where ProcessInternal can be mutable? Just making sure I didn't miss any. IIRC the matchCapabilitiesChange updates the BPF execve_map but not the userspace process cache, and that's part of kprobes, not a main sensor.

I think if we want to reflect this, we first have to hook the setuid() family by default as a main sensor too, update the credentials in the BPF execve_map, and then reflect this in the user-space process cache as well. We do it step by step, so we don't get tricked by some exploits, etc.

@olsajiri
Contributor

olsajiri commented Aug 1, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)

IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.
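Something along these lines (wording illustrative) could serve as that comment, e.g. next to the exec/exit handling code:

// TID handling as sent from the BPF side:
//
//   fork:               TID == PID (only the thread leader generates the event)
//   exec:               TID == PID (execve is handled on the main thread)
//   kprobe/tracepoint:  TID is the calling thread and may differ from PID;
//                       user space copies the cached process and sets this
//                       TID on the copy before reporting
//   exit:               TID == PID, so the event matches the exec event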

@tixxdz
Member Author

tixxdz commented Aug 1, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)

IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.

Indeed, will do.

@jrfastab
Contributor

jrfastab commented Aug 2, 2023

It's not clear to me what the created issue is about. In general, IMO we rely too heavily on protobufs in the lower layers of the stack, but that is about optimizing performance and the JSON encoder.

The mutex in ProcessInternal is there to serialize writers/readers of ProcessInternal. Somehow the 3 bullets above were lost on me.

@tixxdz
Member Author

tixxdz commented Aug 18, 2023

I'm probably missing something, but I'm getting a bit lost on how the TID works now ;-)
IIRC we have the following event flow:

- fork      (TID == PID)
- exec      (TID == PID)
- [ku]probe or tracepoint (we record TID and create extra copy of Process in user space with this TID for report)
- exit      (TID == PID)

Could we have a comment like that somewhere explaining the current situation?

@olsajiri I documented this extensively in the last patch's commit log, please have a look.

Also, maybe we could remove some of the tid fields from the kernel->user events that get initialized as pids anyway; it seems confusing.

I agree that we should remove it at some point, but for now it is useful as it helps debug things and ensures our model is right. We are still not finished handling all threads; please take a look at the last patch's commit log and at this comment: #1256 (comment)

@netlify

netlify bot commented Aug 18, 2023

Deploy Preview for tetragon ready!

🔨 Latest commit: 9c43b9d
🔍 Latest deploy log: https://app.netlify.com/sites/tetragon/deploys/650b65e9e12a760008dc5dc2
😎 Deploy Preview: https://deploy-preview-1256--tetragon.netlify.app

@tixxdz
Member Author

tixxdz commented Aug 18, 2023

It's not clear to me what the created issue is about. In general, IMO we rely too heavily on protobufs in the lower layers of the stack, but that is about optimizing performance and the JSON encoder.

The mutex in ProcessInternal is there to serialize writers/readers of ProcessInternal. Somehow the 3 bullets above were lost on me.

@jrfastab fair enough, I updated the last patch with an extensive commit log about what needs to be fixed, and when we do those changes we will fix it if we have to.

Two review threads on pkg/grpc/exec/exec.go (resolved)
@tixxdz tixxdz added the release-note/minor This PR introduces a minor user-visible change label Sep 11, 2023
@tixxdz
Member Author

tixxdz commented Sep 11, 2023

OK @olsajiri, updated based on all comments, and yes, it turned into more work documenting all the stuff that is still missing, as you suggested. So basically we are not finished with the full per-thread transition, hence some extra debug and warning messages here and there.

Review threads on pkg/grpc/exec/exec.go and pkg/process/process.go (outdated, resolved)
@jrfastab
Contributor

I find this commit message to be confusing,

 We want to correlate the exit and exec events, for this the final TID
of the exit event _must matches_ the PID and TID of exec / clone events
since we handle all these by the main thread.

The tetragonEvent.Process of the exit event is constructed either:

  1. When looking up the process by its PID from user space cache
     then we get the process that was pushed during exec or clone.

  2. Out of order exec event, and this is same as (1).

For all these cases the TID of the exit should be _automatically set_ to
the TID and PID of the exec/clone event. To achieve this we ensure that
clone and exec events have their TID also set to PID.

* clone events must store the TID in the process cache directly and
  we ensure that it is the PID as we generate only one event that
  is the thread leader.

* exec event we ensure and automatically set TID to PID too and insert it
  into the process cache as it is the main thread doing the execve call.

However from bpf side:
We also ensure that exit events on their own must send bpf PID == TID,
if not let's print a warning that is useful to bisect things.

This helps to assert the different variants of tetragon and that
sensors or bpf part did not change on behalf of user space and cache
handling, as again we handle only one main thread in the bpf side and
also in process cache, and we still did not complete all the per-thread
fields transition.

My sort of understanding is something like,

Exit events should have TID==PID and the {TID,PID} in the exit event must match the {TID,PID} pair from the exec event. They must match because it's how we link the exit event to the exec event. If TID!=PID or the Exit{TID,PID} pair does not match an Exec{TID,PID} pair then there is a bug.

This patch adds extra logic to WARN on conditions where TID!=PID to aid debugging and catch this unexpected case. Typically this indicates a bug either in BPF or userspace caching logic. When this condition is encountered we attempt to correct it by setting the TID=PID.
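In effect, the described behaviour amounts to something like the following (a hypothetical helper, not the exact Tetragon code):

package sketch

import (
	"github.com/sirupsen/logrus"

	"github.com/cilium/tetragon/pkg/logger"
)

// normalizeTID warns when a clone/exec/exit event arrives with TID != PID
// and forces TID back to PID, since those events are generated by the
// thread leader and the process cache stores a single (leader) entry.
func normalizeTID(event, binary, execID string, pid, tid uint32) uint32 {
	if tid != pid {
		logger.GetLogger().WithFields(logrus.Fields{
			"event.name":            event,
			"event.process.pid":     pid,
			"event.process.tid":     tid,
			"event.process.binary":  binary,
			"event.process.exec_id": execID,
		}).Warn("process PID and TID mismatch")
		return pid
	}
	return tid
}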

"event.process.binary": binary,
"event.process.exec_id": execID,
"event.parent.exec_id": parentExecID,
}).Warn("ExecveEvent: process PID and TID mismatch")
Contributor

@jrfastab jrfastab Sep 20, 2023

These warnings need to be metric error counters so we can actually find them. No one will notice a rare and random warning in the logs. (Note I wouldn't block this PR on the metric implementation, but I do think we should get it in soon to catch this.)
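For example, a counter along these lines could back those warnings (a sketch using the standard Prometheus Go client; the metric name and label are illustrative, not Tetragon's actual metrics):

package sketch

import "github.com/prometheus/client_golang/prometheus"

// pidTidMismatch counts events whose reported TID did not match the PID,
// so the rare warning also shows up as a queryable error counter.
var pidTidMismatch = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "tetragon",
		Name:      "process_pid_tid_mismatch_total",
		Help:      "Number of events where the reported TID did not match the PID.",
	},
	[]string{"event"},
)

func init() {
	prometheus.MustRegister(pidTidMismatch)
}

// recordMismatch is called next to the existing warning log.
func recordMismatch(event string) {
	pidTidMismatch.WithLabelValues(event).Inc()
}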

Member Author

So all fixed except for adding the metrics; I will do that in a separate PR so I don't mess up this one ;-)

Review threads on bpf/process/bpf_exit.h and bpf/process/bpf_fork.c (outdated, resolved)
Contributor

@jrfastab jrfastab left a comment

My comments aside, this looks good to me. A few small cleanups and I think we can get this merged.

Exit events should have TID == PID; at the same time we want to correlate
the {TID,PID} of the exit event with the {TID,PID} pair from the exec
event. They must match because it's how we link the exit event to the
exec one.

If TID != PID or the Exit{TID,PID} pair does not match the Exec{TID,PID}
pair then this is a bug.

This patch adds extra logic to WARN on conditions where TID != PID to
aid debugging and catch this unexpected case. Typically this indicates
a bug either in BPF or in the userspace caching logic. When this condition
is encountered we attempt to correct it by setting TID = PID on
Clone and Exec events.

Signed-off-by: Djalal Harouni <[email protected]>
Remove the unnecessary GetProcessCopy() and use the previous
reference to the process. That reference, found in the process
cache, already has the TID set from when it was pushed during exec or clone.

Signed-off-by: Djalal Harouni <[email protected]>
Our current rules for how we collect thread IDs are:

From the BPF side we send both TID and PID, where:
- fork      TID == PID
- exec      TID == PID
- [ku]probe or tracepoint  TID could differ, as it is the per-thread ID.
- exit      TID == PID  => this is to match the exec event.

In user space we have one entry, which is the thread leader, collected
either during clone or exec:
- fork      TID == PID   (asserts TID == PID received from the BPF side)
- exec      TID == PID   ditto
- exit      TID == PID  => this is to match the exec event.
- [ku]probe or tracepoint  We make a copy of the process that is the
  thread leader in the process cache, then update its TID with the one
  recorded from the BPF side. The copy is needed so we don't corrupt
  gRPC handling.

Now, this is far from complete; future fixes should include:
- Record the capabilities and namespaces per thread on the BPF side for
  [ku]probes and tracepoints, and ensure we do not overwrite the fields of
  the thread leader that are in the execve_map on the BPF side or in the
  user-space process cache with the fields of another thread, as that
  cache contains only one thread, the leader. [Need to recheck sources]

- Also ensure that [ku]probe and tracepoint events use the per-thread
  capabilities and namespaces fields collected from the BPF side instead
  of the fields of the leader in the process cache, which were collected
  during exec or during match filters and not at the current time.

- Ensure that we always collect the thread leader's fields and cache them
  in our shadow state in the execve_map and the user-space process cache,
  regardless of --enable-process-creds and similar flags.

- When all of these are fixed and the model is clear, maybe by then we can
  remove the extra recording of TIDs from the BPF side during fork/clone,
  exec and exit, as we should by then have asserted our model.

Note that sending a TID that equals the PID on clone, exec and exit from
the BPF side, plus the assertion in user space, helps to catch errors in
other Tetragon variants that use the OSS version as a base with custom
sensors. The downside of this is that we send an extra 4 bytes from BPF,
which is also fine.

Signed-off-by: Djalal Harouni <[email protected]>
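For the second commit above, the change amounts to something like this sketch (hypothetical function, not the actual Tetragon code): since exit events carry TID == PID, the entry cached at exec/clone time already has the right TID, so the exit path can reuse that reference instead of cloning it.

package sketch

import "github.com/cilium/tetragon/api/v1/tetragon"

// exitProcess sketches the exit path after dropping the extra copy: the
// cached leader entry already has TID == PID, and nothing is mutated for
// exit, so the cached reference can be reused as-is.
func exitProcess(cached *tetragon.Process) *tetragon.Process {
	// before: return proto.Clone(cached).(*tetragon.Process)
	return cached
}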
@tixxdz
Member Author

tixxdz commented Sep 20, 2023

I find this commit message to be confusing,

 We want to correlate the exit and exec events, for this the final TID
of the exit event _must matches_ the PID and TID of exec / clone events
since we handle all these by the main thread.

The tetragonEvent.Process of the exit event is constructed either:

  1. When looking up the process by its PID from user space cache
     then we get the process that was pushed during exec or clone.

  2. Out of order exec event, and this is same as (1).

For all these cases the TID of the exit should be _automatically set_ to
the TID and PID of the exec/clone event. To achieve this we ensure that
clone and exec events have their TID also set to PID.

* clone events must store the TID in the process cache directly and
  we ensure that it is the PID as we generate only one event that
  is the thread leader.

* exec event we ensure and automatically set TID to PID too and insert it
  into the process cache as it is the main thread doing the execve call.

However from bpf side:
We also ensure that exit events on their own must send bpf PID == TID,
if not let's print a warning that is useful to bisect things.

This helps to assert the different variants of tetragon and that
sensors or bpf part did not change on behalf of user space and cache
handling, as again we handle only one main thread in the bpf side and
also in process cache, and we still did not complete all the per-thread
fields transition.

My sort of understanding is something like,

Exit events should have TID==PID and the {TID,PID} in the exit event must match the {TID,PID} pair from the exec event. They must match because it's how we link the exit event to the exec event. If TID!=PID or the Exit{TID,PID} pair does not match an Exec{TID,PID} pair then there is a bug.

This patch adds extra logic to WARN on conditions where TID!=PID to aid debugging and catch this unexpected case. Typically this indicates a bug either in BPF or userspace caching logic. When this condition is encountered we attempt to correct it by setting the TID=PID.

Thanks @jrfastab, I changed the commit log to what you suggested, and also inside the code ;-)

@tixxdz
Member Author

tixxdz commented Sep 20, 2023

My comments aside, this looks good to me. A few small cleanups and I think we can get this merged.

@olsajiri I did all your changes, plus the last requests from @jrfastab too! So I will merge this one if it's green ;-)

@tixxdz tixxdz merged commit 9b9ed8b into main Sep 20, 2023
31 checks passed
@tixxdz tixxdz deleted the pr/tixxdz/use-tid-directly branch September 20, 2023 22:49