Handle non-utf8 strings in protobuf structures #1282

kkourt · 2023-07-28T06:27:01Z

BPF/kernel strings are C strings, that is, null-terminated sequences of
bytes. They may or may not be valid utf-8 strings.

For example, execve() can accept invalid utf-8 strings. Similarly,
filenames are not required to be utf-8 encoded.

This becomes an issue because we define fields like Binary, Arguments,
and CWD as strings in the proto descriptions.

According to the protobuf spec:
"A string must always contain UTF-8 encoded or 7-bit ASCII text, and
cannot be longer than 2^32."
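
To make the constraint concrete, here is a minimal sketch of how one could check whether raw bytes coming from the kernel would be acceptable in a protobuf string field. It uses only the Go standard library; the helper name `isValidProtoString` is made up for illustration and is not part of Tetragon.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValidProtoString reports whether b is valid UTF-8 and could therefore
// be stored in a protobuf string field. Hypothetical helper, for illustration.
func isValidProtoString(b []byte) bool {
	return utf8.Valid(b)
}

func main() {
	// Raw bytes as they might arrive from execve() arguments or a filename.
	good := []byte("/usr/bin/ls")
	bad := []byte{'f', 'o', 'o', 0xff, 'b', 'a', 'r'} // 0xff never occurs in valid UTF-8

	fmt.Println(isValidProtoString(good))
	fmt.Println(isValidProtoString(bad))
}
```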

As a result, when passing non-utf8 data as strings, protobuf clients
(e.g., the JSON writer and the gRPC client) cannot handle the event. For
example, running tetra getevents and executing a program with invalid
utf8 arguments leads to the following error:
msg="Failed to receive events" error="rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8"

There are a number of different approaches we can use to address this
problem. One is that we can try to encode arbitrary bytes in the
string. In the past, we have used base64 to do that (specifically, for
bytes arguments). An alternative (better?) solution would be to quote
the string, using strconv.Quote() or similar.
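
As a rough comparison of the two lossless options mentioned above, the sketch below encodes the same invalid byte sequence with base64 and with strconv.Quote(), then decodes it again. The function names are illustrative only and do not come from the patch.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
)

// encodeBase64 returns a base64 form of raw; the result is always ASCII,
// but it is unreadable even when raw was ordinary text.
func encodeBase64(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// decodeBase64 reverses encodeBase64.
func decodeBase64(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}

// encodeQuoted returns a Go-quoted form of raw; valid text stays readable
// and invalid bytes become \x escapes.
func encodeQuoted(raw []byte) string {
	return strconv.Quote(string(raw))
}

// decodeQuoted reverses encodeQuoted.
func decodeQuoted(s string) ([]byte, error) {
	out, err := strconv.Unquote(s)
	return []byte(out), err
}

func main() {
	raw := []byte{'a', 0xff, 'b'} // not valid UTF-8

	fmt.Println(encodeBase64(raw))
	fmt.Println(encodeQuoted(raw))

	back, err := decodeQuoted(encodeQuoted(raw))
	fmt.Println(string(back) == string(raw), err)
}
```

Both encodings preserve every byte. The quoted form keeps valid input human-readable, at the cost of scanning and escaping each string.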

Quoting, however, means that we always need to parse the data and quote
it, which has a performance cost. To avoid this cost, this patch
uses strings.ToValidUTF8 instead, which has a small overhead (minimal in
the case where the data is already valid utf8), to replace invalid byte
sequences with "�". This means that we lose information (what the actual
bytes were), but we also keep backwards compatibility and remain close to
the existing behavior.
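
The behavior described above can be sketched with the standard library alone. The wrapper name `sanitize` is an assumption for illustration; the patch itself may structure this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize replaces each run of invalid UTF-8 bytes in s with U+FFFD.
// Hypothetical wrapper around strings.ToValidUTF8, for illustration.
func sanitize(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// Valid input is passed through unchanged.
	fmt.Println(sanitize("/usr/bin/ls") == "/usr/bin/ls")

	// Two different invalid inputs collapse to the same output,
	// which is the information loss mentioned above.
	one := sanitize("a\xffb")
	two := sanitize("a\xff\xfeb")
	fmt.Println(one == two)
}
```

Note that strings.ToValidUTF8 replaces a whole run of invalid bytes with a single replacement string, so the original byte count is not recoverable either.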

In the future, we can modify this behavior (e.g., via a command line
switch) to quote the arguments instead, so that all bytes are preserved
for non-utf8 data encoded as strings.

A more radical change (which seems like the right thing to do) is to
change the gRPC fields that are not actually strings to bytes, and let
the clients do the decoding.

Reported-by: Гаврилов Иван Сергеевич [email protected]
Signed-off-by: Kornilios Kourtis [email protected]

Ensure that protobuf strings are valid utf-8

The patch also adds a unit test to check for this case. Reported-by: Гаврилов Иван Сергеевич <[email protected]> Signed-off-by: Kornilios Kourtis <[email protected]>

handleGenericKprobeString returns "/" in case of an error. This is done to accommodate issues when traversing paths. This patch modifies handleGenericKprobeString so that it accepts a default value to return on error. The intention is to use it for other types, which will be done in the next patch. No functional changes. Signed-off-by: Kornilios Kourtis <[email protected]>

Use handleGenericKprobeString for GenericFilenameType and GenericStringType as well. The code is the same, so no functional changes. Signed-off-by: Kornilios Kourtis <[email protected]>

Kprobe arguments that include strings are generated using handleGenericKprobeString. Because protobufs only support strings that are valid utf-8, we need to ensure that the strings we get from bpf are valid utf-8. Also, ensure that Path in loader is valid utf-8. Signed-off-by: Kornilios Kourtis <[email protected]>
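
The combination of a caller-supplied default and UTF-8 sanitization could look roughly like the sketch below. This is not the actual handleGenericKprobeString, whose signature is not shown in this conversation; `stringOrDefault`, `errShortRead`, and the NUL-trimming step are assumptions made purely for illustration.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strings"
)

// errShortRead is a hypothetical error standing in for a failed read
// of string data from a BPF event.
var errShortRead = errors.New("short read")

// stringOrDefault returns defVal if reading the string failed. Otherwise it
// cuts raw at the first NUL byte (C string semantics) and replaces invalid
// UTF-8 so the result is safe for a protobuf string field.
// Hypothetical helper, not Tetragon code.
func stringOrDefault(raw []byte, readErr error, defVal string) string {
	if readErr != nil {
		return defVal
	}
	if i := bytes.IndexByte(raw, 0); i >= 0 {
		raw = raw[:i]
	}
	return strings.ToValidUTF8(string(raw), "\uFFFD")
}

func main() {
	// A path argument falls back to "/" on error, as described above.
	fmt.Printf("%q\n", stringOrDefault(nil, errShortRead, "/"))

	// A string argument with an invalid byte and trailing data after the NUL.
	raw := []byte("/tmp/\xff" + "file\x00junk")
	fmt.Printf("%q\n", stringOrDefault(raw, nil, ""))
}
```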

This code is seemingly not used. Remove it. Signed-off-by: Kornilios Kourtis <[email protected]> Suggested-by: Anastasios Papagiannis <[email protected]>

tpapagian

LGTM, thanks! ~~(assuming tests are green)~~

kkourt added 5 commits July 28, 2023 08:23

tracing: use handleGenericKprobe

0b15af6

grpc/tracing: remove unused type in HandleMessage

81290bd

kkourt requested a review from a team as a code owner July 28, 2023 06:27

kkourt requested a review from tpapagian July 28, 2023 06:27

kkourt changed the title ~~Pr/kkourt/nonutf8~~ handle non-utf8 strings in protobuf structures Jul 28, 2023

kkourt changed the title ~~handle non-utf8 strings in protobuf structures~~ Handle non-utf8 strings in protobuf structures Jul 28, 2023

tpapagian approved these changes Jul 28, 2023

kkourt merged commit eebb6ba into main Jul 28, 2023
21 checks passed

kkourt deleted the pr/kkourt/nonutf8 branch July 28, 2023 07:47

kkourt added the release-note/bug (This PR fixes an issue in a previous release of Tetragon.) and needs-backport/0.9 labels Jul 28, 2023

This was referenced Jul 28, 2023

v0.10 backports #1284

Closed

v0.10 backports #1285

Merged

v0.9 backports #1286

Merged

kkourt added backport-done/0.9 and removed needs-backport/0.9 labels Jul 28, 2023

kkourt mentioned this pull request Jul 31, 2023

Tetragon gRPC API returns "error="rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8" command terminated with exit code 1" #1275

Open
