Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error handling improvements #159

Closed
javierhonduco opened this issue Apr 28, 2022 · 2 comments · Fixed by #367
Closed

Error handling improvements #159

javierhonduco opened this issue Apr 28, 2022 · 2 comments · Fixed by #367
Labels
bug Something isn't working

Comments

@javierhonduco
Copy link
Contributor

javierhonduco commented Apr 28, 2022

👋 . Been using libbpfgo recently and found some problems with the way errors are being handled. Would love to hear your opinion and comments on this subject 😄 .

Context

While helping @kakkoyun debug #147 to make our map retrieval and deletion path more efficient, we were puzzled as the kernel did not seem to return -EPERM in the code path we were executing.

This can be seen with this bpftrace script:

$ sudo ./bpfsnoop_delete_batch.sh `pidof parca-agent`
Attaching 4 probes...

return from BPF syscall
        syscall+40
        0xfffffffff000
        runtime.asmcgocall.abi0+124
 with -2
return from libbpf
        _cgo_3f570fc81fe2_C2func_bpf_map_lookup_and_delete_batch+1196
        runtime.asmcgocall.abi0+124
 with -1

As we can see here, the "raw" (without any massaging of the return code by libc) BPF syscall is returning -2 (ENOENT), while libbpf is seeing -1 (EPERM). Why is this happening?

libc's syscall(2), which libbpf calls here changes the return code to -1 on error, and sets errno to the actual return value of the "raw" syscall.

The syscall(2) manpage:

The return value is defined by the system call being invoked. In
general, a 0 return value indicates success. A -1 return value
indicates an error, and an error number is stored in errno.

🤔 So it seems we thought we were getting EPERM when in reality, it was "just" -1 indicating that the function failed (*) and that we had to check errno to see what the actual error is. Interpreting the return value as errno is not correct here.

As far as I understand, cgo calls have two return values, the return code, and errno (https://pkg.go.dev/cmd/cgo). In several of libbpfgo's calls to libbpf, only one return value is checked, which if I am not mistaken, it would be the return code of the function.

How this is affecting libbpfs APIs right now

Besides the batch APIs not returning the data correctly, even though we might have processed it just fine, this is also affecting some other APIs.

Let's take a look at this fragment extracted from our codebase. We want to delete keys from a map, but ignore if the key is non-existent. After taking a look at the bpf manpage we saw that ENOENT means exactly this, so let's ignore it and fail otherwise.

err := m.stacks.DeleteKey(unsafe.Pointer(&stackId))
if err != nil {
  if !errors.Is(err, syscall.ENOENT) {
    return fmt.Errorf("failed to delete stack trace: %w", err)
  }
}

Unfortunately, this code isn't correct and will return with an error, not just when the key doesn't exist. As DeleteKey is just returning one argument, the return value is -1 which indicates a generic failure. We are interpreting it as if it were an errno value (EPERM), which is not correct.

By reading errno and passing it to the callee, we would know which error this is and be able to handle it accordingly. A possible fix for DeleteKey could be:

func (b *BPFMap) DeleteKey(key unsafe.Pointer) error {
	ret, errC := C.bpf_map_delete_elem(b.fd, key)
	if ret != -1 {
		return fmt.Errorf("failed to get lookup key %d from map %s: %w", key, b.name, errC)
	}
	return nil
}

The road to Libbpf 1.0

Another possibility to fix this is to turn on LIBBPF_STRICT_DIRECT_ERRS (https://github.com/libbpf/libbpf/blob/b69f8ee93ef6aa3518f8fbfd9d1df6c2c84fd08f/src/libbpf_internal.h#L492) with libbpf_set_strict_mode as described in https://github.com/libbpf/libbpf/wiki/Libbpf-1.0-migration-guide.

While this might fix the problems we are experiencing, perhaps it would be a bit of a big change that could potentially break things, so careful testing should be done. For example, if we are currently checking for -1 for errors, we should check for negative values instead.

At the same time, at some point, the v1.0 migration will happen, so maybe it could be a good excuse to start migrating to the new behavior?

Related issues / PRs

Thanks!! Looking forward to your thoughts!

cc/ @kakkoyun @v-thakkar @Sylfrena

(*) There's one exception that I am aware of, batch APIs, which might return -1, but depending on the errno value we might have achieved a partial success. More context on #157.

@javierhonduco javierhonduco changed the title Error handling: read return error and errno Error handling improvements Apr 28, 2022
javierhonduco added a commit to javierhonduco/libbpfgo that referenced this issue Apr 28, 2022
javierhonduco added a commit to javierhonduco/libbpfgo that referenced this issue Apr 28, 2022
javierhonduco added a commit to javierhonduco/libbpfgo that referenced this issue Apr 28, 2022
javierhonduco added a commit to javierhonduco/libbpfgo that referenced this issue Apr 28, 2022
javierhonduco added a commit to javierhonduco/libbpfgo that referenced this issue Apr 28, 2022
javierhonduco pushed a commit to javierhonduco/parca-agent that referenced this issue May 5, 2022
This PR is based off [@kakkoyun's work](parca-dev#326)
to use libbpf(go)'s batch APIs.

**Context**
The main issue we found while working with this API was that they were erroring with `EPERM`.
After some debugging, we realised that libbpf wasn't handle with errors in the way we
expected.

The debugging write-up and more context can be found [here](aquasecurity/libbpfgo#159),
and the fix is in [this PR](aquasecurity/libbpfgo#157).

After landing these changes upstream, we pointed to the updated libbpfgo version, as well as added
some [regression tests](parca-dev#381) to ensure that libbpfgo
behaves as expected, and to make it easier in the future to write further compatibility tests.

Note: the rest of the batch APIs error handling is still unfixed. Tracking in aquasecurity/libbpfgo#162.
javierhonduco added a commit to javierhonduco/parca-agent that referenced this issue May 5, 2022
This PR is based off [@kakkoyun's work](parca-dev#326)
to use libbpf(go)'s batch APIs.

**Context**
The main issue we found while working with this API was that they were erroring with `EPERM`.
After some debugging, we realised that libbpf wasn't handle with errors in the way we
expected.

The debugging write-up and more context can be found [here](aquasecurity/libbpfgo#159),
and the fix is in [this PR](aquasecurity/libbpfgo#157).

After landing these changes upstream, we pointed to the updated libbpfgo version, as well as added
some [regression tests](parca-dev#381) to ensure that libbpfgo
behaves as expected, and to make it easier in the future to write further compatibility tests.

Note: the rest of the batch APIs error handling is still unfixed. Tracking in aquasecurity/libbpfgo#162.
javierhonduco added a commit to javierhonduco/parca-agent that referenced this issue May 5, 2022
This PR is based off [@kakkoyun's work](parca-dev#326)
to use libbpf(go)'s batch APIs.

**Context**
The main issue we found while working with this API was that they were erroring with `EPERM`.
After some debugging, we realised that libbpf wasn't handle with errors in the way we
expected.

The debugging write-up and more context can be found [here](aquasecurity/libbpfgo#159),
and the fix is in [this PR](aquasecurity/libbpfgo#157).

After landing these changes upstream, we pointed to the updated libbpfgo version, as well as added
some [regression tests](parca-dev#381) to ensure that libbpfgo
behaves as expected, and to make it easier in the future to write further compatibility tests.

Note: the rest of the batch APIs error handling is still unfixed. Tracking in aquasecurity/libbpfgo#162.
javierhonduco added a commit to javierhonduco/parca-agent that referenced this issue May 5, 2022
This PR is based off [@kakkoyun's work](parca-dev#326)
to use libbpf(go)'s batch APIs.

**Context**
The main issue we found while working with this API was that they were erroring with `EPERM`.
After some debugging, we realised that libbpf wasn't handle with errors in the way we
expected.

The debugging write-up and more context can be found [here](aquasecurity/libbpfgo#159),
and the fix is in [this PR](aquasecurity/libbpfgo#157).

After landing these changes upstream, we pointed to the updated libbpfgo version, as well as added
some [regression tests](parca-dev#381) to ensure that libbpfgo
behaves as expected, and to make it easier in the future to write further compatibility tests.

Note: the rest of the batch APIs error handling is still unfixed. Tracking in aquasecurity/libbpfgo#162.
javierhonduco added a commit to javierhonduco/parca-agent that referenced this issue May 5, 2022
This PR is based off [@kakkoyun's work](parca-dev#326) to use libbpf(go)'s batch APIs.

**Context**
The main issue we found while working with this API was that they were erroring with `EPERM`.
After some debugging, we realised that libbpf wasn't handling errors in the way we
expected.

The debugging write-up and more context can be found [here](aquasecurity/libbpfgo#159),
and the fix is in [this PR](aquasecurity/libbpfgo#157).

After landing these changes upstream, we pointed to the updated libbpfgo version, as well as added
some [regression tests](parca-dev#381) to ensure that libbpfgo
behaves as expected, and to make it easier in the future to write further compatibility tests.

Note: the rest of the batch APIs error handling is still unfixed. Tracking in aquasecurity/libbpfgo#162.
@geyslan
Copy link
Member

geyslan commented Aug 28, 2023

Unfortunately, this code isn't correct and will return with an error, not just when the key doesn't exist. As DeleteKey is just returning one argument, the return value is -1 which indicates a generic failure. We are interpreting it as if it were an errno value (EPERM), which is not correct.

@javierhonduco on my last efforts I've faced that when using GetMapInfoByFD():

libbpfgo/libbpfgo.go

Lines 418 to 426 in 99b7ef7

fd := bpfMap.FileDescriptor()
info, err := GetMapInfoByFD(fd)
if err != nil {
// Compatibility Note: Some older kernels lack BTF (BPF Type Format)
// support for specific BPF map types. In such scenarios, libbpf may
// fail (EPERM) when attempting to retrieve information for these maps.
// Reference: https://elixir.bootlin.com/linux/v5.15.75/source/tools/lib/bpf/gen_loader.c#L401
//
// However, we can still get some map info from the BPF map high level API.

As it was not the focus, I left it to analyse later. @josedonizetti I think we need to standardize all cgo returns by getting the errno value directly (avoiding re-converting the libbpf ret value). WDYT @javierhonduco?

@geyslan geyslan added the bug Something isn't working label Aug 28, 2023
@geyslan
Copy link
Member

geyslan commented Aug 28, 2023

Actually the EPERM that I faced is real. It was solved by raising capabilities. I'm going to fix that comment above (#371).

Regarding the error handling, we're based on libbpf 1.0 API already and it enforced the strict mode since libbpf/libbpf@c261131.

The current behaviour is:

https://github.com/libbpf/libbpf/blob/5a46421ad837e876197295844696884c8587852a/src/libbpf_internal.h#L490-L523

The err number is always being returned as the main CGO return but from libbpf_err_ptr() and libbpf_ptr() when we need to access the second value (exported errno) if first is NULL.

Screenshot from 2023-08-28 16-54-27

I've been trying to standardize that by multiple dependent PRs:

If they get merged, let's close this.

geyslan added a commit to geyslan/libbpfgo that referenced this issue Aug 28, 2023
On a GetMapInfoByFD() failure, the comment was wrong.

See: aquasecurity#159 (comment)
This was referenced Aug 28, 2023
geyslan added a commit that referenced this issue Aug 29, 2023
On a GetMapInfoByFD() failure, the comment was wrong.

See: #159 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants