Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kern: fix client-controlled kernel panic in kipc #1961

Merged
merged 1 commit into from
Jan 25, 2025

Conversation

cbiffle
Copy link
Collaborator

@cbiffle cbiffle commented Dec 26, 2024

It turns out it's possible to trigger a bounds check in kipc handling code by submitting an invalid IRQ number. This fixes that, causing the bounds check to be optimized out.

In several syscall paths (and one kipc) we manipulate interrupts without accepting an interrupt number from the client. In these cases, a bad interrupt number would indicate a problem in our dispatch table, so I've made them explicit panics. (They were panics before, just implicit ones in the array access.)

This causes a small increase in kernel text size (40-64 bytes), but kipc should not be able to crash the kernel!

@cbiffle cbiffle requested a review from hawkw December 26, 2024 04:22
Copy link
Member

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, great catch! I commented on some ways we could potentially golf the binary size impact down a bit but I don't really think any of them are actually worth pursuing...

// This can only fail if the IRQ number is out of range, which
// in this case would mean the hardware is conspiring against
// us. So ignore it to ensure we don't generate a bogus check.
disable_irq(irq_num, false).ok();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you prefer this over let _ because the latter is a bit of a footgun in other contexts, but there is a part of me that kind of wonders if

fn_returning_result().ok();

optimizes worse than

let _ = fn_returning_result();

because it's a function call, though hopefully LLVM can eliminate it...I might have to go check.

EDIT: After spending some time in Compiler Explorer, I have been proven definitively wrong about this, in fact, the opt-level=2 compiler is so smart about this that it correctly realizes that these are actually the same function and doesn't even generate two separate functions for them:

#[no_mangle]
#[inline(never)]
pub extern fn ignore_err_ok() {
    returns_result().ok();
    // pretend the function then does other stuff
    core::hint::black_box(1);
}

#[no_mangle]
#[inline(never)]
pub extern fn ignore_err_underscore() {
    let _ = returns_result();
    // pretend the function then does other stuff
    core::hint::black_box(1);
}

So, I have seen the error of my ways and should not have doubted the use of .ok() to ignore Results!.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always worth checking in cases like this! The best codegen changes over time, too, so the ground truth back when I developed this habit might be different from today. So, thanks for chasing it down.

Comment on lines +924 to +931
irqs.iter().try_fold(IrqStatus::empty(), |status, irq| {
crate::arch::irq_status(irq.0).map(|n| status | n)
})?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's nice how this was so trivially converted to fallible with try_fold! :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't reach for the fold family of functions often, but sometimes they're just the ticket!

pub fn pend_software_irq(InterruptNum(n): InterruptNum) {
pub fn pend_software_irq(
InterruptNum(n): InterruptNum,
) -> Result<(), UsageError> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to say "hmm, shouldn't we update the documentation for the software_irq KIPC to reflect that it will fault you if you ask for a nonexistent IRQ...but then, I realized that the Hubris reference docs for KIPCs doesn't even mention software_irq, and instead suggests that find_faulted_task is KIPC 8, which it isn't (it's KIPC 9, and software_irq is 8). This is almost certainly my fault back when I added the software IRQ KIPC, so I should probably fix that...

Comment on lines 1238 to 1274
unsafe {
nvic.icer[reg_num].write(bit_mask);
nvic.icer
.get(reg_num)
.ok_or(UsageError::NoIrq)?
.write(bit_mask);
}
if also_clear_pending {
unsafe {
nvic.icpr[reg_num].write(bit_mask);
nvic.icpr
.get(reg_num)
.ok_or(UsageError::NoIrq)?
.write(bit_mask);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we wanted to be really fastidious about code size, we could potentially do a single check that the IRQ number is in range, and then make both of the actual accessess unchecked; if rustc isn't smart enough to know that if it's in range for ICER, it's also in range for ICPR? but, I dunno if it's worth the more manual checking to save 40 bytes of kernel --- feels sketchier for sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I'd be curious how much that'd actually save -- probably both the error return paths for these expressions get merged into a single epilogue, but I could be wrong.

I feel like the result is potentially more fragile, though, so I hesitate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this is fine! It's just a thought that occurred to me.

Comment on lines +466 to +468
// Any error here would be a problem in our dispatch table, not the
// caller, so we panic because we want to hear about it.
crate::arch::pend_software_irq(irq).unwrap_lite();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm --- this is the only call site of pend_software_irq, right? So, if we're going to unwrap here, we could just leave that function using infallible array indexing and let it panic, instead. I dunno if that has a meaningful impact on binary size, though, and this is maybe nicer in the event that we add other uses of pend_software_irq later, which don't want to panic. I think having it be the caller's decision whether or not to panic is almost certainly the ideologically nicer factoring...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd want to provide callers with some way of checking if an IRQ is valid for the architecture then, since it's in the arch module. The current factoring makes it responsible for the check, using its architecture-specific knowledge, which I suspect results in simpler code at the use sites.

But as you say, there is only one use site atm, and I'm not sure how likely that is to change.

@hawkw
Copy link
Member

hawkw commented Dec 26, 2024

Incidentally --- and this is extremely pedantic of me --- I don't think the title/commit message message for this PR is quite right. The client-controlled panic is actually in the syscalls, rather than the software_irq KIPC. The KIPC implementation is the one that still unwrap_lite()s and will panic, as it gets the notification-mask-to-IRQ mappings from the kernel's dispatch table.

hawkw added a commit that referenced this pull request Dec 26, 2024
I [recently discovered][1] that the `software_irq` KIPC call is missing
reference documentation. In fact, the KIPC documentation doesn't even
mention it (in contrast to other KIPC calls that have no documentation
but are still listed in the reference). Instead, the reference
documentation currently incorrectly states that `find_faulted_task` is
KIPC number 8, when it's actually KIPC 9 and `software_irq` is number 8.

Not adding this KIPC to the docs was an oversight on my part in PR #1661
--- my bad, sorry!

This commit adds documentation for the `software_irq` KIPC to the
reference.

[1]: #1961 (comment)
hawkw added a commit that referenced this pull request Jan 15, 2025
I [recently discovered][1] that the `software_irq` KIPC call is missing
reference documentation. In fact, the KIPC documentation doesn't even
mention it (in contrast to other KIPC calls that have no documentation
but are still listed in the reference). Instead, the reference
documentation currently incorrectly states that `find_faulted_task` is
KIPC number 8, when it's actually KIPC 9 and `software_irq` is number 8.

Not adding this KIPC to the docs was an oversight on my part in PR #1661
--- my bad, sorry!

This commit adds documentation for the `software_irq` KIPC to the
reference.

[1]: #1961 (comment)
@cbiffle
Copy link
Collaborator Author

cbiffle commented Jan 16, 2025

Incidentally --- and this is extremely pedantic of me --- I don't think the title/commit message message for this PR is quite right. The client-controlled panic is actually in the syscalls, rather than the software_irq KIPC. The KIPC implementation is the one that still unwrap_lite()s and will panic, as it gets the notification-mask-to-IRQ mappings from the kernel's dispatch table.

I think we're just using the terms a little differently -- the client-controlled panic here is in the kernel implementation of the KIPC operation, which is called from the syscall dispatcher (like all KIPC implementations). So the panic is not available to tasks that don't use KIPC.

Is there a wording of that you'd find clearer?

@cbiffle cbiffle force-pushed the cbiffle/kipc-panic branch from d858fef to 42abf36 Compare January 16, 2025 20:01
@mkeeter
Copy link
Collaborator

mkeeter commented Jan 21, 2025

Should we add this to the kernel test suite?

@hawkw
Copy link
Member

hawkw commented Jan 21, 2025

Incidentally --- and this is extremely pedantic of me --- I don't think the title/commit message message for this PR is quite right. The client-controlled panic is actually in the syscalls, rather than the software_irq KIPC. The KIPC implementation is the one that still unwrap_lite()s and will panic, as it gets the notification-mask-to-IRQ mappings from the kernel's dispatch table.

I think we're just using the terms a little differently -- the client-controlled panic here is in the kernel implementation of the KIPC operation, which is called from the syscall dispatcher (like all KIPC implementations). So the panic is not available to tasks that don't use KIPC.

Is there a wording of that you'd find clearer?

Ah, I see what you mean, nevermind, the confusion is purely on my end here.

It turns out it's possible to trigger a bounds check in kipc handling
code by submitting an invalid IRQ number. This fixes that, causing the
bounds check to be optimized out.

In several syscall paths (and one kipc) we manipulate interrupts without
accepting an interrupt number from the client. In these cases, a bad
interrupt number would indicate a problem in our dispatch table, so I've
made them explicit panics. (They were panics before, just implicit ones
in the array access.)

This causes a small increase in kernel text size (40-64 bytes), but kipc
should not be able to crash the kernel!
@cbiffle cbiffle force-pushed the cbiffle/kipc-panic branch from 42abf36 to 8ada44e Compare January 25, 2025 02:22
@cbiffle cbiffle enabled auto-merge (rebase) January 25, 2025 02:22
@cbiffle cbiffle merged commit 88617a2 into oxidecomputer:master Jan 25, 2025
125 checks passed
@cbiffle cbiffle deleted the cbiffle/kipc-panic branch January 25, 2025 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants