don't leave parts of the bootloader in the kernel's address space #239

Open
Freax13 opened this issue Jun 26, 2022 · 7 comments · May be fixed by #240

@Freax13
Member

Freax13 commented Jun 26, 2022

While implementing finer-grained ASLR I came across this comment:

used.entry_state[0] = true; // TODO: Can we do this dynamically?

We mark the first 512GiB of the address space as unusable for dynamically generated addresses. I think we do this because we identity map the context switch code into kernel memory and this code most likely resides within the first 512GiB of the address space:
// identity-map context switch function, so that we don't get an immediate pagefault
// after switching the active page table
let context_switch_function = PhysAddr::new(context_switch as *const () as u64);
let context_switch_function_start_frame: PhysFrame =
    PhysFrame::containing_address(context_switch_function);
for frame in PhysFrame::range_inclusive(
    context_switch_function_start_frame,
    context_switch_function_start_frame + 1,
) {
    match unsafe {
        kernel_page_table.identity_map(frame, PageTableFlags::PRESENT, frame_allocator)
    } {
        Ok(tlb) => tlb.flush(),
        Err(err) => panic!("failed to identity map frame {:?}: {:?}", frame, err),
    }
}
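
A minimal sketch of where the 512GiB figure comes from, assuming x86_64 4-level paging with 4KiB pages. The helper `p4_index` and the constant `P4_ENTRY_SIZE` are hypothetical names used only for illustration; they are not taken from the bootloader. Each level-4 entry covers 2^39 bytes, so reserving entry 0 reserves every virtual address below 512GiB:

```rust
/// Bytes covered by one level-4 page table entry:
/// 512 (P3) * 512 (P2) * 512 (P1) * 4 KiB = 2^39 bytes = 512 GiB.
const P4_ENTRY_SIZE: u64 = 1 << 39;

/// Index of the level-4 entry that translates `virt_addr`
/// (bits 39..=47 of the virtual address).
fn p4_index(virt_addr: u64) -> usize {
    ((virt_addr >> 39) & 0x1ff) as usize
}
```

Under these assumptions, any identity-mapped physical address below 512GiB falls into level-4 entry 0, which is presumably why that whole entry is marked as used.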

This causes a number of (admittedly small and unlikely) problems:

  • The identity mapped pages could overlap with the kernel or other mappings
  • We don't expose the identity mapped addresses to the kernel in Mappings
  • An attacker could make use of the identity mapped pages to defeat ASLR
  • We mark so much usable memory as unusable that we can't check for overlaps: there would be a lot of false positives. We currently just ignore overlaps.
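
To illustrate the last point, here is a rough sketch of an overlap check on half-open virtual address ranges. `VirtRange` and `overlaps` are made-up names for this example, not existing bootloader types. With a coarse reservation of the entire first 512GiB, any kernel mapped in the lower half is reported as overlapping, whereas a reservation of just the pages that are actually mapped would not be:

```rust
/// Half-open virtual address range `[start, end)`.
/// Hypothetical helper type for illustration only.
#[derive(Clone, Copy, Debug)]
struct VirtRange {
    start: u64,
    end: u64,
}

/// Two half-open ranges overlap iff each one starts before the other ends.
fn overlaps(a: VirtRange, b: VirtRange) -> bool {
    a.start < b.end && b.start < a.end
}
```

For example, a kernel at the made-up range `0x20_0000..0x30_0000` overlaps the coarse reservation `0..(1 << 39)`, but not a precise two-page reservation such as `0x1_0000..0x1_2000`.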

We could probably work around those problems while still mapping parts of the bootloader into the kernel's address space, but I'd like to propose another solution: We use another very short lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entrypoint is just after the page table switch instruction, so we don't have to use any code to jump to the kernel, it would simply be the next instruction.
I don't think we could reliably map such code into the bootloader's address space, because we'd have to map the code just before the kernel's entrypoint, which could be close to the bootloader's code. That's why I want to use a short-lived page table.

We also identity map a GDT into the kernel's address space:

// create, load, and identity-map GDT (required for working `iretq`)
let gdt_frame = frame_allocator
    .allocate_frame()
    .expect("failed to allocate GDT frame");
gdt::create_and_load(gdt_frame);
match unsafe {
    kernel_page_table.identity_map(gdt_frame, PageTableFlags::PRESENT, frame_allocator)
} {
    Ok(tlb) => tlb.flush(),
    Err(err) => panic!("failed to identity map frame {:?}: {:?}", gdt_frame, err),
}

We should probably make the GDT's location configurable and expose it in Mappings.

I'd be happy to work on a PR for this.

@bjorn3
Contributor

bjorn3 commented Jun 26, 2022

Can't the kernel make a page table from scratch and simply not map this memory range to the bootloader? I would expect any kernel implementing KASLR or a userspace to build their page tables from scratch and not identity map anything. AFAIK only the physical memory map needs to be respected. The virtual memory mapping can vary freely as a kernel wishes.

@phil-opp
Member

> We use another very short lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entrypoint is just after the page table switch instruction, so we don't have to use any code to jump to the kernel, it would simply be the next instruction.

Interesting idea! However, AFAIK the kernel's entry point address can be at an arbitrary offset, e.g. in the middle of the .text section. So the memory before the entry point might already be used by other kernel code.

@Freax13
Member Author

Freax13 commented Jun 27, 2022

> Can't the kernel make a page table from scratch and simply not map this memory range to the bootloader? I would expect any kernel implementing KASLR or a userspace to build their page tables from scratch [...]

Well in theory a kernel could do anything that we do in stage 4, so yeah they could totally just create their own page tables, but I'd argue that we shouldn't expect kernels to do that. Personally, in my kernel, I copy and update the page table created by the bootloader, but never create a new page table completely from scratch, and it's been working great.

> [...] and not identity map anything.

That's exactly my point: none of the pages in the page table created by the bootloader are identity mapped, except for the context switch code and the GDT.

@Freax13
Member Author

Freax13 commented Jun 27, 2022

> > We use another very short lived page table to do the context switch. This page table would only map a few pages containing code that switches to the kernel's page table. Importantly, we would set the page table up in such a way that the kernel's entrypoint is just after the page table switch instruction, so we don't have to use any code to jump to the kernel, it would simply be the next instruction.

> Interesting idea! However, AFAIK the kernel's entry point address can be at an arbitrary offset, e.g. in the middle of the .text section. So the memory before the entry point might already be used by other kernel code.

The short-lived context switch page table wouldn't contain any entries from the kernel's page table. It would only contain the few entries needed to switch to the kernel's page table, so there's no way the two could overlap.

@phil-opp
Member

Ah, I think I understand what you mean now. Assume the kernel's entry point address is 0x2ec060. We would then map the context switch function in the temp page table in a way that it lives on the same virtual page as the entry point? We also offset it within the page so that the page table reload happens exactly at the instruction before 0x2ec060? Does this always work without violating any alignment requirements?

@Freax13
Member Author

Freax13 commented Jun 27, 2022

> Ah, I think I understand what you mean now. Assume the kernel's entry point address is 0x2ec060. We would then map the context switch function in the temp page table in a way that it lives on the same virtual page as the entry point? We also offset it within the page so that the page table reload happens exactly at the instruction before 0x2ec060?

Yes, except that instead of mapping the context switch function, we might just write the opcodes manually. I don't think we'll have to write many, and it's probably easier and more reliable than making the function work when placed at a different address.
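
A rough sketch of the address arithmetic this implies, assuming 4KiB pages and that only the final `mov cr3, rax` (encoded as `0f 22 d8`) has to end exactly at the entry point. `SwitchPlacement` and `place_switch` are hypothetical names, not existing bootloader code, and whatever instructions load `rax` beforehand are left out:

```rust
/// Page size assumed for the temporary mapping (4 KiB).
const PAGE_SIZE: u64 = 4096;

/// Machine code for `mov cr3, rax` (3 bytes).
const MOV_CR3_RAX: [u8; 3] = [0x0f, 0x22, 0xd8];

/// Where the switch instruction has to live in the temporary page table.
/// Hypothetical helper type for illustration only.
#[derive(Debug, PartialEq, Eq)]
struct SwitchPlacement {
    /// Virtual address of the first byte of `mov cr3, rax`.
    instr_addr: u64,
    /// Page-aligned virtual address of the page holding that first byte.
    first_page: u64,
    /// Offset of the first byte within `first_page`.
    offset_in_page: u64,
    /// Number of consecutive pages the instruction touches (1 or 2).
    pages: u64,
}

fn place_switch(entry_point: u64) -> SwitchPlacement {
    let len = MOV_CR3_RAX.len() as u64;
    // The instruction must end exactly where the kernel's entry point begins.
    // Wrapping arithmetic covers an entry point at address 0.
    let instr_addr = entry_point.wrapping_sub(len);
    let last_byte = entry_point.wrapping_sub(1);
    let first_page = instr_addr & !(PAGE_SIZE - 1);
    let last_page = last_byte & !(PAGE_SIZE - 1);
    SwitchPlacement {
        instr_addr,
        first_page,
        offset_in_page: instr_addr - first_page,
        // The 3 bytes straddle a page boundary when the entry point sits
        // 1 or 2 bytes into a page.
        pages: if first_page == last_page { 1 } else { 2 },
    }
}
```

For the entry point 0x2ec060 mentioned above, this places the instruction at 0x2ec05d, i.e. at offset 0x5d in the page at 0x2ec000.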

> Does this always work without violating any alignment requirements?

Almost. I'm not aware of any alignment requirements that could cause problems, but there's another problem: this won't work if the entrypoint is placed right after the address space gap. The instruction pointer will not automatically jump the gap, so this will cause a general protection fault (#GP). mov cr3, rax is a 3-byte instruction, so if the entrypoint is at 0xffff_8000_0000_0000, 0xffff_8000_0000_0001 or 0xffff_8000_0000_0002, this won't work. All other locations (including 0) should work fine though.
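
The check described here could be sketched as follows, assuming 48-bit virtual addresses (4-level paging, no 5-level paging). `is_canonical` and `switch_fits_before` are hypothetical helper names. The idea is that every byte of the 3-byte instruction directly preceding the entry point must lie at a canonical address:

```rust
/// Length in bytes of `mov cr3, rax`.
const MOV_CR3_RAX_LEN: u64 = 3;

/// True if `addr` is canonical for 48-bit virtual addresses, i.e. bits
/// 48..=63 are copies of bit 47.
fn is_canonical(addr: u64) -> bool {
    // Sign-extend from bit 47 and compare with the original value.
    (((addr << 16) as i64) >> 16) as u64 == addr
}

/// True if the 3 bytes directly before `entry_point` are all canonical,
/// so the switch instruction can be placed there without hitting the gap.
fn switch_fits_before(entry_point: u64) -> bool {
    (1..=MOV_CR3_RAX_LEN).all(|back| is_canonical(entry_point.wrapping_sub(back)))
}
```

Under these assumptions the function rejects exactly the three entry points listed above and accepts, for example, 0xffff_8000_0000_0003 and 0.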

@phil-opp
Member

> if the entrypoint is at 0xffff_8000_0000_0000, 0xffff_8000_0000_0001 or 0xffff_8000_0000_0002, this won't work

I don't think that there are kernels that link their .text section right at the lower/upper half boundary. So that should not be a problem.

> Yes, except that instead of mapping the context switch function, we might just write the opcodes manually. I don't think we'll have to write many, and it's probably easier and more reliable than making the function work when placed at a different address.

Sounds like it would be worth a try! So feel free to open a PR if you like, preferably against the next branch (I'm trying my best to finish the rewrite soon).

3 participants