Skip to content

10. Two Level Paging Setup

Jose edited this page Jun 23, 2021 · 17 revisions

In this chapter, we will turn theory into practice. To be specific, we will go step by step to enabling two-level paging in our system. A functioning virtual memory system is indispensable for any further development of the kernel.

Main References of This Chapter

Scan through them before going forth:

Memory Layout Explanation

If we go back and revise our booting code src/boot/boot.s and our linker script scripts/kernel.ld that glues all the code and data sections together, we can clearly figure out the memory layout of our kernel image right after booting:

The kernel image is loaded into raw physical memory and there is no virtual memory system yet, so the addresses are just physical addresses. Let's assume that we want to support upto 128MiB of physical memory. For simplicity, our paging system will only support pages of size 4KiB, and the physical memory is divided into frames of size 4KiB.

Our goal after setting up paging is to have the kernel's virtual address space identity-mapped to the beginning of the physical memory and reserve some fixed amount of memory for the kernel heap. The rest of the physical memory will be free frames for mapping user process pages.

Be sure to understand where is the kernel's booting stack, where is the kernel heap, and how do they correspond to our code ✭. We will explain where we put the kernel stack of each process in other chapters.

Spoiler: each process will be allocated a (perhaps fixed-size) kernel stack somewhere in the kernel heap, and that requires a functioning heap memory allocation algorithm.

Memory Allocation Preparations

Before we start making page tables, we must come up with some basic solutions to allow memory allocation on the kernel heap. In particular, we will need two things: a temporary kernel memory allocator kalloc_temp(), and a free frame bitmap frame_bitmap to track the free/in-use state of each physical frame.

Temporary Kernel Heap Allocator

Our linker script exposes a variable bss_end marking the end of the .bss section (the uninitialized static variables section, which also holds the 16KiB space for the booting stack) of our kernel image. With the address of bss_end, we can know that all the .text, .data, .rodata, and .bss sections reside in memory region [1MiB, &bss_end). Above &bss_end is the kernel heap, and is where we put the content of page tables, the frame bitmap, etc.

Since we do not have a heap memory allocation algorithm at this time, we will use a temporary solution kalloc_temp() which simply maintains a pointer kheap_curr and grows the pointer at each allocation. Code @ src/memory/paging.c:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#include "paging.h"

#include "../common/debug.h"
#include "../common/string.h"


/** Temporary solution: kernel heap bottom address. */
extern uint32_t bss_end;    /** Defined in linker script. */
uint32_t kheap_curr = (uint32_t) &bss_end;


/**
 * Auxiliary function for allocating (page-aligned) chunks of memory in the
 * kernel heap region that never gets freed.
 * 
 * Should only be used to allocate the kernel's page directory/tables and
 * the frames bitmap and other things before our actual heap allocation
 * algorithm setup.
 */
static uint32_t
_kalloc_temp(size_t size, bool page_align)
{
    /** If `page_align` is set, return an aligned address. */
    if (page_align && !ADDR_PAGE_ALIGNED(kheap_curr))
        kheap_curr = ADDR_PAGE_ROUND_UP(kheap_curr);

    /** If exceeds the 8MiB kernel memory boundary, panic. */
    if (kheap_curr + size > KMEM_MAX)
        error("kernel memory exceeds boundary");

    uint32_t temp = kheap_curr;
    kheap_curr += size;
    return temp;
}
// src/memory/paging.h

/** Extern resulted `kheap_curr` for heap allocator initialization. */
extern uint32_t kheap_curr;

This allows allocating bytes at the bottom of kernel heap, but it does not allow freeing any of the allocated bytes. Fortunately, everything we need to allocate in this chapter will never be freed because they are data structures that must exist throughout the kernel's lifetime. We will develop a complete heap memory allocation algorithm in the next chapter, after we have switched to the paging world.

Frame Allocation Bitmap

We also need a data structure for tracking which physical frames are free and which are in-use. The best match for this kind of task is a bitmap. We create an array of uint32_t's, where each integer element is essentially an array of 32 bits. Every bit tracks the in-use state of a physical frame: 0 means free and 1 means in use.

Add these implementations @ src/memory/paging.c:

/** Bitmap indicating free/used frames. */
static uint32_t *frame_bitmap;

/**
 * Helper functions for managing free physical frames, using a bitmap
 * data structure. Every bit indicates the free/used state of a corresponding
 * physical frame. Frame number one-one maps to bit index.
 */
#define BITMAP_OUTER_IDX(frame_num) (frame_num / 32)
#define BITMAP_INNER_IDX(frame_num) (frame_num % 32)

/** Set a frame as used. */
static inline void
frame_bitmap_set(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    frame_bitmap[outer_idx] |= (1 << inner_idx);
}

/** Clear a frame as free. */
static inline void
frame_bitmap_clear(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    frame_bitmap[outer_idx] &= ~(1 << inner_idx);
}

/** Returns true if a frame is in use, otherwise false. */
static inline bool
frame_bitmap_check(uint32_t frame_num)
{
    size_t outer_idx = BITMAP_OUTER_IDX(frame_num);
    size_t inner_idx = BITMAP_INNER_IDX(frame_num);
    return frame_bitmap[outer_idx] & (1 << inner_idx);
}

/**
 * Allocate a frame and mark as used. Returns the frame number of
 * the allocated frame, or panics if there is no free frame.
 */
static uint32_t
frame_bitmap_alloc(void)
{
    for (size_t i = 0; i < (NUM_FRAMES / 32); ++i) {
        if (frame_bitmap[i] == 0xFFFFFFFF)
            continue;
        for (size_t j = 0; j < 32; ++j) {
            if ((frame_bitmap[i] & (1 << j)) == 0) {
                /** Found a free frame. */
                uint32_t frame_num = i * 32 + j;
                frame_bitmap_set(frame_num);
                return i * 32 + j;
            }
        }
    }

    error("failed to find a free frame");
    return NUM_FRAMES;
}

Page Table Structures

Now, we can move on to define the page table entries structure (according to what the x86 32-bit MMU expects, see here). We will have a two-level page table.

  • Upper level is a 4KiB page directory, which contains 1024 page directory entries (PDEs), each 32bits long;
  • Every PDE, if valid, points to a level-2 4KiB page table, which contains 1024 page table entries (PTEs), each 32bits long;
  • Each PTE records the physical frame allocated for that page.

Given any 32-bit virtual address, it is easy to know how to locate its PTE:

32-bit VADDR := |31  PDE Index  22|21  PTE Index  12|11  Offset In Page  0|

Add these definitions @ src/memory/paging.h:

#ifndef PAGING_H
#define PAGING_H


#include <stdint.h>
#include <stdbool.h>


/** Number of physical frames available. Assume 128MiB physical memory. */
#define NUM_FRAMES 32768        /** 128MiB physical memory. */

/** Up to where is kernel memory, == the upper bound of kernel heap. */
#define KMEM_MAX 0x00800000     /** 8MiB reserved for the kernel. */

/** Assume 4KiB pages, not support any other sizes. */
#define PAGE_SIZE 4096

#define PTES_PER_PAGE 1024
#define PDES_PER_PAGE 1024


/**
 * Page table entry format, 32bits per entry. Order in struct
 * definition is from LSB -> MSB.
 * 
 * See https://wiki.osdev.org/Paging for the detailed definition.
 */
struct page_table_entry
{
    uint32_t present  :  1;     /** Set -> present in memory. */
    uint32_t writable :  1;     /** Set -> user writable. (read/write bit) */
    uint32_t user     :  1;     /** Set -> user accessible. */
    uint32_t unused0  :  2;     /** Unused 2 caching bits. */
    uint32_t accessed :  1;     /** Set -> accessed sinced mapped. */
    uint32_t dirty    :  1;     /** Set -> page has been written to. */
    uint32_t unused1  :  5;     /** Unused 5 misc bits. */
    uint32_t frame    : 20;     /** Physical frame number of the page. */
} __attribute__((packed));
typedef struct page_table_entry pte_t;

/**
 * Page directory entry format, 32bits per entry. Order in struct
 * definition is from LSB -> MSB.
 * 
 * See https://wiki.osdev.org/Paging for the detailed definition.
 */
struct page_directory_entry
{
    uint32_t present  :  1;     /** Set -> present in memory. */
    uint32_t writable :  1;     /** Set -> user writable. (read/write bit) */
    uint32_t user     :  1;     /** Set -> user accessible. */
    uint32_t unused0  :  2;     /** Unused 2 caching bits. */
    uint32_t accessed :  1;     /** Set -> accessed sinced mapped. */
    uint32_t unused1  :  1;     /** Unused bit. */
    uint32_t size     :  1;     /** 0 -> using 4KiB page size. */
    uint32_t unused2  :  4;     /** Unused 4 misc bits. */
    uint32_t frame    : 20;     /** Physical frame number of level-2 table. */
} __attribute__((packed));
typedef struct page_directory_entry pde_t;


#endif

It is also helpful to have some translation helper macros @ src/memory/paging.h:

/** Helper macros on addresses and page alignments. */
#define ADDR_PAGE_OFFSET(addr) (addr & 0x00000FFF)
#define ADDR_PAGE_NUMBER(addr) (addr >> 12)

#define ADDR_PDE_INDEX(addr) (ADDR_PAGE_NUMBER(addr) / 1024)
#define ADDR_PTE_INDEX(addr) (ADDR_PAGE_NUMBER(addr) % 1024)

#define ADDR_PAGE_ALIGNED(addr) (ADDR_PAGE_OFFSET(addr) == 0)

#define ADDR_PAGE_ROUND_DN(addr) (addr & 0xFFFFF000)
#define ADDR_PAGE_ROUND_UP(addr) (ADDR_PAGE_ROUND_DN(addr) + 0x1000)


/** Helper macro on getting the pointed-to address stored in an entry. */
#define ENTRY_FRAME_ADDR(entry) (entry.frame << 12)

The action of finding a PTE for a corresponding virtual address is often called walking the page directory/table. It is nice to have a function for doing that @ src/memory/paging.c:

/**
 * Walk a 2-level page table for a virtual address to locate its PTE.
 * If `alloc` is true, then when a level-2 table is needed but not
 * allocated yet, will perform the allocation.
 */
pte_t *
paging_walk_pgdir(pde_t *pgdir, uint32_t vaddr, bool alloc)
{
    size_t pde_idx = ADDR_PDE_INDEX(vaddr);
    size_t pte_idx = ADDR_PTE_INDEX(vaddr);

    /** If already has the level-2 table, return the correct PTE. */
    if (pgdir[pde_idx].present != 0) {
        pte_t *pgtab = (pte_t *) ENTRY_FRAME_ADDR((uint32_t) pgdir[pde_idx]);
        return &pgtab[pte_idx];
    }

    /**
     * Else, the level-2 table is not allocated yet. Do the allocation if
     * the alloc argument is set, otherwise return a NULL.
     */
    if (!alloc)
        return NULL;

    pte_t *pgtab = (pte_t *) _kalloc_temp(sizeof(pte_t) * PTES_PER_PAGE, true);
    memset(pgtab, 0, sizeof(pte_t) * PTES_PER_PAGE);

    pgdir[pde_idx].present = 1;
    pgdir[pde_idx].writable = 0;
    pgdir[pde_idx].user = 0;
    pgdir[pde_idx].frame = ADDR_PAGE_NUMBER((uint32_t) pgtab);

    return &pgtab[pte_idx];
}

Switching to the Paging World

With the mechanisms ready, we can now allocate space for the kernel's page directory, identity-map all the kernel-reserved memory pages, and finally, tell the underlying x86 hardware to switch to paging mode.

We group these into an initialization function to be called at booting @ src/memory/paging.c:

/**
 * Pointer to current active page directory, may be the kernel's page
 * directory or the current running process's page directory.
 */
static pde_t *active_pgdir;
static pde_t *kernel_pgdir;     /** Allocated at paging init. */


/** Switch the current page directory to the given one. */
void
paging_switch_pgdir(pde_t *pgdir)
{
    active_pgdir = pgdir;
    asm volatile ( "movl %0, %%cr3" : : "r" (pgdir) );
}

/** Initialize paging and switch to use paging. */
void
paging_init(void)
{
    /**
     * The frame bitmap also needs space, so allocate space for it in
     * our kernel heap. Clear it to zeros.
     */
    frame_bitmap = (uint32_t *) _kalloc_temp(NUM_FRAMES / 32, false);
    memset(frame_bitmap, 0, NUM_FRAMES / 32);

    /**
     * Allocate the one-page space for the kernel's page directory in
     * the kernel heap. All pages of page directory/tables must be
     * page-aligned.
     */
    kernel_pgdir = (pde_t *) _kalloc_temp(sizeof(pde_t) * PDES_PER_PAGE, true);
    memset(kernel_pgdir, 0, sizeof(pde_t) * PDES_PER_PAGE);
    active_pgdir = kernel_pgdir;

    /**
     * Identity-map the kernel's virtual address space to the physical
     * memory. This means we need to map all the allowed kernel physical
     * frames (from 0 -> KMEM_MAX) as its identity virtual address in
     * the kernel page table.
     */
    uint32_t addr = 0;
    while (addr < KMEM_MAX) {
        uint32_t frame_num = frame_bitmap_alloc();
        pte_t *pte = paging_walk_pgdir(kernel_pgdir, addr, true);

        /** Update the bits in this PTE. */
        pte->present = 1;
        pte->writable = 0;
        pte->user = 0;
        pte->frame = frame_num;

        addr += PAGE_SIZE;
    }

    /**
     * Register the page fault handler. This acation must be done before
     * we do the acatual switch towards using paging.
     */
    isr_register(INT_NO_PAGE_FAULT, page_fault_handler);
                 // 14, add macro definition in `src/interrupt/isr.h`

    /** Load the address of kernel page directory into CR3. */
    paging_switch_pgdir(kernel_pgdir);

    /**
     * Enable paging by setting the two proper bits of CR0:
     *   - PG bit (31): enable paging
     *   - PE bit (0): enable protected mode
     *   
     * We are not setting the WP bit, so the read/write bit of any PTE just
     * controls whether the page is user writable - in kernel priviledge any
     * page can be written.
     */
    uint32_t cr0;
    asm volatile ( "movl %%cr0, %0" : "=r" (cr0) : );
    cr0 |= 0x80000001;
    asm volatile ( "movl %0, %%cr0" : : "r" (cr0) );
}

Of course, don't forget to export the function definitions:

// src/memory/paging.h

void paging_init();

pte_t *paging_walk_pgdir(pde_t *pgdir, uint32_t vaddr, bool alloc);

void paging_switch_pgdir(pde_t *pgdir);

Note the order in which we do things ✭:

  1. Allocate and initialize the frame bitmap and the kernel page directory
  2. Map all the pages in the kernel-reserved 8MiB region to their identical address on physical memory (so 0 - 8MiB on physical memory)
  3. Register the page fault handler (will come to this soon)
  4. Switch the active page directory to the kernel's page directory by setting the value of control register CR3
  5. Finally, enable paging and enter protected mode at the same time by setting the corresponding two flags of control register CR0

Handling Page Faults

There is one last thing we need to take care of. After switching to paging mode, the hardware MMU issues a page fault exception (on x86, the interrupt # 14) whenever an access to an unmapped virtual address happens. The OS must provide a handler for that interrupt and decide what to do with this access, either allocating a frame for a valid access or kill the offending process for an invalid access.

For now, we write a basic page fault handler @ src/memory/paging.c (and register it in paging_init(), as shown in the last section):

/** Page fault (ISR # 14) handler. */
static void
page_fault_handler(interrupt_state_t *state)
{
    /** The CR2 register holds the faulty address. */
    uint32_t vaddr;
    asm ( "movl %%cr2, %0" : "=r" (vaddr) : );

    /**
     * Analyze the least significant 3 bits of error code to see what
     * triggered this page fault:
     *   - bit 0: page present -> 1, otherwise 0
     *   - bit 1: is a write operation -> 1, read -> 0
     *   - bit 2: is from user mode -> 1, kernel -> 0
     *
     * See https://wiki.osdev.org/Paging for more.
     */
    bool present = state->err_code & 0x1;
    bool write   = state->err_code & 0x2;
    bool user    = state->err_code & 0x4;

    /** Just prints an information message for now. */
    info("Caught page fault {\n"
         "  faulty addr = %#X\n"
         "  present: %d\n"
         "  write:   %d\n"
         "  user:    %d\n"
         "}", vaddr, present, write, user);

    /**
     * Temporary incomplete handler logic: alloc a frame for the faulty
     * page if it is the kernel trying to read a faulty address which
     * is not present.
     */
    // if ((!present) && (!user)) {
    //     pte_t *pte = paging_walk_pgdir(kernel_pgdir, vaddr, true);

    // }
    panic("page fault not handled!");
}

Upon a page fault, the CR2 register will hold the virtual address value that trigger the fault, and the err_code field in the interrupt state contains three flags that explain the reason of the fault. The handler should then react accordingly to different combinations of the three flags. For now, we are just printing the information out, since we have no user process support yet.

Progress So Far

Let's try out and see how our kernel behaves in paging mode and how it reacts to a page fault! Add these to the main function @ src/kernel.c:

#include "memory/paging.h"


/** The main function that `boot.s` jumps to. */
void
kernel_main(unsigned long magic, unsigned long addr)
{
    ... // All the previous initializations...

    /** Initialize paging and switch to use paging. */
    _init_message("setting up virtual memory using paging");
    paging_init();
    _init_message_ok();
    info("supporting physical memory size: %3dMiB", NUM_FRAMES * 4 / 1024);
    info("reserving memory for the kernel: %3dMiB", KMEM_MAX / 1024 / 1024);

    /** Executes `sti`, CPU starts taking in interrupts. */
    _enable_interrupts();

    uint32_t faulty_addr = 0xBCDE0123;

    printf("\nKernel reading an unmapped address!\n");
    int dummy = *((int *) faulty_addr);
    printf("%d\n", dummy);

    while (1)   // CPU idles with a `hlt` loop.
        asm volatile ( "hlt" );
}

The _init_message*() functions are just printing some progress messages. Don't forget to comment out the dummy printf("."); in our current timer interrupt handler.

This should produce a terminal window as the following after booting up:

Current repo structure:

├── Makefile
├── scripts
│   ├── gdb_init
│   ├── grub.cfg
│   └── kernel.ld
├── src
│   ├── boot
│   │   ├── boot.s
│   │   ├── elf.h
│   │   └── multiboot.h
│   ├── common
│   │   ├── debug.c
│   │   ├── debug.h
│   │   ├── port.c
│   │   ├── port.h
│   │   ├── printf.c
│   │   ├── printf.h
│   │   ├── string.c
│   │   ├── string.h
│   │   ├── types.c
│   │   └── types.h
│   ├── device
│   │   ├── keyboard.c
│   │   ├── keyboard.h
│   │   ├── timer.c
│   │   └── timer.h
│   ├── display
│   │   ├── terminal.c
│   │   ├── terminal.h
│   │   └── vga.h
│   ├── interrupt
│   │   ├── idt-load.s
│   │   ├── idt.c
│   │   ├── idt.h
│   │   ├── isr-stub.s
│   │   ├── isr.c
│   │   └── isr.h
│   ├── memory
│   │   ├── gdt-load.s
│   │   ├── gdt.c
│   │   ├── gdt.h
│   │   ├── paging.c
│   │   └── paging.h
│   └── kernel.c