Module b1_syscalls

§BST - A1: System Calls

The goal of the BST lecture is to introduce vertical and horizontal isolation to protect the RStuBS kernel from applications and also applications from each other. In this first assignment, we focus on vertical isolation (horizontal isolation will be the topic of the next assignment).

Vertical isolation protects the kernel from user code by preventing that code from executing certain privileged instructions (like cli/sti). For this, CPUs provide multiple execution modes: a kernel mode (ring 0) with the full instruction set and a user mode (ring 3) where some instructions are not available. In RStuBS, we have to implement this mode or ring switch manually.

For specific privileged operations, we will also implement a syscall API.

§Moving Applications to Ring 3

The x86 (IA32) architecture specifies four privilege levels (rings), but we only require ring 0 (kernel) and ring 3 (user). In the first step, the system should be adapted so that application code always executes on ring 3 and only interrupt handling (especially for time-slice scheduling interrupts) takes place on the privileged ring 0. Only in the second step is an interface for system calls introduced, which allows a synchronous entry into the kernel to execute privileged operations.

§Privilege Levels and the GDT

These privilege levels (rings) are directly related to segmentation on x86. The “Global Descriptor Table” (GDT) contains a list of segments, each specifying a memory region and a corresponding privilege level (ring). The CPU uses these segments for all memory accesses. For this, the CPU has specific registers: the code segment (CS), stack segment (SS), and data segment (DS). They are used implicitly to perform the corresponding memory accesses. An access into a segment is only valid if the offset lies within the segment's limit and the CPU runs at a sufficient privilege level.

The manual differentiates between three privilege levels:

CPL: Current Privilege Level (in current CS of the CPU)
RPL: Requestor Privilege Level (in DS, SS of the CPU or instruction)
DPL: Descriptor Privilege Level (in GDT descriptor)

Depending on the segment, the access is only valid if max(CPL, RPL) <= DPL (DS) or CPL == RPL == DPL (CS, SS). The comparison is numeric, with 0 being the most privileged level.

For operation in ring 3, we have to add two new entries to the GDT. They define the code and data/stack segments for the user mode. For now, we configure them with access to the entire address space (this will be tackled in the next assignment).
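The two user segments can be sketched as follows. This is a minimal sketch, not the RStuBS implementation: the helper names (`descriptor`, `flat_segment`, `selector`) are hypothetical, and the GDT indices 3 and 4 used in the selector examples are only an assumption about where the new entries are placed.

```rust
/// Access-byte bits of a code/data segment descriptor.
pub const PRESENT: u8 = 1 << 7;
/// S bit: code or data segment (not a system descriptor).
pub const CODE_OR_DATA: u8 = 1 << 4;
/// Type field: executable and readable code.
pub const TYPE_CODE_RX: u8 = 0b1010;
/// Type field: writable data (also usable as a stack segment).
pub const TYPE_DATA_RW: u8 = 0b0010;
/// Flag nibble: 4 KiB granularity and 32-bit default operand size.
pub const FLAGS_4K_32BIT: u8 = 0b1100;

/// Pack base, 20-bit limit, access byte and flags into the 8-byte descriptor format.
pub const fn descriptor(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let base = base as u64;
    let limit = limit as u64;
    (limit & 0xFFFF)
        | ((base & 0xFF_FFFF) << 16)
        | ((access as u64) << 40)
        | (((limit >> 16) & 0xF) << 48)
        | (((flags as u64) & 0xF) << 52)
        | (((base >> 24) & 0xFF) << 56)
}

/// Flat segment (base 0, covering the entire 4 GiB address space) for `ring`.
pub const fn flat_segment(ring: u8, ty: u8) -> u64 {
    let access = PRESENT | ((ring & 0b11) << 5) | CODE_OR_DATA | ty;
    descriptor(0, 0xF_FFFF, access, FLAGS_4K_32BIT)
}

/// Selector: GDT index in bits 3 and up, requested privilege level in bits 0 and 1.
pub const fn selector(index: u16, rpl: u16) -> u16 {
    (index << 3) | (rpl & 0b11)
}
```

With the user code segment at the assumed index 3 and the user data/stack segment at index 4, the selectors loaded into CS and DS/SS would be `selector(3, 3)` and `selector(4, 3)`. The only difference to the ring 0 descriptors is the DPL field in the access byte.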

§Switching Between Rings

Now we have enabled the user mode, but how do we switch between rings?

We generally have three different ring switches:

(1) The initial switch to ring 3 for an application
(2) The switch to ring 0 for interrupts and syscalls
(3) The switch back to ring 3 after an interrupt or syscall

Switch (1) and (2) have to be implemented manually, and switch (3) is done automatically by the CPU as part of iretd.

The third new GDT entry, the Task State Segment Descriptor, has to be added for the switch from user to kernel during interrupt handling. Usually, the CPU automatically switches to ring 0 if an interrupt is triggered. However, we have one problem: the stack. We generally do not want our interrupt handlers to rely on the user stack. Thus, the stack pointer (esp) and segment (SS) must be switched to a separate kernel stack before the interrupt handler runs and restored to the user stack afterward. Configuring the interrupt stack requires the use of x86 hardware tasks as a workaround. These are configured with a TaskStateSegment. This TSS provides the kernel stack pointer (esp0) and stack segment (ss0) for ring 0 when an interrupt occurs.

To summarize:

Create a new TSS descriptor
Implement the layout of the TaskStateSegment (we only need ss0 and esp0)
Load the TSS (load_tss) during boot

Note: The structures of these descriptors are described in the third volume of the three-part IA-32 Developer’s Guide in the sections “Segment Descriptors” (3.4.5) and “Task Management Data Structures” (7.2.2).
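A possible shape of the TSS and its descriptor is sketched here. The field names other than esp0 and ss0, the constructor, and the `tss_descriptor` helper are assumptions made for illustration; only the 104-byte layout, the positions of esp0 and ss0, and the access byte 0x89 (present, DPL 0, available 32-bit TSS) follow the Intel manual.

```rust
use core::mem::size_of;

/// 32-bit Task State Segment. Only `esp0` and `ss0` matter for the ring switch;
/// the remaining fields exist to keep the hardware-defined 104-byte layout.
#[repr(C)]
pub struct TaskStateSegment {
    prev_task: u32,
    /// Stack pointer loaded by the CPU when entering ring 0.
    pub esp0: u32,
    /// Stack segment selector for ring 0 (upper 16 bits are reserved).
    pub ss0: u32,
    /// esp1/ss1, esp2/ss2, cr3, eip, eflags, general and segment registers, LDT.
    _unused: [u32; 22],
    trap: u16,
    /// Offset of the I/O permission bitmap; pointing past the TSS means "none".
    pub iomap_base: u16,
}

impl TaskStateSegment {
    /// `kernel_data_selector` is whatever selector your GDT uses for the ring 0 stack.
    pub const fn new(kernel_data_selector: u16) -> Self {
        Self {
            prev_task: 0,
            esp0: 0,
            ss0: kernel_data_selector as u32,
            _unused: [0; 22],
            trap: 0,
            iomap_base: size_of::<Self>() as u16,
        }
    }
}

/// Build the GDT entry for a TSS located at the 32-bit address `base`.
pub fn tss_descriptor(base: u32) -> u64 {
    const ACCESS: u64 = 0x89; // present, DPL 0, type 9 (available 32-bit TSS)
    let limit = (size_of::<TaskStateSegment>() - 1) as u64; // byte granularity
    let base = base as u64;
    (limit & 0xFFFF)
        | ((base & 0xFF_FFFF) << 16)
        | (ACCESS << 40)
        | (((limit >> 16) & 0xF) << 48)
        | ((base >> 24) << 56)
}
```

The selector of this descriptor is what the ltr instruction expects; in this assignment that step is covered by load_tss.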

§User/Kernel Stack Implementation

Now, how do we use this TaskStateSegment?

The goal is to have unique kernel and user stacks for each thread. The Thread already contains a kernel stack pointer, so you only have to add a new user stack to user threads.

We also must ensure that each application runs its ring 0 code on its own kernel stack. However, x86 does not switch the kernel stack between threads automatically. Therefore, during a context switch (in Scheduler::dispatch), we have to ensure that the next interrupt uses the kernel stack of the next thread.
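The update during dispatch can be as small as one store. The types below are reduced, hypothetical stand-ins for the kernel's real Thread and TaskStateSegment, and `prepare_kernel_stack` is an invented name; the sketch only shows the idea of pointing esp0 at the next thread's kernel stack.

```rust
/// Reduced stand-in for the TSS (only the field needed here).
pub struct Tss {
    pub esp0: u32,
}

/// Reduced stand-in for a thread; the field name is an assumption.
pub struct Thread {
    /// Highest address of this thread's kernel stack (stacks grow downwards).
    pub kernel_stack_top: u32,
}

/// Call this on the dispatch path before switching to `next`.
/// While a thread runs in ring 3 its kernel stack is empty, so the next
/// interrupt or syscall can start pushing at the very top of that stack.
pub fn prepare_kernel_stack(tss: &mut Tss, next: &Thread) {
    tss.esp0 = next.kernel_stack_top;
}
```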

§Initial Switch to Ring 3

Generally, x86 has multiple ways to perform the initial switch from kernel to user mode. The option we choose is to fake a return from an interrupt, which automatically switches rings.

When dispatching to a new thread, we have to leave ring 0. For this, you have to extend the Thread::init method. Originally, this method prepares a thread context as if the thread had run before and was just preempted by the kernel. It fakes the thread control block and the stack so that it calls the Thread::kernel_kickoff function.

Instead of calling the action function directly, you have to perform the jump to ring 3 by preparing a fake stack that looks like it was created by a hardware interrupt that jumped from ring 3 to ring 0 (use switch_to_user). With this faked stack, you invoke the iretd instruction, which reverts this privilege increase and thereby brings us to ring 3. This iretd should jump to action, which is then executed in ring 3. As we also want to switch to the user stack, you have to pass it into Thread::kernel_kickoff by also putting it on the kernel stack.

A description of the faked interrupt stack can be found in the Intel handbook under “Exception and Interrupt Handling” (6.12). Besides that, you also have to set the segment registers (DS, ES, FS, and GS) to the correct user data segment. This has to be done only once, as ring 0 can access the ring 3 data segment, so you could also do this in gdt::init.
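The frame that iretd consumes when returning to a less privileged ring can be written down as a plain struct. The selector values 0x1B and 0x23 and the names `IretFrame` and `user_frame` are assumptions for this sketch (they depend on your GDT layout and code base); the field order is the one the CPU pushes on a ring 3 to ring 0 interrupt.

```rust
/// Assumed user selectors: GDT index 3 and 4, each with RPL 3.
pub const USER_CODE: u32 = 0x1B;
pub const USER_DATA: u32 = 0x23;

/// Bit 1 of EFLAGS is reserved and always set.
pub const EFLAGS_RESERVED: u32 = 1 << 1;
/// Interrupt enable flag, so the timer can still preempt the user thread.
pub const EFLAGS_IF: u32 = 1 << 9;

/// Stack contents expected by `iretd` on a return to ring 3,
/// listed from the lowest address (top of stack) upwards.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IretFrame {
    pub eip: u32,
    pub cs: u32,
    pub eflags: u32,
    pub esp: u32,
    pub ss: u32,
}

/// Build the faked frame for a thread starting at `entry` on `user_stack_top`.
pub fn user_frame(entry: u32, user_stack_top: u32) -> IretFrame {
    IretFrame {
        eip: entry,
        cs: USER_CODE,
        eflags: EFLAGS_RESERVED | EFLAGS_IF,
        esp: user_stack_top,
        ss: USER_DATA,
    }
}
```

Because the RPL in the faked CS is 3, iretd also pops ESP and SS and thereby switches to the user stack in the same step.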

§Testing & Debugging

Now is a good point to test and debug your ring switch. Disable your timer for now and attach GDB to step through your implementation (cargo run-gdb + cargo gdb). Then set breakpoints on the user action functions and the switch_to_user function to see if it works as expected. You can check your CPU registers and segments (including rings) with monitor info registers. It might also be helpful to view the disassembly with layout asm when stepping through the code. Try executing cli in a user application, which should fail (general protection fault).

§System-Call Interface

Now that we have left ring 0 successfully, we have to open up a way for the application to execute privileged operations securely from ring 3. For this, we provide a synchronous path from ring 3 to ring 0. On top of this, we build our system-call interface.

In this first assignment, we want to start with the following syscalls. The next assignments will introduce more.

/// The errors that syscalls can return
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(usize)]
pub enum Error {
    /// Invalid syscall id
    Id,
    /// Invalid syscall argument
    Argument,
    /// Resource cannot be initialized, it most likely was already initialized
    Init,
    /// Resource cannot be found
    NotFound,
    // ...
}

/// Write `buf` to the screen/serial based on `fd`. Accepts an optional screen position.
pub fn write(fd: usize, buf: &[u8], pos: Option<(u8, u8)>) -> Result<usize, Error> { todo!() }
/// Read from the keyboard.
pub fn read(fd: usize, buf: &mut [u8]) -> Result<usize, Error> { todo!() }
/// Put the thread to sleep for `ms` milliseconds.
pub fn sleep(ms: usize) -> Result<(), Error> { todo!() }
/// Create a new semaphore with the given id or fail if the id is already used.
pub fn sem_init(id: usize, value: usize) -> Result<(), Error> { todo!() }
/// Destroy a semaphore or fail if it is not initialized.
pub fn sem_destroy(id: usize) -> Result<(), Error> { todo!() }
/// Wait on a semaphore or fail if it is not initialized.
pub fn sem_wait(id: usize) -> Result<(), Error> { todo!() }
/// Wake up other threads or fail if the id is not initialized.
pub fn sem_signal(id: usize) -> Result<(), Error> { todo!() }

Recommendation: Create a println! macro and a BufWriter that calls write internally. To make your life easier, we provide an example below:

BufWriter for write

use core::fmt;

/// Buffered writer that collects output until its buffer is full.
/// It then flushes the buffer with the write syscall.
pub struct BufWriter {
    fd: usize,
    buf: [u8; Self::LEN],
    idx: usize,
    pos: Option<(u8, u8)>,
}
impl BufWriter {
    pub const LEN: usize = 64;
    /// Create a new writer for the given file and optional screen position on the CGA.
    pub const fn new(fd: usize, pos: Option<(u8, u8)>) -> Self {
        Self {
            fd,
            buf: [0; Self::LEN],
            idx: 0,
            pos,
        }
    }
    /// Call the write syscall and empty the buffer.
    pub fn flush(&mut self) {
        write(self.fd, &self.buf[..self.idx], self.pos).unwrap();
        self.idx = 0;
    }
}
impl Drop for BufWriter {
    fn drop(&mut self) {
        self.flush();
    }
}
impl fmt::Write for BufWriter {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let mut s = s.as_bytes();
        while !s.is_empty() {
            let remainder = (Self::LEN - self.idx).min(s.len());
            self.buf[self.idx..self.idx + remainder].copy_from_slice(&s[..remainder]);
            self.idx += remainder;
            s = &s[remainder..];
            // Input left over means the buffer is full: flush it (this also resets `idx`).
            if !s.is_empty() {
                self.flush();
            }
        }
        Ok(())
    }
}

/// The print macro creates a wrapper around Rust's formatting and IO
/// ```
/// // print to a screen position
/// println!(0, 3: "Hello World!");
/// // print to the kout screen
/// println!("Hello World!");
/// // print to the dbg screen
/// println!(dbg: "Hello World!");
/// ```
#[macro_export]
macro_rules! println {
    ($x:tt, $y:tt : $($arg:tt)*) => ({
        use core::fmt::Write;
        let _ = writeln!($crate::BufWriter::new(0, Some(($x, $y))), $($arg)*);
    });
    (dbg: $($arg:tt)*) => ({
        use core::fmt::Write;
        let _ = writeln!($crate::BufWriter::new(2, None), $($arg)*);
    });
    (serial: $($arg:tt)*) => ({
        use core::fmt::Write;
        let _ = writeln!($crate::BufWriter::new(3, None), $($arg)*);
    });
    ($($arg:tt)*) => ({
        use core::fmt::Write;
        let _ = writeln!($crate::BufWriter::new(1, None), $($arg)*);
    });
}

§Triggering Syscall Interrupts (User)

The syscall interface consists of two parts:

A user syscall-stub, triggering the syscall after preparing the parameters
A system syscall-handler, handling the syscall and accepting the parameters

We start with the user part: While interrupts are asynchronous, x86 CPUs give us the possibility to trigger traps from software with the int instruction (at least for custom interrupt vectors, which start at 32). Since a syscall (interrupt) provokes a switch to the kernel stack, we cannot pass parameters on the (user) stack. Instead, we want to pass syscall parameters (and the syscall number) in registers (EAX, ECX, EDX, EBX, EDI). Therefore, the user syscall-stub has to put the parameters in the correct registers before executing int.
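A generic stub could look roughly like this. Everything specific here is an assumption: the vector 0x80, the register assignment (number in EAX, arguments in ECX, EDX, EBX, EDI, raw result in EAX), the syscall number `SYS_SLEEP`, and the function names. The second `raw_syscall` is only a stand-in so that the sketch compiles on a development host; it is not part of the mechanism.

```rust
/// Hypothetical syscall number, for illustration only.
pub const SYS_SLEEP: usize = 2;

/// Trap into the kernel: syscall number in EAX, up to four arguments in
/// ECX, EDX, EBX and EDI. The kernel leaves its raw result in EAX.
#[cfg(all(target_arch = "x86", target_os = "none"))]
pub fn raw_syscall(id: usize, args: [usize; 4]) -> usize {
    let ret: usize;
    // SAFETY: the kernel handler for this vector restores every register
    // except EAX, which is declared as an output.
    unsafe {
        core::arch::asm!(
            "int 0x80", // assumed vector; must match the IDT entry of the kernel
            inlateout("eax") id => ret,
            in("ecx") args[0],
            in("edx") args[1],
            in("ebx") args[2],
            in("edi") args[3],
        );
    }
    ret
}

/// Host stand-in: there is no RStuBS kernel to trap into, so every call
/// reports failure with all bits set.
#[cfg(not(all(target_arch = "x86", target_os = "none")))]
pub fn raw_syscall(_id: usize, _args: [usize; 4]) -> usize {
    usize::MAX
}

/// Thin wrapper that fills unused argument slots with zero.
pub fn sleep_raw(ms: usize) -> usize {
    raw_syscall(SYS_SLEEP, [ms, 0, 0, 0])
}
```

Letting the compiler place the values via register operands avoids hand-written mov instructions; a version with explicit mov and push/pop works just as well.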

Return values should also be passed in registers. There are multiple ways to return and distinguish an error from a correct return value. The errors can be encoded as special values, put in an extra register, or the carry flag can be used to distinguish an error code from a normal return value. You can implement a method of your choice.
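One of these options, encoding errors as special values, might look like the sketch below. The `encode` and `decode` helpers and the choice of the highest register values as error codes are assumptions for illustration; the `Error` enum repeats only the four variants listed in this assignment.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(usize)]
pub enum Error {
    Id,
    Argument,
    Init,
    NotFound,
}

/// All errors, ordered by their discriminant.
pub const ERRORS: [Error; 4] = [Error::Id, Error::Argument, Error::Init, Error::NotFound];

/// Kernel side: squeeze a result into one register.
/// Errors occupy the highest values, counted down from `usize::MAX`.
pub fn encode(result: Result<usize, Error>) -> usize {
    match result {
        Ok(value) => {
            // Success values must stay below the range reserved for errors.
            debug_assert!(value <= usize::MAX - ERRORS.len());
            value
        }
        Err(error) => usize::MAX - error as usize,
    }
}

/// User side: turn the register value back into a `Result`.
pub fn decode(raw: usize) -> Result<usize, Error> {
    match ERRORS.get(usize::MAX - raw) {
        Some(&error) => Err(error),
        None => Ok(raw),
    }
}
```

The price of this scheme is that the top few values can never be returned as a success value, which is rarely a problem for lengths and counts.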

Tips:

Note that we return Results, so your wrapper should convert from/to the integers in the registers accordingly.
Put all your user syscall-stubs in a single separate Rust file (we separate them from the kernel in the next assignment).
For more information about inline assembly see https://doc.rust-lang.org/reference/inline-assembly.html
You probably only need mov, push, pop, pushfd, and int.

§Handling Syscall Interrupts (Kernel)

The other side of the syscall mechanism is the kernel handler. Triggering a trap via the int instruction is usually a privileged operation, not allowed for ring 3 code, and would result in a general protection fault. However, interrupt descriptor table entries for custom vectors can be configured to allow activation from the user (via int). Create a new IDT entry for the syscall vector, calling the syscall_trampoline function.
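What makes a vector reachable from ring 3 is the DPL field of its gate. The following sketch packs a 32-bit interrupt gate by hand; the `gate` helper is hypothetical, and choosing an interrupt gate (type 0xE, which disables interrupts on entry) rather than a trap gate is just one possible design.

```rust
/// Gate type: 32-bit interrupt gate (interrupts are disabled on entry).
pub const INTERRUPT_GATE_32: u64 = 0xE;
/// Present bit of the gate's attribute byte.
pub const GATE_PRESENT: u64 = 1 << 7;

/// Pack an IDT entry for `handler`, running in the segment `code_selector`.
/// `dpl` is the least privileged ring that may trigger the vector via `int`.
pub fn gate(handler: u32, code_selector: u16, dpl: u8) -> u64 {
    let handler = handler as u64;
    let attributes = GATE_PRESENT | (((dpl & 0b11) as u64) << 5) | INTERRUPT_GATE_32;
    (handler & 0xFFFF)
        | ((code_selector as u64) << 16)
        | (attributes << 40)
        | ((handler >> 16) << 48)
}
```

For the syscall vector, `dpl` would be 3, while hardware interrupt vectors can stay at 0: the DPL check applies only to software-triggered interrupts, so devices can still raise them while user code cannot.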

The syscall-handler must do the opposite of the syscall-stub and push the syscall parameters from the registers to the kernel stack. Since we need access to the additional registers, we have to write our own trampoline function (with assembly) that pushes the parameters to the stack, on top of the context that is already saved by the CPU during interrupt handling (InterruptStack). The SyscallStack is used to give the syscall-handler access to the parameters and the other context on the stack.

This also has to be implemented with inline assembly in syscall_trampoline, using the SyscallStack. Also, do not forget to pop the registers from the stack again after executing a syscall. The syscall-trampoline should call the syscall-handler function. This function then selects the actual implementation based on the syscall number, executes it, puts the result into the registers (SyscallStack), and returns to the user.

Note: Validate the user-provided syscall parameters on the kernel side!
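The dispatch and validation step might be structured like this. The field layout of `SyscallStack`, the syscall number, the sleep limit, and the return convention (value in EAX, 0 or 1 + error code in EDX) are all assumptions for this sketch and will differ from the structures in your code base.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(usize)]
pub enum Error {
    Id,
    Argument,
}

/// Registers pushed by the trampoline, in an assumed order.
#[repr(C)]
#[derive(Debug, Default)]
pub struct SyscallStack {
    pub edi: usize,
    pub ebx: usize,
    pub edx: usize,
    pub ecx: usize,
    pub eax: usize,
}

/// Hypothetical syscall number and argument limit.
pub const SYS_SLEEP: usize = 2;
pub const MAX_SLEEP_MS: usize = 60_000;

/// Example syscall: reject implausible arguments before acting on them.
fn sys_sleep(ms: usize) -> Result<usize, Error> {
    if ms > MAX_SLEEP_MS {
        return Err(Error::Argument);
    }
    // A real kernel would block the calling thread here.
    Ok(0)
}

/// Called by the trampoline with a view of the saved registers.
pub fn syscall_handler(stack: &mut SyscallStack) {
    let result = match stack.eax {
        SYS_SLEEP => sys_sleep(stack.ecx),
        _ => Err(Error::Id),
    };
    // Write the result back so the trampoline pops it into the registers.
    match result {
        Ok(value) => {
            stack.eax = value;
            stack.edx = 0;
        }
        Err(error) => {
            stack.eax = 0;
            stack.edx = 1 + error as usize;
        }
    }
}
```

Since the handler only modifies the saved copy of the registers, the trampoline's final pops are what actually deliver the result to the user.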

Check the Intel Manual’s “Exception and Interrupt Handling” section for more information.

§Checklist

All segment descriptors are defined correctly
The user stack is faked in the kickoff function
The correct stack is used for interrupts
A specific interrupt vector is used for syscalls
Registers are saved before the syscall (clobbers)
Registers are converted back to arguments in the kernel syscall handler
Test: Does cli in an app create a trap?
