The goal of the BST lecture is to introduce vertical and horizontal isolation to protect the RStuBS kernel from applications and also applications from each other. In this first assignment, we focus on vertical isolation (horizontal isolation will be the topic of the next assignment).

Vertical isolation protects the kernel from the user by preventing user code from executing certain privileged instructions (like cli/sti). For this, CPUs provide multiple execution modes: a kernel mode (ring 0) with the whole instruction set and a user mode (ring 3) where some instructions are not available. In RStuBS, we have to implement this mode or ring switch manually.

For specific privileged operations, we will also implement a syscall API.

Moving the application to ring 3

The x86 (IA32) architecture specifies four privilege levels (rings), but we only require ring 0 (kernel) and ring 3 (user). In the first step, the system should be adapted so that the code of the applications is always executed on ring 3 and only the handling of interrupts (especially time-slice scheduling interrupts) takes place on the privileged ring 0. Only in the second step is an interface for system calls introduced, which allows a synchronous entry into the kernel to execute privileged operations.

Privilege Levels and the GDT

These privilege levels (rings) are directly related to segmentation on x86. The "Global Descriptor Table" (GDT) contains a list of segments that specify a memory region and a corresponding privilege level (ring). The CPU uses these segments for all memory accesses. For this, the CPU has specific registers: the code segment (CS), stack segment (SS), and data segment (DS). They are used implicitly to perform the corresponding memory accesses. An access into a segment is only valid if the pointer is smaller than its size and the CPU has the correct privilege level.

The Intel manual differentiates between three Privilege Levels:

CPL: Current Privilege Level (in current CS of the CPU)
RPL: Requestor Privilege Level (in DS, SS of the CPU or instruction)
DPL: Descriptor Privilege Level (in GDT descriptor)

Depending on the segment, the access is only valid if CPL <= RPL <= DPL (DS) or CPL == RPL == DPL (CS, SS).

For operation in ring 3, we have to add two new entries to the GDT. They define the code and data/stack segments for the user mode. For now, we configure them with access to the entire address space (this will be tackled in the next assignment).
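The two user-mode entries could be encoded along the following lines. This is only a sketch: the helper names (make_descriptor, access_byte, selector) and the GDT indices used in the note below are hypothetical and not part of RStuBS. It assumes flat segments covering the entire 4 GiB address space with 4 KiB granularity.

```cpp
#include <cstdint>

// Encode a 32-bit segment descriptor (see "Segment Descriptors" in the
// Intel manual). limit is the 20-bit limit, flags the G/D-B/L/AVL nibble.
constexpr uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                   uint8_t access, uint8_t flags) {
    return  static_cast<uint64_t>(limit & 0xFFFF)                    // limit 15:0
          | (static_cast<uint64_t>(base & 0xFFFFFF) << 16)           // base 23:0
          | (static_cast<uint64_t>(access) << 40)                    // access byte
          | (static_cast<uint64_t>((limit >> 16) & 0xF) << 48)       // limit 19:16
          | (static_cast<uint64_t>(flags & 0xF) << 52)               // flags
          | (static_cast<uint64_t>((base >> 24) & 0xFF) << 56);      // base 31:24
}

// Access byte: present | DPL | S (code/data) | type.
// Type 0xA = code, execute/read; type 0x2 = data, read/write.
constexpr uint8_t access_byte(unsigned dpl, bool code) {
    return static_cast<uint8_t>(0x80 | ((dpl & 3) << 5) | 0x10 | (code ? 0xA : 0x2));
}

constexpr uint8_t FLAGS_FLAT_32 = 0xC;  // G = 1 (4 KiB units), D/B = 1 (32 bit)

// Ring 3 code and data/stack segments spanning the whole address space.
constexpr uint64_t USER_CODE = make_descriptor(0, 0xFFFFF, access_byte(3, true),  FLAGS_FLAT_32);
constexpr uint64_t USER_DATA = make_descriptor(0, 0xFFFFF, access_byte(3, false), FLAGS_FLAT_32);

// A segment selector combines the GDT index with the requested privilege level.
constexpr uint16_t selector(unsigned index, unsigned rpl) {
    return static_cast<uint16_t>((index << 3) | (rpl & 3));
}
```

If, for example, the user code and data descriptors were placed at GDT indices 3 and 4, the matching selectors would be selector(3, 3) and selector(4, 3).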

Switching Between Rings

Now we have enabled the user mode, but how do we switch between rings?

We generally have three different ring switches:

(1) The initial switch to ring 3 for an application
(2) The switch to ring 0 for interrupts and syscalls
(3) The switch back to ring 3 after an interrupt or syscall

Switch (1) and (2) have to be implemented manually, and switch (3) is done automatically by the CPU as part of iretd.

The third new GDT entry, the Task State Segment (TSS) descriptor, has to be added for the switch from user to kernel during interrupt handling. Usually, the CPU automatically switches to ring 0 if an interrupt is triggered. However, we have one problem: the stack. We generally do not want our interrupt handlers to rely on the user stack. Thus, the stack pointer (ESP) and segment (SS) must be switched to a separate kernel stack before the interrupt handler runs and restored to the user stack afterward. Configuring the interrupt stack requires the use of x86 hardware tasks as a workaround. These are configured with a TaskStateSegment. This TSS provides the kernel stack pointer (ESP0) and stack segment (SS0) for ring 0 when an interrupt occurs.

To summarize:

Create a new TSS descriptor
Implement the layout of the TaskStateSegment (we only need SS0 and ESP0)
Load the TSS (ltr) during boot
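The layout from the list above might look roughly like this. It is a sketch under the assumption that only ESP0 and SS0 are used and all other fields are treated as padding; the field names and the constants TSS_ACCESS and TSS_LIMIT are chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit TSS as laid out by the hardware (104 bytes). Only esp0 and ss0
// are used here; the remaining fields are kept as padding.
struct TaskStateSegment {
    uint32_t prev_task_link;  // unused without hardware task switching
    uint32_t esp0;            // stack pointer loaded on entry to ring 0
    uint32_t ss0;             // stack segment loaded on entry to ring 0 (low 16 bits)
    uint32_t unused[22];      // ESP1/SS1 through the LDT selector
    uint16_t trap;            // debug trap flag (bit 0)
    uint16_t iomap_base;      // offset of the I/O permission bitmap
};

static_assert(sizeof(TaskStateSegment) == 104, "TSS must be 104 bytes");
static_assert(offsetof(TaskStateSegment, esp0) == 4, "ESP0 lives at offset 4");
static_assert(offsetof(TaskStateSegment, ss0) == 8, "SS0 lives at offset 8");

// Access byte of the TSS descriptor: present, DPL 0, system type 0x9
// (32-bit TSS, available).
constexpr uint8_t TSS_ACCESS = 0x80 | 0x09;

// Limit field of the TSS descriptor (size in bytes minus one).
constexpr uint32_t TSS_LIMIT = sizeof(TaskStateSegment) - 1;
```

The descriptor's base is the address of the TaskStateSegment instance. Setting iomap_base to sizeof(TaskStateSegment) tells the CPU that no I/O permission bitmap is present. During boot, the selector of this descriptor is then loaded with ltr.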

Note: The structures of these descriptors are described in the third volume of the three-part IA-32 Developer's Guide in the sections "Segment Descriptors" (3.4.5) and "Task Management Data Structures" (7.2.2).

User/Kernel Stack Implementation

Now, how do we use this TaskStateSegment?

The goal is to have unique kernel and user stacks for each thread. The Thread already contains a kernel stack pointer, so you only have to add a new user stack to user threads.

We also must ensure that each application runs its ring 0 code on its own kernel stack. However, x86 does not pick the kernel stack of the current thread automatically; it only loads whatever is configured in the TSS. Therefore, during a context switch (in the Dispatcher), we have to ensure that the next interrupt uses the kernel stack of the next thread.
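One possible shape of this step is sketched below. The types KernelStackSlot and ThreadStub and the function prepare_kernel_entry are hypothetical stand-ins (for the ESP0/SS0 fields of the TSS and for the Thread class, respectively), written so that the snippet compiles on a host system.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the two TSS fields the CPU reads on a ring 3 -> ring 0 entry.
struct KernelStackSlot {
    uint32_t esp0;
    uint32_t ss0;
};

// Stand-in for the per-thread data: the top of the thread's kernel stack.
struct ThreadStub {
    uint32_t kernel_stack_top;
};

// To be called by the dispatcher right before switching to `next`, so that
// the next interrupt from ring 3 lands on the kernel stack of `next`.
inline void prepare_kernel_entry(KernelStackSlot &tss, const ThreadStub &next,
                                 uint16_t kernel_ss) {
    assert(next.kernel_stack_top != 0);  // every thread needs a kernel stack
    tss.esp0 = next.kernel_stack_top;
    tss.ss0  = kernel_ss;
}
```

Since the kernel stack is empty whenever the thread runs in ring 3, pointing ESP0 at the top of the stack is sufficient in this sketch.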

Initial dispatch to ring 3

Every time we switch from the kernel to the user-space, we have to drop our privileges such that the user cannot monopolize the CPU and the OS always takes back control (e.g., via the timer interrupt). While the hardware helps us with the switch from and to ring 3 when we switch threads via (timer) interrupts, we have to take special provisions when we start a new thread:

When dispatching to a new thread, we have to leave ring 0. For this, you have to extend the prepareContext() method. Originally, this method prepares a thread context as if the thread had run before and was just preempted by the kernel. It fakes the thread control block and the stack so that it calls the Thread::kickoff() function.

Instead of calling the virtual Thread::action() method, you have to perform the jump to ring 3 by preparing a fake stack that looks like it was created by a hardware interrupt that jumped from ring 3 to ring 0. With this faked stack, you invoke the iret instruction, which reverts this privilege increase and thereby brings us to ring 3. This iret should jump to a newly introduced kickoff_user() trampoline function, which always executes on ring 3 and invokes the Thread::action() method.

A description of the faked interrupt stack can be found in the Intel handbook under "Exception and Interrupt Handling" (6.12). Besides that, you also have to set the segment registers (DS, ES, FS, and GS) to the correct user-space segments. Parameters can be passed to kickoff_user() by pushing them onto the user stack, as prepareContext() does for Thread::kickoff().
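The frame that iret expects when returning to a less privileged ring could be built as follows. This is a simplified, host-compilable sketch: push_iret_frame and the constant names are hypothetical, and setting the segment registers as well as passing arguments is left out.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t EFLAGS_ALWAYS_ONE = 1u << 1;  // reserved bit, always set
constexpr uint32_t EFLAGS_IF         = 1u << 9;  // interrupts enabled in user mode

// Push the five words that a ring 3 -> ring 0 interrupt would have left on
// the kernel stack. The stack grows downwards; the returned pointer is the
// new top of stack and points at the saved EIP.
uint32_t *push_iret_frame(uint32_t *sp, uint32_t entry, uint32_t user_sp,
                          uint16_t user_cs, uint16_t user_ss) {
    // Both selectors must request ring 3, otherwise iret does not drop privileges.
    assert((user_cs & 3) == 3 && (user_ss & 3) == 3);
    *--sp = user_ss;                        // SS of the user stack
    *--sp = user_sp;                        // ESP of the user stack
    *--sp = EFLAGS_ALWAYS_ONE | EFLAGS_IF;  // EFLAGS for the new thread
    *--sp = user_cs;                        // CS of the user code segment
    *--sp = entry;                          // EIP, e.g. address of kickoff_user()
    return sp;
}
```

Enabling the interrupt flag in the faked EFLAGS ensures that the timer interrupt can preempt the thread once it runs on ring 3.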

Testing & Debugging

Now is a good point to test and debug your ring switch. Disable your timer for now and attach GDB to step through your implementation (make qemu-serial-gdb + make connect-gdb). Then set breakpoints on the user action functions and the kickoff_user() function to see if it works as expected. You can check your CPU registers and segments (including rings) with monitor info registers. It might also be helpful to show the disassembly with layout asm when stepping through the code.

System-Call Interface

Now that we have left ring 0 successfully, we have to open up a way for the application on ring 3 to have privileged operations executed securely. For this, we provide a synchronous path from ring 3 to ring 0. On top of this, we build our system-call interface.

Exception handling

While interrupts are asynchronous, x86 CPUs give us the possibility to trigger traps from software with the int instruction. However, triggering an interrupt is usually a privileged operation, as the user could otherwise advance the system time very quickly by triggering the timer interrupt in a while() { int $timer_irq; } loop. Therefore, x86's int instruction is a privileged operation that, if invoked on ring 3, provokes a General Protection Fault.

However, by configuring an entry in the Interrupt Descriptor Table, we can allow the int instruction for individual interrupt vectors (e.g., for our system-call trap number) from user space. The format of those descriptors is explained in the Intel manual in section 6.12.
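A gate descriptor that may be invoked from ring 3 differs from the others only in its DPL field. The following sketch uses a hypothetical helper make_gate and assumes a 32-bit interrupt gate; whether an interrupt gate or a trap gate is the better fit for the system-call vector is a design decision left to you.

```cpp
#include <cstdint>

constexpr uint8_t GATE_INTERRUPT_32 = 0x0E;  // gate type: 32-bit interrupt gate

// Encode an IDT gate descriptor. dpl is the least privileged ring that is
// allowed to trigger this vector with the int instruction.
constexpr uint64_t make_gate(uint32_t handler, uint16_t code_selector, unsigned dpl) {
    return  static_cast<uint64_t>(handler & 0xFFFF)                      // offset 15:0
          | (static_cast<uint64_t>(code_selector) << 16)                 // kernel CS
          | (static_cast<uint64_t>(0x80 | ((dpl & 3) << 5) | GATE_INTERRUPT_32) << 40)
          | (static_cast<uint64_t>(handler >> 16) << 48);                // offset 31:16
}
```

With dpl = 0, an int from ring 3 on that vector raises a General Protection Fault; with dpl = 3, it enters the handler. The code selector always refers to the kernel code segment, since the handler itself runs on ring 0.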

As the system-call trap is not triggered by an external device, we must adapt the system-entrance path for the chosen interrupt vector. In interrupt/handler.asm, you have to save all registers (caller- and callee-saved) in a well-defined order such that we can access the system-call parameters in our C++ code. For this, you should extend the CPU context structure (InterruptContext). With this adaptation, you can plug your system-call path into the interrupt_handler() function.

Check the Intel Manual's "Exception and Interrupt Handling" section for more information.

Passing parameters

Since a system call provokes a switch to the kernel stack, the user cannot pass parameters on the (user) stack. Instead, we have to provide stub functions on the user side that load the system-call parameters into the CPU registers (eax, ecx, edx, ebx, esi, edi) to carry them across the privilege switch to ring 0. On the other side, the interrupt_entry function must store those registers on the kernel stack to make them accessible from C/C++ code. The system-call dispatcher uses one argument to determine the system call (e.g., the system-call number in eax).

Return values should also be passed in registers. There are multiple ways to return and distinguish an error from a correct return value. The errors can be encoded as special values, put in an extra register, or the carry flag can be used to distinguish an error code from a normal return value. You can implement a method of your choice.
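On the kernel side, the dispatch could be sketched as follows. Everything here is an assumption for illustration: the field order of InterruptContext, the system-call numbers, the error codes, the argument limits, and the stub services. Errors are encoded as negative return values, which is only one of the options mentioned above.

```cpp
#include <cstdint>

// Hypothetical register snapshot written by the entry code in handler.asm.
struct InterruptContext {
    uint32_t eax, ecx, edx, ebx, esi, edi;
};

enum : uint32_t { SYS_SLEEP = 0, SYS_SEM_WAIT = 1 };  // hypothetical numbers

constexpr int32_t ERR_NOSYS = -1;  // unknown system-call number
constexpr int32_t ERR_INVAL = -2;  // invalid argument

// Stubs standing in for the real kernel services; the limits are arbitrary.
constexpr int32_t sys_sleep(uint32_t ms)       { return ms > 60000 ? ERR_INVAL : 0; }
constexpr int32_t sys_sem_wait(uint32_t semid) { return semid < 8 ? 0 : ERR_INVAL; }

// Select the service by the number in eax; arguments follow in ecx, edx, ...
constexpr int32_t syscall_dispatch(const InterruptContext &ctx) {
    switch (ctx.eax) {
        case SYS_SLEEP:    return sys_sleep(ctx.ecx);
        case SYS_SEM_WAIT: return sys_sem_wait(ctx.ecx);
        default:           return ERR_NOSYS;
    }
}

// To be called from interrupt_handler() for the system-call vector. The
// result overwrites the saved eax, so the user stub finds it in eax after
// the registers have been restored.
inline void handle_syscall(InterruptContext &ctx) {
    ctx.eax = static_cast<uint32_t>(syscall_dispatch(ctx));
}
```

The user-side stub is the mirror image: it loads the number and arguments into the same registers, executes int with the system-call vector, and reads the result from eax.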

The following system calls are to be implemented by you. The concrete semantics can be adapted in a meaningful manner. Validate the user-provided syscall parameters on the kernel side and return error codes.

 
int write(int fd, const void *buf, size_t len, int x = -1, int y = -1)
int read(int fd, void *buf, size_t len)
int sleep(size_t ms)
int sem_init(int semid, int value)
int sem_destroy(int semid)
int sem_wait(int semid)
int sem_signal(int semid)
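For calls that take a buffer, such as write and read, validation could include a range check like the one below. The bounds USER_BEGIN and USER_END are purely hypothetical placeholders (in this assignment the user segments still span the entire address space), and valid_user_range is not an existing RStuBS function.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical bounds of the memory a user thread may hand to the kernel.
constexpr uintptr_t USER_BEGIN = 0x00400000;
constexpr uintptr_t USER_END   = 0xC0000000;  // exclusive

// True if [addr, addr + len) lies completely inside the user range.
// The subtraction form avoids an overflow in addr + len.
constexpr bool valid_user_range(uintptr_t addr, size_t len) {
    return addr >= USER_BEGIN
        && addr <= USER_END
        && len <= USER_END - addr;
}
```

A handler for write would then return an error code instead of touching the buffer whenever valid_user_range(buf, len) is false.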

It is a good idea to hide the write syscall behind an OutputStream-compatible wrapper. For easier debugging, it is recommended to create CPP macros for assertions and kernel panics, which can show the error location using the __LINE__, __FILE__ and __func__ variables.
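Such macros might look like this. The names PANIC, ASSERT and kernel_panic are hypothetical, and std::fprintf and std::abort merely stand in for the kernel's own output stream and halt routine so that the sketch compiles on a host system.

```cpp
#include <cstdio>
#include <cstdlib>

// Report the error location and stop. In the kernel, this would print via
// the kernel output stream and halt the CPU instead.
[[noreturn]] inline void kernel_panic(const char *file, int line,
                                      const char *func, const char *msg) {
    std::fprintf(stderr, "PANIC at %s:%d in %s(): %s\n", file, line, func, msg);
    std::abort();
}

// Macros are required so that __FILE__, __LINE__ and __func__ refer to the
// location of the caller rather than to kernel_panic() itself.
#define PANIC(msg) kernel_panic(__FILE__, __LINE__, __func__, (msg))

#define ASSERT(cond)                                   \
    do {                                               \
        if (!(cond)) PANIC("assertion failed: " #cond); \
    } while (0)
```

The do { ... } while (0) wrapper makes ASSERT(x); behave like a single statement, for example inside an if without braces.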

Tips:

Put all your user syscall-stubs in a single separate file (we separate them in the next assignment from the kernel).
For more information about inline assembly see https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
You probably only need mov, push, pop, pushfd, and int.

Checklist

All segment descriptors must be defined correctly
Fake the user stack in the kickoff function
Use the correct stack for interrupts
Specific interrupt vector for syscalls
Save registers before syscall (clobbers)
Convert registers back to arguments in the kernel syscall handler
Test: Does cli in an app create a trap?