StuBS
The goal of the BST lecture is to introduce vertical and horizontal isolation to protect the RStuBS kernel from applications and also applications from each other. In this first assignment, we focus on vertical isolation (horizontal isolation will be the topic of the next assignment).
Vertical isolation protects the kernel from the user by preventing applications from using certain privileged instructions (like cli/sti). For this, CPUs provide multiple execution modes: a kernel mode (ring 0) with the full instruction set and a user mode (ring 3) in which some instructions are not available. In RStuBS, we have to implement this mode switch (ring switch) manually.
For specific privileged operations, we will also implement a syscall API.
The x86 (IA32) architecture specifies four privilege levels (rings), but we only require ring 0 (kernel) and ring 3 (user). In the first step, the system should be adapted so that application code always executes on ring 3 and only interrupt handling (especially the time-slice scheduling interrupt) takes place on the privileged ring 0. Only in the second step is an interface for system calls introduced, which allows a synchronous entry into the kernel to execute privileged operations.
These privilege levels (rings) are directly related to segmentation on x86. The "Global Descriptor Table" (GDT) contains a list of segments, each of which specifies a memory region and a corresponding privilege level (ring). The CPU uses these segments for all memory accesses. For this, the CPU has specific registers: the code segment (CS), stack segment (SS), and data segment (DS). They are used implicitly to perform the corresponding memory accesses. An access into a segment is only valid if the offset lies within the segment's limit and the CPU runs at a sufficient privilege level.
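The Intel manual differentiates between three privilege levels: the Current Privilege Level (CPL) of the running code, the Descriptor Privilege Level (DPL) stored in a segment descriptor, and the Requested Privilege Level (RPL) stored in the lowest two bits of a segment selector.
Depending on the segment, the access is only valid if CPL <= RPL <= DPL (DS) or CPL == RPL == DPL (CS, SS).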
For operation in ring 3, we have to add two new entries to the GDT. They define the code and data/stack segments for the user mode. For now, we configure them with access to the entire address space (this will be tackled in the next assignment).
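The encoding of such flat ring 3 segments can be sketched as follows. This is a minimal illustration based on the descriptor layout in the Intel manual; the helper names (make_descriptor, access_byte, make_selector) are hypothetical and not part of the provided code base.

```cpp
#include <cstdint>

// Access-byte bits of a code/data segment descriptor.
constexpr uint8_t ACCESS_PRESENT   = 0x80;  // P: segment is present
constexpr uint8_t ACCESS_CODE_DATA = 0x10;  // S: code/data segment (not a system segment)
constexpr uint8_t TYPE_CODE_RX     = 0x0A;  // code, execute/read
constexpr uint8_t TYPE_DATA_RW     = 0x02;  // data, read/write

// Flag nibble: G (limit counted in 4 KiB pages) | D/B (32-bit segment).
constexpr uint8_t FLAGS_4K_32BIT = 0xC;

// Combine present bit, DPL (bits 5-6), and segment type into the access byte.
constexpr uint8_t access_byte(unsigned ring, uint8_t type) {
    return static_cast<uint8_t>(ACCESS_PRESENT | ((ring & 3u) << 5) | ACCESS_CODE_DATA | type);
}

// Scatter base, 20-bit limit, access byte, and flags into the 8-byte descriptor.
constexpr uint64_t make_descriptor(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags) {
    return  (static_cast<uint64_t>(limit) & 0xFFFFull)
         | ((static_cast<uint64_t>(base) & 0xFFFFFFull) << 16)
         |  (static_cast<uint64_t>(access) << 40)
         | (((static_cast<uint64_t>(limit) >> 16) & 0xFull) << 48)
         | ((static_cast<uint64_t>(flags) & 0xFull) << 52)
         | (((static_cast<uint64_t>(base) >> 24) & 0xFFull) << 56);
}

// Flat user segments: base 0, limit 0xFFFFF pages (the entire 4 GiB address space).
constexpr uint64_t user_code_segment() {
    return make_descriptor(0, 0xFFFFF, access_byte(3, TYPE_CODE_RX), FLAGS_4K_32BIT);
}
constexpr uint64_t user_data_segment() {
    return make_descriptor(0, 0xFFFFF, access_byte(3, TYPE_DATA_RW), FLAGS_4K_32BIT);
}

// A selector is the GDT index shifted left by 3, with the RPL in the lowest two bits.
constexpr uint16_t make_selector(unsigned index, unsigned rpl) {
    return static_cast<uint16_t>((index << 3) | (rpl & 3u));
}
```

Which GDT indices the new entries occupy depends on your existing table. If, for example, the user code segment ended up at index 3, its ring 3 selector would be make_selector(3, 3).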
Now we have enabled the user mode, but how do we switch between rings?
We generally have three different ring switches:
Switch (1) and (2) have to be implemented manually, and switch (3) is done automatically by the CPU as part of iretd.
The third new GDT entry, the task state segment descriptor, is required for the switch from user to kernel mode during interrupt handling. The CPU automatically switches to ring 0 when an interrupt is triggered. However, there is one problem: the stack. We do not want our interrupt handlers to rely on the user stack. Thus, the stack pointer (ESP) and stack segment (SS) must be switched to a separate kernel stack before the interrupt handler runs and restored to the user stack afterward. Configuring the interrupt stack requires the use of x86 hardware tasks as a workaround. These are configured with a TaskStateSegment. This TSS provides the kernel stack pointer (ESP0) and stack segment (SS0) for ring 0 when an interrupt occurs.
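To summarize:
- Add two GDT entries for the user-mode code and data/stack segments.
- Set up a TaskStateSegment (we only need SS0 and ESP0) and add its descriptor to the GDT.
- Load the task register (ltr) during boot.
Note: The structures of these descriptors are described in the third volume of the three-part IA-32 Developer's Guide in the sections "Segment Descriptors" (3.4.5) and "Task Management Data Structures" (7.2.2).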
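As a rough sketch, a 32-bit task state segment could be declared like this. The field names and the init_tss() helper are assumptions made for illustration; only the layout (104 bytes, ESP0 at offset 4, SS0 at offset 8) follows the Intel manual.

```cpp
#include <cstddef>
#include <cstdint>

// 32-bit TSS layout. 16-bit selector fields are widened to 32 bits here,
// so the reserved upper halves are covered without explicit padding.
struct TaskStateSegment {
    uint32_t prev_task_link;
    uint32_t esp0;                 // kernel stack pointer loaded on entry to ring 0
    uint32_t ss0;                  // kernel stack segment loaded on entry to ring 0
    uint32_t esp1, ss1, esp2, ss2; // stacks for rings 1 and 2 (unused)
    uint32_t cr3, eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt_selector;
    uint16_t debug_trap;
    uint16_t io_map_base;
};

static_assert(sizeof(TaskStateSegment) == 104, "a 32-bit TSS is 104 bytes long");
static_assert(offsetof(TaskStateSegment, esp0) == 4, "ESP0 lives at offset 4");
static_assert(offsetof(TaskStateSegment, ss0) == 8, "SS0 lives at offset 8");

// Access byte of the TSS descriptor: present, DPL 0, type 9 (available 32-bit TSS).
constexpr uint8_t TSS_DESCRIPTOR_ACCESS = 0x89;
// Byte-granular limit of the TSS descriptor.
constexpr uint32_t TSS_DESCRIPTOR_LIMIT = sizeof(TaskStateSegment) - 1;

// Fill in the only fields we need; everything else stays zero.
inline void init_tss(TaskStateSegment& tss, uint16_t kernel_ss, uint32_t kernel_stack_top) {
    tss = TaskStateSegment{};
    tss.ss0 = kernel_ss;
    tss.esp0 = kernel_stack_top;
    // A base at or beyond the segment limit means "no I/O permission bitmap".
    tss.io_map_base = sizeof(TaskStateSegment);
}
```

After the descriptor has been placed in the GDT, the ltr instruction is executed once during boot with the selector of that entry.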
Now, how do we use this TaskStateSegment?
The goal is to have unique kernel and user stacks for each thread. The Thread already contains a kernel stack pointer, so you only have to add a new user stack to user threads.
We also must ensure that each application runs its ring 0 code on its own kernel stack. However, x86 does not switch between kernel stacks automatically. Therefore, during a context switch (in the Dispatcher), we have to ensure that the next interrupt uses the kernel stack of the next thread.
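One way to achieve this is to rewrite ESP0 in the TSS on every dispatch. The following sketch uses simplified stand-ins: the reduced TaskStateSegment, the Thread member kernel_stack_top, and the function set_interrupt_stack() are hypothetical names, not the interfaces of the provided code.

```cpp
#include <cstdint>

// Reduced stand-in for the TSS; only the leading fields matter for this sketch.
struct TaskStateSegment {
    uint32_t prev_task_link;
    uint32_t esp0;
    uint32_t ss0;
};

// Reduced stand-in for a thread: the highest address of its kernel stack.
struct Thread {
    uint32_t kernel_stack_top;
};

// Called by the dispatcher right before switching to `next`.
// While a thread runs on ring 3 its kernel stack is empty, so the next
// interrupt from user mode can start at the very top of that stack.
inline void set_interrupt_stack(TaskStateSegment& tss, const Thread& next) {
    tss.esp0 = next.kernel_stack_top;
}
```

SS0 does not need to change per thread as long as all kernel stacks live in the same kernel data segment.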
Every time we switch from the kernel to the user-space, we have to drop our privileges such that the user cannot monopolize the CPU and the OS always takes back control (e.g., via the timer interrupt). While the hardware helps us with the switch from and to ring 3 when we switch threads via (timer) interrupts, we have to take special provisions when we start a new thread:
When dispatching to a new thread, we have to leave ring 0. For this, you have to extend the prepareContext() method. Originally, this method prepares a thread context as if the thread had run before and was just preempted by the kernel. It fakes the thread control block and the stack so that the first dispatch calls the Thread::kickoff() function.
Instead of calling the virtual Thread::action() method, you have to perform the jump to ring 3 by preparing a fake stack that looks like it was created by a hardware interrupt that jumped from ring 3 to ring 0. With this faked stack, you invoke the iret instruction, which reverts this privilege increase and thereby brings us to ring 3. This iret should jump to a newly introduced kickoff_user() trampoline function, which always executes on ring 3 and invokes the Thread::action() method.
A description of the faked interrupt stack can be found in the Intel handbook under "Exception and Interrupt Handling" (6.12). Besides that, you also have to set the segment registers (DS, ES, FS, and GS) to the correct user-space segments. Parameters can be passed to kickoff_user() by pushing them onto the user-space stack, as prepareContext() does for Thread::kickoff().
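The faked frame could be built roughly as follows. This is a sketch under assumptions: push_user_iret_frame() is a hypothetical helper, and the selector values used in the example depend on where your user segments sit in the GDT. The five values are laid out in the order in which the CPU pushes them on an interrupt that crosses from ring 3 to ring 0.

```cpp
#include <cstdint>

constexpr uint32_t EFLAGS_RESERVED = 1u << 1;  // bit 1 is always set
constexpr uint32_t EFLAGS_IF       = 1u << 9;  // interrupts enabled in user mode

// Pushes SS, ESP, EFLAGS, CS, and EIP onto the (kernel) stack `sp`.
// Returns the new stack pointer, which points at the EIP slot that iret pops first.
inline uint32_t* push_user_iret_frame(uint32_t* sp, uint32_t entry, uint32_t user_stack,
                                      uint16_t user_cs, uint16_t user_ss) {
    *--sp = user_ss;                      // SS to load for ring 3
    *--sp = user_stack;                   // ESP to load for ring 3
    *--sp = EFLAGS_RESERVED | EFLAGS_IF;  // EFLAGS; IF set so the timer can preempt
    *--sp = user_cs;                      // CS with RPL 3 selects the target ring
    *--sp = entry;                        // EIP, e.g. the address of kickoff_user()
    return sp;
}
```

Setting IF in the saved EFLAGS matters: without it, the thread would start on ring 3 with interrupts disabled and could no longer be preempted.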
Now is a good point to test and debug your ring switch. Disable your timer for now and attach GDB to step through your implementation (make qemu-serial-gdb + make connect-gdb). Then set breakpoints on the user action functions and the kickoff_user() function to see whether everything works as expected. You can check your CPU registers and segments (including rings) with monitor info registers. It might also be helpful to view the disassembly (layout asm) when stepping through the code.
Now that we have left ring 0 successfully, we have to open up a way for an application on ring 3 to have privileged operations executed securely. For this, we provide a synchronous path from ring 3 to ring 0. On top of this, we build our system-call interface.
While interrupts are asynchronous, x86 CPUs allow us to trigger traps from software with the int instruction. However, triggering an interrupt is usually a privileged operation, as the user could otherwise advance the system time arbitrarily fast by triggering the timer interrupt in a while() { int $timer_irq; } loop. Therefore, x86's int instruction is by default a privileged operation that, if invoked on ring 3, provokes a General Protection Fault.
However, by configuring an entry in the Interrupt Descriptor Table, we can allow the int instruction for individual interrupt vectors (e.g., for our system-call trap number) from user space. The format of those descriptors is explained in the Intel manual in section 6.12.
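The relevant field is the DPL of the gate descriptor: int is permitted from ring 3 only if the gate's DPL is 3. The sketch below encodes a 32-bit interrupt gate; make_interrupt_gate() is a hypothetical helper, and choosing an interrupt gate (type 0xE) rather than a trap gate is an assumption of this example.

```cpp
#include <cstdint>

constexpr uint8_t GATE_PRESENT      = 0x80;  // P: gate is valid
constexpr uint8_t GATE_INTERRUPT_32 = 0x0E;  // 32-bit interrupt gate (clears IF on entry)

// Encodes an 8-byte IDT gate: handler offset split into two halves,
// the kernel code selector, and the attribute byte holding P, DPL, and type.
constexpr uint64_t make_interrupt_gate(uint32_t handler, uint16_t code_selector, unsigned dpl) {
    return  (static_cast<uint64_t>(handler) & 0xFFFFull)
         |  (static_cast<uint64_t>(code_selector) << 16)
         |  (static_cast<uint64_t>(GATE_PRESENT | ((dpl & 3u) << 5) | GATE_INTERRUPT_32) << 40)
         | ((static_cast<uint64_t>(handler) >> 16) << 48);
}
```

Only the system-call vector should get DPL 3. All device vectors keep DPL 0, so an int on those vectors from ring 3 still raises a General Protection Fault.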
As the system-call trap is not triggered by an external device, we must adapt the system-entrance path for the chosen interrupt vector. In interrupt/handler.asm, you have to save all registers (caller- and callee-saved) in a well-defined order such that we can access the system-call parameters in our C++ code. For this, you should extend the CPU context structure (InterruptContext). With this adaptation, you can plug your system-call path into the interrupt_handler() function.
Check the Intel Manual's "Exception and Interrupt Handling" section for more information.
Since a system call provokes a switch to the kernel stack, the user cannot pass parameters on the (user) stack. Instead, we have to provide stub functions on the user side that load the system-call parameters into the CPU registers (eax, ecx, edx, ebx, esi, edi) to pass them, across the privilege switch, to ring 0. On the other side, the interrupt_entry function must store those registers on the kernel stack to make them accessible to C/C++ code. The system-call dispatcher uses one argument to determine the system call (e.g., the system-call number in eax).
Return values should also be passed in registers. There are multiple ways to return and distinguish an error from a correct return value. The errors can be encoded as special values, put in an extra register, or the carry flag can be used to distinguish an error code from a normal return value. You can implement a method of your choice.
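The kernel side of this scheme could look roughly like the following sketch. Everything concrete in it is an assumption for illustration: the reduced InterruptContext, the system-call number SYS_WRITE, the error constants, the length bound, and the choice to encode errors as negative return values in eax (one of the options mentioned above).

```cpp
#include <cstdint>

// Reduced stand-in for the saved register state; the real structure must match
// the order in which the assembly entry code pushes the registers.
struct InterruptContext {
    uint32_t eax, ecx, edx, ebx, esi, edi;
};

// Hypothetical system-call numbers and error codes.
enum Syscall : uint32_t { SYS_WRITE = 0 };
constexpr int32_t ERR_NOSYS = -1;  // unknown system-call number
constexpr int32_t ERR_INVAL = -2;  // invalid argument

constexpr uint32_t MAX_WRITE_LENGTH = 4096;  // arbitrary bound for this sketch

// Validates the arguments before touching any user memory.
inline int32_t sys_write(uint32_t buffer, uint32_t length) {
    if (buffer == 0 || length > MAX_WRITE_LENGTH) {
        return ERR_INVAL;
    }
    // A real kernel would additionally check that [buffer, buffer + length)
    // lies in user-accessible memory and then print the bytes.
    return static_cast<int32_t>(length);
}

// eax selects the system call; ecx and edx carry its arguments.
// The result is written back to the saved eax so the user stub sees it after iret.
inline void syscall_dispatch(InterruptContext& ctx) {
    int32_t result = ERR_NOSYS;
    switch (ctx.eax) {
        case SYS_WRITE:
            result = sys_write(ctx.ecx, ctx.edx);
            break;
        default:
            break;
    }
    ctx.eax = static_cast<uint32_t>(result);
}
```

With negative error codes, the user-side stub only has to reinterpret eax as a signed value. The price is that valid results are limited to the non-negative range.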
The following system calls are to be implemented by you. The concrete semantics can be adapted in a meaningful manner. Validate the user-provided syscall parameters on the kernel side and return error codes.
It is a good idea to hide the write syscall behind an OutputStream-compatible wrapper. For easier debugging, it is recommended to create CPP macros for assertions and kernel panics that show the error location using __LINE__, __FILE__, and __func__.
Tips:
- mov, push, pop, pushfd, and int.
- Does cli in an app create a trap?