xv6: a simple, Unix-like teaching operating system
Russ Cox
Frans Kaashoek
Robert Morris
xv6-book@pdos.csail.mit.edu
Draft as of August 29, 2017
Contents

0  Operating system interfaces         7
1  Operating system organization     17
2  Page tables                       29
3  Traps, interrupts, and drivers    39
4  Locking                           51
5  Scheduling                        61
6  File system                       77
7  Summary                           93
A  PC hardware                       95
B  The boot loader                   99
Index                              105
DRAFT as of August 29, 2017
3
https://pdos.csail.mit.edu/6.828/xv6
Foreword and acknowledgements
This is a draft text intended for a class on operating systems. It explains the main concepts of operating systems by studying an example kernel, named xv6. xv6 is a re-implementation of Dennis Ritchie’s and Ken Thompson’s Unix Version 6 (v6). xv6 loosely follows the structure and style of v6, but is implemented in ANSI C for an x86-based multiprocessor.
The text should be read along with the source code for xv6. This approach is inspired
by John Lions’s Commentary on UNIX 6th Edition (Peer to Peer Communications; ISBN: 1-57398-013-7; 1st edition (June 14, 2000)). See https://pdos.csail.mit.edu/6.828 for
pointers to on-line resources for v6 and xv6.
We have used this text in 6.828, the operating systems class at MIT. We thank the faculty, teaching assistants, and students of 6.828 who have all directly or indirectly contributed to xv6. In particular, we would like to thank Austin Clements and Nickolai
Zeldovich. Finally, we would like to thank people who emailed us bugs in the text:
Abutalib Aghayev, Sebastian Boehm, Anton Burtsev, Raphael Carvalho, Color Fuzzy,
Giuseppe, Wolfgang Keller, Austin Liew, Pavan Maddamsetti, Jacek Masiulaniec, Askar
Safin, Salman Shah, Pawel Szczurko, Warren Toomey, and Zou Chang Wei.
If you spot errors or have suggestions for improvement, please send email to Frans
Kaashoek and Robert Morris (kaashoek,rtm@csail.mit.edu).
Chapter 0
Operating system interfaces
The job of an operating system is to share a computer among multiple programs
and to provide a more useful set of services than the hardware alone supports. The
operating system manages and abstracts the low-level hardware, so that, for example, a
word processor need not concern itself with which type of disk hardware is being
used. It also shares the hardware among multiple programs so that they run (or appear to run) at the same time. Finally, operating systems provide controlled ways for
programs to interact, so that they can share data or work together.
An operating system provides services to user programs through an interface.
Designing a good interface turns out to be difficult. On the one hand, we would like
the interface to be simple and narrow because that makes it easier to get the implementation right. On the other hand, we may be tempted to offer many sophisticated
features to applications. The trick in resolving this tension is to design interfaces that
rely on a few mechanisms that can be combined to provide much generality.
This book uses a single operating system as a concrete example to illustrate operating system concepts. That operating system, xv6, provides the basic interfaces introduced by Ken Thompson and Dennis Ritchie’s Unix operating system, as well as mimicking Unix’s internal design. Unix provides a narrow interface whose mechanisms
combine well, offering a surprising degree of generality. This interface has been so
successful that modern operating systems—BSD, Linux, Mac OS X, Solaris, and even,
to a lesser extent, Microsoft Windows—have Unix-like interfaces. Understanding xv6
is a good start toward understanding any of these systems and many others.
As shown in Figure 0-1, xv6 takes the traditional form of a kernel, a special program that provides services to running programs. Each running program, called a
process, has memory containing instructions, data, and a stack. The instructions implement the program’s computation. The data are the variables on which the computation acts. The stack organizes the program’s procedure calls.
When a process needs to invoke a kernel service, it invokes a procedure call in
the operating system interface. Such a procedure is called a system call. The system
call enters the kernel; the kernel performs the service and returns. Thus a process alternates between executing in user space and kernel space.
The kernel uses the CPU’s hardware protection mechanisms to ensure that each
process executing in user space can access only its own memory. The kernel executes
with the hardware privileges required to implement these protections; user programs
execute without those privileges. When a user program invokes a system call, the
hardware raises the privilege level and starts executing a pre-arranged function in the
kernel.
The collection of system calls that a kernel provides is the interface that user programs see. The xv6 kernel provides a subset of the services and system calls that Unix
kernels traditionally offer. Figure 0-2 lists all of xv6’s system calls.
Figure 0-1. A kernel and two user processes.
The rest of this chapter outlines xv6’s services—processes, memory, file descriptors, pipes, and file system—and illustrates them with code snippets and discussions of how the shell, which is the primary user interface to traditional Unix-like systems, uses them. The shell’s use of system calls illustrates how carefully they have been designed.
The shell is an ordinary program that reads commands from the user and executes them. The fact that the shell is a user program, not part of the kernel, illustrates the power of the system call interface: there is nothing special about the shell. It also means that the shell is easy to replace; as a result, modern Unix systems have a variety of shells to choose from, each with its own user interface and scripting features. The xv6 shell is a simple implementation of the essence of the Unix Bourne shell. Its implementation can be found at line (8550).
Processes and memory
An xv6 process consists of user-space memory (instructions, data, and stack) and
per-process state private to the kernel. Xv6 can time-share processes: it transparently
switches the available CPUs among the set of processes waiting to execute. When a
process is not executing, xv6 saves its CPU registers, restoring them when it next runs
the process. The kernel associates a process identifier, or pid, with each process.
A process may create a new process using the fork system call. Fork creates a
new process, called the child process, with exactly the same memory contents as the
calling process, called the parent process. Fork returns in both the parent and the
child. In the parent, fork returns the child’s pid; in the child, it returns zero. For example, consider the following program fragment:
  int pid = fork();
  if(pid > 0){
    printf("parent: child=%d\n", pid);
    pid = wait();
    printf("child %d is done\n", pid);
  } else if(pid == 0){
    printf("child: exiting\n");
    exit();
  } else {
    printf("fork error\n");
  }
System call                  Description
fork()                       Create a process
exit()                       Terminate the current process
wait()                       Wait for a child process to exit
kill(pid)                    Terminate process pid
getpid()                     Return the current process’s pid
sleep(n)                     Sleep for n clock ticks
exec(filename, *argv)        Load a file and execute it
sbrk(n)                      Grow process’s memory by n bytes
open(filename, flags)        Open a file; the flags indicate read/write
read(fd, buf, n)             Read n bytes from an open file into buf
write(fd, buf, n)            Write n bytes to an open file
close(fd)                    Release open file fd
dup(fd)                      Duplicate fd
pipe(p)                      Create a pipe and return fd’s in p
chdir(dirname)               Change the current directory
mkdir(dirname)               Create a new directory
mknod(name, major, minor)    Create a device file
fstat(fd)                    Return info about an open file
link(f1, f2)                 Create another name (f2) for the file f1
unlink(filename)             Remove a file
Figure 0-2. Xv6 system calls

The exit system call causes the calling process to stop executing and to release resources such as memory and open files. The wait system call returns the pid of an exited child of the current process; if none of the caller’s children has exited, wait waits for one to do so. In the example, the output lines
parent: child=1234
child: exiting
might come out in either order, depending on whether the parent or child gets to its
printf call first. After the child exits the parent’s wait returns, causing the parent to
print
child 1234 is done
Although the child has the same memory contents as the parent initially, the parent
and child are executing with different memory and different registers: changing a variable in one does not affect the other. For example, when the return value of wait is
stored into pid in the parent process, it doesn’t change the variable pid in the child.
The value of pid in the child will still be zero.
The exec system call replaces the calling process’s memory with a new memory
image loaded from a file stored in the file system. The file must have a particular format, which specifies which part of the file holds instructions, which part is data, at
which instruction to start, etc. xv6 uses the ELF format, which Chapter 2 discusses in
more detail. When exec succeeds, it does not return to the calling program; instead,
the instructions loaded from the file start executing at the entry point declared in the
ELF header. Exec takes two arguments: the name of the file containing the executable
and an array of string arguments. For example:
  char *argv[3];

  argv[0] = "echo";
  argv[1] = "hello";
  argv[2] = 0;
  exec("/bin/echo", argv);
  printf("exec error\n");
This fragment replaces the calling program with an instance of the program
/bin/echo running with the argument list echo hello. Most programs ignore the first
argument, which is conventionally the name of the program.
The xv6 shell uses the above calls to run programs on behalf of users. The main
structure of the shell is simple; see main (8701). The main loop reads a line of input
from the user with getcmd. Then it calls fork, which creates a copy of the shell process. The parent calls wait, while the child runs the command. For example, if the
user had typed ‘‘echo hello’’ to the shell, runcmd would have been called with ‘‘echo
hello’’ as the argument. runcmd (8606) runs the actual command. For ‘‘echo hello’’, it
would call exec (8626). If exec succeeds then the child will execute instructions from
echo instead of runcmd. At some point echo will call exit, which will cause the parent to return from wait in main (8701). You might wonder why fork and exec are not combined in a single call; we will see later that having separate calls for creating a process and loading a program is a clever design.
Xv6 allocates most user-space memory implicitly: fork allocates the memory required for the child’s copy of the parent’s memory, and exec allocates enough memory
to hold the executable file. A process that needs more memory at run-time (perhaps
for malloc) can call sbrk(n) to grow its data memory by n bytes; sbrk returns the
location of the new memory.
Xv6 does not provide a notion of users or of protecting one user from another; in
Unix terms, all xv6 processes run as root.
I/O and File descriptors
A file descriptor is a small integer representing a kernel-managed object that
a process may read from or write to. A process may obtain a file descriptor by opening a file, directory, or device, or by creating a pipe, or by duplicating an existing descriptor. For simplicity we’ll often refer to the object a file descriptor refers to as a
‘‘file’’; the file descriptor interface abstracts away the differences between files, pipes,
and devices, making them all look like streams of bytes.
Internally, the xv6 kernel uses the file descriptor as an index into a per-process table, so that every process has a private space of file descriptors starting at zero. By
convention, a process reads from file descriptor 0 (standard input), writes output to file
descriptor 1 (standard output), and writes error messages to file descriptor 2 (standard
error). As we will see, the shell exploits the convention to implement I/O redirection
and pipelines. The shell ensures that it always has three file descriptors open (8707),
which are by default file descriptors for the console.
The read and write system calls read bytes from and write bytes to open files
named by file descriptors. The call read(fd, buf, n) reads at most n bytes from the
file descriptor fd, copies them into buf, and returns the number of bytes read. Each
file descriptor that refers to a file has an offset associated with it. Read reads data
from the current file offset and then advances that offset by the number of bytes read:
a subsequent read will return the bytes following the ones returned by the first read.
When there are no more bytes to read, read returns zero to signal the end of the file.
The call write(fd, buf, n) writes n bytes from buf to the file descriptor fd and
returns the number of bytes written. Fewer than n bytes are written only when an error occurs. Like read, write writes data at the current file offset and then advances
that offset by the number of bytes written: each write picks up where the previous
one left off.
The following program fragment (which forms the essence of cat) copies data
from its standard input to its standard output. If an error occurs, it writes a message
to the standard error.
  char buf[512];
  int n;

  for(;;){
    n = read(0, buf, sizeof buf);
    if(n == 0)
      break;
    if(n < 0){
      fprintf(2, "read error\n");
      exit();
    }
    if(write(1, buf, n) != n){
      fprintf(2, "write error\n");
      exit();
    }
  }
The important thing to note in the code fragment is that cat doesn’t know whether it
is reading from a file, console, or a pipe. Similarly cat doesn’t know whether it is
printing to a console, a file, or whatever. The use of file descriptors and the convention that file descriptor 0 is input and file descriptor 1 is output allows a simple implementation of cat.
The close system call releases a file descriptor, making it free for reuse by a future open, pipe, or dup system call (see below). A newly allocated file descriptor is always the lowest-numbered unused descriptor of the current process.
File descriptors and fork interact to make I/O redirection easy to implement.
Fork copies the parent’s file descriptor table along with its memory, so that the child
starts with exactly the same open files as the parent. The system call exec replaces the
calling process’s memory but preserves its file table. This behavior allows the shell to
implement I/O redirection by forking, reopening chosen file descriptors, and then execing the new program. Here is a simplified version of the code a shell runs for the
command cat < input.txt:
  char *argv[2];

  argv[0] = "cat";
  argv[1] = 0;
  if(fork() == 0) {
    close(0);
    open("input.txt", O_RDONLY);
    exec("cat", argv);
  }
After the child closes file descriptor 0, open is guaranteed to use that file descriptor for
the newly opened input.txt: 0 will be the smallest available file descriptor. Cat then
executes with file descriptor 0 (standard input) referring to input.txt.
The code for I/O redirection in the xv6 shell works in exactly this way (8630). Recall that at this point in the code the shell has already forked the child shell and that runcmd will call exec to load the new program. Now it should be clear why it is a good idea that fork and exec are separate calls: because they are separate, the shell can fork a child, use open, close, and dup in the child to change the standard input and output file descriptors, and then exec. No changes to the program being exec-ed (cat
in our example) are required. If fork and exec were combined into a single system
call, some other (probably more complex) scheme would be required for the shell to
redirect standard input and output, or the program itself would have to understand
how to redirect I/O.
Although fork copies the file descriptor table, each underlying file offset is shared
between parent and child. Consider this example:
  if(fork() == 0) {
    write(1, "hello ", 6);
    exit();
  } else {
    wait();
    write(1, "world\n", 6);
  }
At the end of this fragment, the file attached to file descriptor 1 will contain the data
hello world. The write in the parent (which, thanks to wait, runs only after the
child is done) picks up where the child’s write left off. This behavior helps produce
sequential output from sequences of shell commands, like (echo hello; echo world)
>output.txt.
The dup system call duplicates an existing file descriptor, returning a new one that
refers to the same underlying I/O object. Both file descriptors share an offset, just as
the file descriptors duplicated by fork do. This is another way to write hello world
into a file:
  fd = dup(1);
  write(1, "hello ", 6);
  write(fd, "world\n", 6);
Two file descriptors share an offset if they were derived from the same original
file descriptor by a sequence of fork and dup calls. Otherwise file descriptors do not
share offsets, even if they resulted from open calls for the same file. Dup allows shells
to implement commands like this: ls existing-file non-existing-file > tmp1 2>&1. The 2>&1 tells the shell to give the command a file descriptor 2 that is a duplicate of descriptor 1. Both the name of the existing file and the error message for the non-existing file will show up in the file tmp1. The xv6 shell doesn’t support I/O redirection for the error file descriptor, but now you know how to implement it.
File descriptors are a powerful abstraction, because they hide the details of what
they are connected to: a process writing to file descriptor 1 may be writing to a file, to
a device like the console, or to a pipe.
Pipes
A pipe is a small kernel buffer exposed to processes as a pair of file descriptors,
one for reading and one for writing. Writing data to one end of the pipe makes that
data available for reading from the other end of the pipe. Pipes provide a way for
processes to communicate.
The following example code runs the program wc with standard input connected
to the read end of a pipe.
  int p[2];
  char *argv[2];

  argv[0] = "wc";
  argv[1] = 0;

  pipe(p);
  if(fork() == 0) {
    close(0);
    dup(p[0]);
    close(p[0]);
    close(p[1]);
    exec("/bin/wc", argv);
  } else {
    close(p[0]);
    write(p[1], "hello world\n", 12);
    close(p[1]);
  }
The program calls pipe, which creates a new pipe and records the read and write file
descriptors in the array p. After fork, both parent and child have file descriptors referring to the pipe. The child dups the read end onto file descriptor 0, closes the file descriptors in p, and execs wc. When wc reads from its standard input, it reads from the
pipe. The parent closes the read side of the pipe, writes to the pipe, and then closes
the write side.
If no data is available, a read on a pipe waits for either data to be written or all
file descriptors referring to the write end to be closed; in the latter case, read will return 0, just as if the end of a data file had been reached. The fact that read blocks
until it is impossible for new data to arrive is one reason that it’s important for the
child to close the write end of the pipe before executing wc above: if one of wc’s file
descriptors referred to the write end of the pipe, wc would never see end-of-file.
The xv6 shell implements pipelines such as grep fork sh.c | wc -l in a manner similar to the above code (8650). The child process creates a pipe to connect the left end of the pipeline with the right end. Then it calls fork and runcmd for the left end of the pipeline and fork and runcmd for the right end, and waits for both to finish. The right end of the pipeline may be a command that itself includes a pipe (e.g.,
a | b | c), which itself forks two new child processes (one for b and one for c). Thus,
the shell may create a tree of processes. The leaves of this tree are commands and the
interior nodes are processes that wait until the left and right children complete. In
principle, you could have the interior nodes run the left end of a pipeline, but doing so
correctly would complicate the implementation.
Pipes may seem no more powerful than temporary files: the pipeline
   echo hello world | wc
could be implemented without pipes as
   echo hello world >/tmp/xyz; wc </tmp/xyz
Each process’s address space maps the kernel’s instructions and data as well as the user program’s memory. When a process invokes a system call, the system call executes in the kernel mappings of the process’s address space. This arrangement exists so that the kernel’s system call code can directly refer to user memory. In order to leave plenty of room for user memory, xv6’s address spaces map the kernel at high addresses, starting at 0x80100000.
The xv6 kernel maintains many pieces of state for each process, which it gathers
into a struct proc (2337). A process’s most important pieces of kernel state are its
page table, its kernel stack, and its run state. We’ll use the notation p->xxx to refer to
elements of the proc structure.
Each process has a thread of execution (or thread for short) that executes the
process’s instructions. A thread can be suspended and later resumed. To switch transparently between processes, the kernel suspends the currently running thread and resumes another process’s thread. Much of the state of a thread (local variables, function
call return addresses) is stored on the thread’s stacks. Each process has two stacks: a
user stack and a kernel stack (p->kstack). When the process is executing user instructions, only its user stack is in use, and its kernel stack is empty. When the process enters the kernel (for a system call or interrupt), the kernel code executes on the
process’s kernel stack; while a process is in the kernel, its user stack still contains saved
data, but isn’t actively used. A process’s thread alternates between actively using its
user stack and its kernel stack. The kernel stack is separate (and protected from user
code) so that the kernel can execute even if a process has wrecked its user stack.
When a process makes a system call, the processor switches to the kernel stack,
raises the hardware privilege level, and starts executing the kernel instructions that implement the system call. When the system call completes, the kernel returns to user
space: the hardware lowers its privilege level, switches back to the user stack, and resumes executing user instructions just after the system call instruction. A process’s thread can ‘‘block’’ in the kernel to wait for I/O, and resume where it left off when the I/O has finished.
p->state indicates whether the process is allocated, ready to run, running, waiting for I/O, or exiting.
p->pgdir holds the process’s page table, in the format that the x86 hardware expects. xv6 causes the paging hardware to use a process’s p->pgdir when executing
that process. A process’s page table also serves as the record of the addresses of the
physical pages allocated to store the process’s memory.
Code: the first address space
To make the xv6 organization more concrete, we’ll look at how the kernel creates the first
address space (for itself), how the kernel creates and starts the first process, and how
that process performs the first system call. By tracing these operations we see in detail
how xv6 provides strong isolation for processes. The first step in providing strong isolation is setting up the kernel to run in its own address space.
When a PC powers on, it initializes itself and then loads a boot loader from
disk into memory and executes it. Appendix B explains the details. Xv6’s boot loader
loads the xv6 kernel from disk and executes it starting at entry (1044). The x86 paging
hardware is not enabled when the kernel starts; virtual addresses map directly to physical addresses.
The boot loader loads the xv6 kernel into memory at physical address 0x100000.
The reason it doesn’t load the kernel at 0x80100000, where the kernel expects to find
its instructions and data, is that there may not be any physical memory at such a high
address on a small machine. The reason it places the kernel at 0x100000 rather than 0x0 is that the address range 0xa0000:0x100000 contains I/O devices.
To allow the rest of the kernel to run, entry sets up a page table that maps virtual addresses starting at 0x80000000 (called KERNBASE (0207)) to physical addresses starting at 0x0 (see Figure 1-2). Setting up two ranges of virtual addresses that map to the
same physical memory range is a common use of page tables, and we will see more
examples like this one.
The entry page table is defined in main.c (1306). We look at the details of page tables in Chapter 2, but the short story is that entry 0 maps virtual addresses
0:0x400000 to physical addresses 0:0x400000. This mapping is required as long as
entry is executing at low addresses, but will eventually be removed.
Entry 512 maps virtual addresses KERNBASE:KERNBASE+0x400000 to physical addresses 0:0x400000. This entry will be used by the kernel after entry has finished; it
maps the high virtual addresses at which the kernel expects to find its instructions and
data to the low physical addresses where the boot loader loaded them. This mapping
restricts the kernel instructions and data to 4 Mbytes.
Returning to entry, it loads the physical address of entrypgdir into control register %cr3. The value in %cr3 must be a physical address. It wouldn’t make sense for
%cr3 to hold the virtual address of entrypgdir, because the paging hardware doesn’t
know how to translate virtual addresses yet; it doesn’t have a page table yet.

Figure 1-3. Layout of a virtual address space

The symbol entrypgdir refers to an address in high memory, and the macro V2P_WO (0213) subtracts KERNBASE in order to find the physical address. To enable the paging hardware, xv6 sets the flag CR0_PG in the control register %cr0.
The processor is still executing instructions at low addresses after paging is enabled, which works since entrypgdir maps low addresses. If xv6 had omitted entry 0 from entrypgdir, the computer would have crashed when trying to execute the instruction after the one that enabled paging.
Now entry needs to transfer to the kernel’s C code, and run it in high memory.
First it makes the stack pointer, %esp, point to memory to be used as a stack (1058). All
symbols have high addresses, including stack, so the stack will still be valid even
when the low mappings are removed. Finally entry jumps to main, which is also a
high address. The indirect jump is needed because the assembler would otherwise
generate a PC-relative direct jump, which would execute the low-memory version of
main. Main cannot return, since there’s no return PC on the stack. Now the kernel
is running in high addresses in the function main (1217).
Code: creating the first process
Now we’ll look at how the kernel creates user-level processes and ensures that
they are strongly isolated.
After main (1217) initializes several devices and subsystems, it creates the first process by calling userinit (2520). Userinit’s first action is to call allocproc. The job
of allocproc (2473) is to allocate a slot (a struct proc) in the process table and to
initialize the parts of the process’s state required for its kernel thread to execute. Allocproc is called for each new process, while userinit is called only for the very first
process. Allocproc scans the proc table for a slot with state UNUSED (2480-2482). When
Figure 1-4. A new kernel stack: the trap frame (p->tf) at the top, below it the address of trapret, then the saved context (p->context), with the rest of the page empty down to p->kstack.
it finds an unused slot, allocproc sets the state to EMBRYO to mark it as used and gives the process a unique pid (2469-2489). Next, it tries to allocate a kernel stack for the process’s kernel thread. If the memory allocation fails, allocproc changes the state back to UNUSED and returns zero to signal failure.
Now allocproc must set up the new process’s kernel stack. allocproc is written
so that it can be used by fork as well as when creating the first process. allocproc
sets up the new process with a specially prepared kernel stack and set of kernel registers that cause it to ‘‘return’’ to user space when it first runs. The layout of the prepared kernel stack will be as shown in Figure 1-4. allocproc does part of this work
by setting up return program counter values that will cause the new process’s kernel
thread to first execute in forkret and then in trapret (2507-2512). The kernel thread
will start executing with register contents copied from p->context. Thus setting p->context->eip to forkret will cause the kernel thread to execute at the start of
forkret (2853). This function will return to whatever address is at the bottom of the
stack. The context switch code (3058) sets the stack pointer to point just beyond the
end of p->context. allocproc places p->context on the stack, and puts a pointer to
trapret just above it; that is where forkret will return. trapret restores user registers from values stored at the top of the kernel stack and jumps into the process (3324).
This setup is the same for ordinary fork and for creating the first process, though in
the latter case the process will start executing at user-space location zero rather than at
a return from fork.
As we will see in Chapter 3, the way that control transfers from user software to
the kernel is via an interrupt mechanism, which is used by system calls, interrupts, and
exceptions. Whenever control transfers into the kernel while a process is running, the
hardware and xv6 trap entry code save user registers on the process’s kernel stack. userinit writes values at the top of the new stack that look just like those that would be there if the process had entered the kernel via an interrupt (2533-2539), so that the ordinary code for returning from the kernel back to the process’s user code will work. These values are a struct trapframe which stores the user registers. Now the new process’s kernel stack is completely prepared as shown in Figure 1-4.
The first process is going to execute a small program (initcode.S; (8400)). The process needs physical memory in which to store this program, the program needs to be copied to that memory, and the process needs a page table that maps user-space addresses to that memory.
userinit calls setupkvm (1818) to create a page table for the process with (at first) mappings only for memory that the kernel uses. We will study this function in detail in Chapter 2, but at a high level setupkvm and userinit create an address space as shown in Figure 1-2.
The initial contents of the first process’s user-space memory are the compiled form of initcode.S; as part of the kernel build process, the linker embeds that binary in the kernel and defines two special symbols, _binary_initcode_start and _binary_initcode_size, indicating the location and size of the binary. Userinit copies that binary into the new process’s memory by calling inituvm, which allocates one page of physical memory, maps virtual address zero to that memory, and copies the binary to that page (1886).
Then userinit sets up the trap frame (0602) with the initial user mode state: the
%cs register contains a segment selector for the SEG_UCODE segment running at privilege level DPL_USER (i.e., user mode rather than kernel mode), and similarly %ds, %es,
and %ss use SEG_UDATA with privilege DPL_USER. The %eflags FL_IF bit is set to allow hardware interrupts; we will reexamine this in Chapter 3.
The stack pointer %esp is set to the process’s largest valid virtual address, p->sz.
The instruction pointer is set to the entry point for the initcode, address 0.
The function userinit sets p->name to initcode mainly for debugging. Setting
p->cwd sets the process’s current working directory; we will examine namei in detail in
Chapter 6.
Once the process is initialized, userinit marks it available for scheduling by setting p->state to RUNNABLE.
Code: Running the first process
Now that the first process’s state is prepared, it is time to run it. After main calls
userinit, mpmain calls scheduler to start running processes (1257). Scheduler (2758)
looks for a process with p->state set to RUNNABLE, and there’s only one: initproc. It
sets the per-cpu variable proc to the process it found and calls switchuvm to tell the
hardware to start using the target process’s page table (1879). Changing page tables
while executing in the kernel works because setupkvm causes all processes’ page tables
to have identical mappings for kernel code and data. switchuvm also sets up a task
state segment SEG_TSS that instructs the hardware to execute system calls and interrupts on the process’s kernel stack. We will re-examine the task state segment in
Chapter 3.
scheduler now sets p->state to RUNNING and calls swtch (3058) to perform a context switch to the target process’s kernel thread. swtch first saves the current registers. The current context is not a process but rather a special per-cpu scheduler context, so scheduler tells swtch to save the current hardware registers in per-cpu storage (cpu->scheduler) rather than in any process’s kernel thread context. swtch then loads the saved registers of the target kernel thread (p->context) into the x86 hardware registers, including the stack pointer and instruction pointer. We’ll examine swtch in more detail in Chapter 5. The final ret instruction (3077) pops the target process’s %eip from the stack, finishing the context switch. Now the processor is running on the kernel stack of process p.
Allocproc had previously set initproc’s p->context->eip to forkret, so the ret starts executing forkret. On the first invocation (that is this one), forkret (2853) runs initialization functions that cannot be run from main because they must be run in the context of a regular process with its own kernel stack. Then, forkret returns. Allocproc arranged that the top word on the stack, after p->context is popped off, would be trapret, so now trapret begins executing, with %esp set to p->tf. Trapret (3324) uses pop instructions to restore registers from the trap frame (0602) just as swtch did with the kernel context: popal restores the general registers, then the popl instructions restore %gs, %fs, %es, and %ds. The addl skips over the two fields trapno and errcode. Finally, the iret instruction pops %cs, %eip, %flags, %esp, and %ss from the stack. The contents of the trap frame have been transferred to the CPU state, so the processor continues at the %eip specified in the trap frame. For initproc, that means virtual address zero, the first instruction of initcode.S.
At this point, %eip holds zero and %esp holds 4096. These are virtual addresses
in the process’s address space. The processor’s paging hardware translates them into
physical addresses. allocuvm has set up the process’s page table so that virtual address
zero refers to the physical memory allocated for this process, and set a flag (PTE_U)
that tells the paging hardware to allow user code to access that memory. The fact that
userinit (2533) set up the low bits of %cs to run the process’s user code at CPL=3
means that the user code can only use pages with PTE_U set, and cannot modify sensitive hardware registers such as %cr3. So the process is constrained to using only its
own memory.
The first system call: exec
Now that we have seen how the kernel provides strong isolation for processes, let’s
look at how a user-level process re-enters the kernel to ask for services that it cannot
perform itself.
The first action of initcode.S is to invoke the exec system call. As we saw in
Chapter 0, exec replaces the memory and registers of the current process with a new
program, but it leaves the file descriptors, process id, and parent process unchanged.
Initcode.S (8409) begins by pushing three values on the stack—$argv, $init,
and $0—and then sets %eax to SYS_exec and executes int T_SYSCALL: it is asking the
kernel to run the exec system call. If all goes well, exec never returns: it starts run-
ning the program named by $init, which is a pointer to the NUL-terminated string /init (8422-8424). The other argument is the argv array of command-line arguments; the zero at the end of the array marks its end. If the exec fails and does return, initcode loops calling the exit system call, which definitely should not return (8416-8420).
This code manually crafts the first system call to look like an ordinary system
call, which we will see in Chapter 3. As before, this setup avoids special-casing the
first process (in this case, its first system call), and instead reuses code that xv6 must
provide for standard operation.
Chapter 2 will cover the implementation of exec in detail, but at a high level it
replaces initcode with the /init binary, loaded out of the file system. Now initcode (8400) is done, and the process will run /init instead. Init (8510) creates a new
console device file if needed and then opens it as file descriptors 0, 1, and 2. Then it
loops, starting a console shell, handling orphaned zombies until the shell exits, and repeating. The system is up.
Real world
Most operating systems have adopted the process concept, and most processes
look similar to xv6’s. A real operating system would find free proc structures with an
explicit free list in constant time instead of the linear-time search in allocproc; xv6
uses the linear scan (the first of many) for simplicity.
xv6’s address space layout has the defect that it cannot make use of more than 2
GB of physical RAM. It’s possible to fix this, though the best plan would be to switch
to a machine with 64-bit addresses.
Exercises
1. Set a breakpoint at swtch. Single step with gdb’s stepi through the ret to forkret,
then use gdb’s finish to proceed to trapret, then stepi until you get to initcode
at virtual address zero.
2. KERNBASE limits the amount of memory a single process can use, which might be
irritating on a machine with a full 4 GB of RAM. Would raising KERNBASE allow a
process to use more memory?
Chapter 2
Page tables
Page tables are the mechanism through which the operating system controls what
memory addresses mean. They allow xv6 to multiplex the address spaces of different
processes onto a single physical memory, and to protect the memories of different processes. The level of indirection provided by page tables allows many neat tricks. xv6
uses page tables primarily to multiplex address spaces and to protect memory. It also
uses a few simple page-table tricks: mapping the same memory (the kernel) in several
address spaces, mapping the same memory more than once in one address space (each
user page is also mapped into the kernel’s physical view of memory), and guarding a
user stack with an unmapped page. The rest of this chapter explains the page tables
that the x86 hardware provides and how xv6 uses them. Compared to a real-world
operating system, xv6’s design is restrictive, but it does illustrate the key ideas.
Paging hardware
As a reminder, x86 instructions (both user and kernel) manipulate virtual addresses.
The machine’s RAM, or physical memory, is indexed with physical addresses. The x86
page table hardware connects these two kinds of addresses, by mapping each virtual
address to a physical address.
An x86 page table is logically an array of 2^20 (1,048,576) page table entries
(PTEs). Each PTE contains a 20-bit physical page number (PPN) and some flags. The
paging hardware translates a virtual address by using its top 20 bits to index into the
page table to find a PTE, and replacing the address’s top 20 bits with the PPN in the
PTE. The paging hardware copies the low 12 bits unchanged from the virtual to the
translated physical address. Thus a page table gives the operating system control over
virtual-to-physical address translations at the granularity of aligned chunks of 4096
(2^12) bytes. Such a chunk is called a page.
As shown in Figure 2-1, the actual translation happens in two steps. A page table
is stored in physical memory as a two-level tree. The root of the tree is a 4096-byte
page directory that contains 1024 PTE-like references to page table pages. Each
page table page is an array of 1024 32-bit PTEs. The paging hardware uses the top 10
bits of a virtual address to select a page directory entry. If the page directory entry is
present, the paging hardware uses the next 10 bits of the virtual address to select a
PTE from the page table page that the page directory entry refers to. If either the
page directory entry or the PTE is not present, the paging hardware raises a fault.
This two-level structure allows a page table to omit entire page table pages in the common case in which large ranges of virtual addresses have no mappings.
Each PTE contains flag bits that tell the paging hardware how the associated virtual address is allowed to be used. PTE_P indicates whether the PTE is present: if it is
[Figure 2-1 shows the x86 page table hardware: %cr3 points to a 1024-entry page directory; the top 10 bits of a virtual address index the page directory, the next 10 bits index a page table page, and the low 12 bits are the page offset. Each 32-bit entry holds a 20-bit physical page number plus flag bits: P (present), W (writable), U (user), WT (1=write-through, 0=write-back), CD (cache disabled), A (accessed), D (dirty; 0 in page directory), and AVL (available for system use). Page table and page directory entries are identical except for the D bit.]
Figure 2-1. x86 page table hardware.
not set, a reference to the page causes a fault (i.e., is not allowed). PTE_W controls whether instructions are allowed to issue writes to the page; if not set, only reads and
instruction fetches are allowed. PTE_U controls whether user programs are allowed to
use the page; if clear, only the kernel is allowed to use the page. Figure 2-1 shows how
it all works. The flags and all other page hardware related structures are defined in
mmu.h (0700).
A few notes about terms. Physical memory refers to storage cells in DRAM. A
byte of physical memory has an address, called a physical address. Instructions use
only virtual addresses, which the paging hardware translates to physical addresses, and
then sends to the DRAM hardware to read or write storage. At this level of discussion
there is no such thing as virtual memory, only virtual addresses.
Process address space
The page table created by entry has enough mappings to allow the kernel’s C
code to start running. However, main immediately changes to a new page table by
calling kvmalloc (1840), because the kernel has a more elaborate plan for describing process address spaces.
Each process has a separate page table, and xv6 tells the page table hardware to
switch page tables when xv6 switches between processes. As shown in Figure 2-2, a
process’s user memory starts at virtual address zero and can grow up to KERNBASE, allowing a process to address up to 2 GB of memory. The file memlayout.h (0200) declares the constants for xv6’s memory layout, and macros to convert virtual to physical addresses.
[Figure 2-2 shows the layout of a process’s virtual address space alongside the physical address space: user memory (text, data, stack, heap, each mapped RWU) occupies addresses from 0 up to KERNBASE (at most 2 GB); above KERNBASE the kernel maps physical memory 0:PHYSTOP (kernel text R--, kernel data and free memory RW-) and device memory at 0xFE000000.]
Figure 2-2. Layout of the virtual address space of a process and the layout of the physical address space. Note that if a machine has more than 2 Gbyte of physical memory, xv6 can use only the memory that fits between KERNBASE and 0xFE000000.
When a process asks xv6 for more memory, xv6 first finds free physical pages to
provide the storage, and then adds PTEs to the process’s page table that point to the
new physical pages. xv6 sets the PTE_U, PTE_W, and PTE_P flags in these PTEs. Most
processes do not use the entire user address space; xv6 leaves PTE_P clear in unused
PTEs. Different processes’ page tables translate user addresses to different pages of
physical memory, so that each process has private user memory.
Xv6 includes all mappings needed for the kernel to run in every process’s page table; these mappings all appear above KERNBASE. It maps virtual addresses KERNBASE:KERNBASE+PHYSTOP to 0:PHYSTOP. One reason for this mapping is so that the
kernel can use its own instructions and data. Another reason is that the kernel sometimes needs to be able to write a given page of physical memory, for example when
creating page table pages; having every physical page appear at a predictable virtual
address makes this convenient. A defect of this arrangement is that xv6 cannot make
use of more than 2 GB of physical memory. Some devices that use memory-mapped
I/O appear at physical addresses starting at 0xFE000000, so xv6 page tables include
a direct mapping for them. Xv6 does not set the PTE_U flag in the PTEs above KERNBASE, so only the kernel can use them.
Having every process’s page table contain mappings for both user memory and the entire kernel is convenient when switching from user code to kernel code during system calls and interrupts: such switches do not require page table switches. For the most part the kernel does not have its own page table; it is almost always borrowing some process’s page table.
To review, xv6 ensures that each process can use only its own memory. And, each process sees its memory as having contiguous virtual addresses starting at zero, while
the process’s physical memory can be non-contiguous. xv6 implements the first by setting the PTE_U bit only on PTEs of virtual addresses that refer to the process’s own
memory. It implements the second using the ability of page tables to translate successive virtual addresses to whatever physical pages happen to be allocated to the process.
Code: creating an address space
main calls kvmalloc (1840) to create and switch to a page table with the mappings
above KERNBASE required for the kernel to run. Most of the work happens in setupkvm (1818). It first allocates a page of memory to hold the page directory. Then it calls
mappages to install the translations that the kernel needs, which are described in the
kmap (1809) array. The translations include the kernel’s instructions and data, physical
memory up to PHYSTOP, and memory ranges which are actually I/O devices. setupkvm does not install any mappings for the user memory; this will happen later.
mappages (1760) installs mappings into a page table for a range of virtual addresses
to a corresponding range of physical addresses. It does this separately for each virtual
address in the range, at page intervals. For each virtual address to be mapped, mappages calls walkpgdir to find the address of the PTE for that address. It then initializes the PTE to hold the relevant physical page number, the desired permissions (
PTE_W and/or PTE_U), and PTE_P to mark the PTE as valid (1772).
walkpgdir (1735) mimics the actions of the x86 paging hardware as it looks up
the PTE for a virtual address (see Figure 2-1). walkpgdir uses the upper 10 bits of
the virtual address to find the page directory entry (1740). If the page directory entry
isn’t present, then the required page table page hasn’t yet been allocated; if the alloc
argument is set, walkpgdir allocates it and puts its physical address in the page directory. Finally it uses the next 10 bits of the virtual address to find the address of the
PTE in the page table page (1753).
Physical memory allocation
The kernel must allocate and free physical memory at run-time for page tables,
process user memory, kernel stacks, and pipe buffers.
xv6 uses the physical memory between the end of the kernel and PHYSTOP for
run-time allocation. It allocates and frees whole 4096-byte pages at a time. It keeps
track of which pages are free by threading a linked list through the pages themselves.
Allocation consists of removing a page from the linked list; freeing consists of adding
the freed page to the list.
There is a bootstrap problem: all of physical memory must be mapped in order for the allocator to initialize the free list, but creating a page table with those mappings involves allocating page-table pages. xv6 solves this problem by using a separate page allocator during entry, which allocates memory just after the end of the kernel’s data segment. This allocator does not support freeing and is limited by the 4 MB mapping in the entrypgdir, but that is sufficient to allocate the first kernel page table.
Code: Physical memory allocator
The allocator’s data structure is a free list of physical memory pages that are available for allocation. Each free page’s list element is a struct run (3115). Where does
the allocator get the memory to hold that data structure? It stores each free page’s run
structure in the free page itself, since there’s nothing else stored there. The free list is
protected by a spin lock (3119-3123). The list and the lock are wrapped in a struct to
make clear that the lock protects the fields in the struct. For now, ignore the lock and
the calls to acquire and release; Chapter 4 will examine locking in detail.
The function main calls kinit1 and kinit2 to initialize the allocator (3131). The
reason for having two calls is that for much of main one cannot use locks or memory
above 4 megabytes. The call to kinit1 sets up for lock-less allocation in the first 4
megabytes, and the call to kinit2 enables locking and arranges for more memory to
be allocatable. main ought to determine how much physical memory is available, but
this turns out to be difficult on the x86. Instead it assumes that the machine has 240
megabytes (PHYSTOP) of physical memory, and uses all the memory between the end
of the kernel and PHYSTOP as the initial pool of free memory. kinit1 and kinit2 call
freerange to add memory to the free list via per-page calls to kfree. A PTE can only
refer to a physical address that is aligned on a 4096-byte boundary (is a multiple of
4096), so freerange uses PGROUNDUP to ensure that it frees only aligned physical addresses. The allocator starts with no memory; these calls to kfree give it some to
manage.
The allocator refers to physical pages by their virtual addresses as mapped in high
memory, not by their physical addresses, which is why kinit uses P2V(PHYSTOP) to
translate PHYSTOP (a physical address) to a virtual address. The allocator sometimes
treats addresses as integers in order to perform arithmetic on them (e.g., traversing all
pages in kinit), and sometimes uses addresses as pointers to read and write memory
(e.g., manipulating the run structure stored in each page); this dual use of addresses is
the main reason that the allocator code is full of C type casts. The other reason is
that freeing and allocation inherently change the type of the memory.
The function kfree (3164) begins by setting every byte in the memory being freed
to the value 1. This will cause code that uses memory after freeing it (uses ‘‘dangling
references’’) to read garbage instead of the old valid contents; hopefully that will cause
such code to break faster. Then kfree casts v to a pointer to struct run, records the
old start of the free list in r->next, and sets the free list equal to r. kalloc removes
and returns the first element in the free list.
[Figure 2-3 shows the user memory layout: text at address 0, then data, an unmapped guard page, the single-page stack, and the heap above the stack growing up toward KERNBASE. The initial stack holds, from the top down: the nul-terminated argument strings; a null-terminated array of their addresses (argv[0] through argv[argc]); then the address of that array (the argv argument of main), argc, and a fake return PC for main.]
Figure 2-3. Memory layout of a user process with its initial stack.
User part of an address space
Figure 2-3 shows the layout of the user memory of an executing process in xv6.
Each user process starts at address 0. The bottom of the address space contains the
text for the user program, its data, and its stack. The heap is above the stack so that
the heap can expand when the process calls sbrk. Note that the text, data, and stack
sections are laid out contiguously in the process’s address space but xv6 is free to use
non-contiguous physical pages for those sections. For example, when xv6 expands a
process’s heap, it can use any free physical page for the new virtual page and then program the page table hardware to map the virtual page to the allocated physical page.
This flexibility is a major advantage of using paging hardware.
The stack is a single page, and is shown with the initial contents as created by exec. Strings containing the command-line arguments, as well as an array of pointers to
them, are at the very top of the stack. Just under that are values that allow a program
to start at main as if the function call main(argc, argv) had just started. To guard a
stack growing off the stack page, xv6 places a guard page right below the stack. The
guard page is not mapped and so if the stack runs off the stack page, the hardware
will generate an exception because it cannot translate the faulting address. A real-world operating system might allocate more space for the stack so that it can grow beyond one page.
Code: sbrk
Sbrk is the system call for a process to shrink or grow its memory. The system call is
implemented by the function growproc (2558). If n is positive, growproc allocates one
or more physical pages and maps them at the top of the process’s address space. If n is negative, growproc unmaps one or more pages from the process’s address space and frees the corresponding physical pages. To make these changes, xv6 modifies the process’s page table. The process’s page table is stored in memory, and so the kernel can update the table with ordinary assignment statements, which is what allocuvm and deallocuvm do. The x86 hardware caches page table entries in a Translation Look-aside Buffer (TLB), and when xv6 changes the page tables, it must invalidate the cached entries. If it didn’t invalidate the cached entries, then at some point later the TLB might use an old mapping, pointing to a physical page that in the meantime has been allocated to another process, and as a result, a process might be able to scribble on some other process’s memory. Xv6 invalidates stale cached entries by reloading cr3, the register that holds the address of the current page table.
Code: exec
Exec is the system call that creates the user part of an address space. It initializes the
user part of an address space from a file stored in the file system. Exec (6610) opens
the named binary path using namei (6623), which is explained in Chapter 6. Then, it
reads the ELF header. Xv6 applications are described in the widely-used ELF format,
defined in elf.h. An ELF binary consists of an ELF header, struct elfhdr (0955), followed by a sequence of program section headers, struct proghdr (0974). Each proghdr describes a section of the application that must be loaded into memory; xv6 programs have only one program section header, but other systems might have separate
sections for instructions and data.
The first step is a quick check that the file probably contains an ELF binary. An
ELF binary starts with the four-byte ‘‘magic number’’ 0x7F, ’E’, ’L’, ’F’, or
ELF_MAGIC (0952). If the ELF header has the right magic number, exec assumes that
the binary is well-formed.
Exec allocates a new page table with no user mappings with setupkvm (6637), allocates memory for each ELF segment with allocuvm (6651), and loads each segment into
memory with loaduvm (6655). allocuvm checks that the virtual addresses requested are
below KERNBASE. loaduvm (1903) uses walkpgdir to find the physical address of the allocated memory at which to write each page of the ELF segment, and readi to read
from the file.
The program section header for /init, the first user program created with exec,
looks like this:
# objdump -p _init

_init:     file format elf32-i386

Program Header:
    LOAD off    0x00000054 vaddr 0x00000000 paddr 0x00000000 align 2**2
         filesz 0x000008c0 memsz 0x000008cc flags rwx
The program section header’s filesz may be less than the memsz, indicating that
the gap between them should be filled with zeroes (for C global variables) rather than
read from the file. For /init, filesz is 2240 bytes and memsz is 2252 bytes, and thus allocuvm allocates enough physical memory to hold 2252 bytes, but reads only 2240 bytes from the file /init.
Now exec allocates and initializes the user stack. It allocates just one stack page.
Exec copies the argument strings to the top of the stack one at a time, recording the
pointers to them in ustack. It places a null pointer at the end of what will be the
argv list passed to main. The first three entries in ustack are the fake return PC,
argc, and argv pointer.
Exec places an inaccessible page just below the stack page, so that programs that
try to use more than one page will fault. This inaccessible page also allows exec to
deal with arguments that are too large; in that situation, the copyout (2118) function
that exec uses to copy arguments to the stack will notice that the destination page is not accessible, and will return –1.
During the preparation of the new memory image, if exec detects an error like
an invalid program segment, it jumps to the label bad, frees the new image, and returns –1. Exec must wait to free the old image until it is sure that the system call will
succeed: if the old image is gone, the system call cannot return –1 to it. The only error cases in exec happen during the creation of the image. Once the image is complete, exec can install the new image (6701) and free the old one (6702). Finally, exec
returns 0.
Exec loads bytes from the ELF file into memory at addresses specified by the ELF
file. Users or processes can place whatever addresses they want into an ELF file. Thus
exec is risky, because the addresses in the ELF file may refer to the kernel, accidentally
or on purpose. The consequences for an unwary kernel could range from a crash to a
malicious subversion of the kernel’s isolation mechanisms (i.e., a security exploit). xv6
performs a number of checks to avoid these risks. To understand the importance of
these checks, consider what could happen if xv6 didn’t check if(ph.vaddr +
ph.memsz < ph.vaddr). This is a check for whether the sum overflows a 32-bit integer. The danger is that a user could construct an ELF binary with a ph.vaddr that
points into the kernel, and ph.memsz large enough that the sum overflows to 0x1000.
Since the sum is small, it would pass the check if(newsz >= KERNBASE) in allocuvm.
The subsequent call to loaduvm passes ph.vaddr by itself, without adding ph.memsz
and without checking ph.vaddr against KERNBASE, and would thus copy data from the
ELF binary into the kernel. This could be exploited by a user program to run arbitrary user code with kernel privileges. As this example illustrates, argument checking
must be done with great care. It is easy for a kernel developer to omit a crucial check,
and real-world kernels have a long history of missing checks whose absence can be exploited by user programs to obtain kernel privileges. It is likely that xv6 doesn’t do a
complete job of validating user-level data supplied to the kernel, which a malicious
user program might be able to exploit to circumvent xv6’s isolation.
Real world
Like most operating systems, xv6 uses the paging hardware for memory protection and mapping. Most operating systems use x86’s 64-bit paging hardware (which
has four levels of translation). 64-bit address spaces allow for a less restrictive memory
layout than xv6’s; for example, it would be easy to remove xv6’s limit of 2 Gbyte for
physical memory. Most operating systems make far more sophisticated use of paging
than xv6; for example, xv6 lacks demand paging from disk, copy-on-write fork, shared
memory, lazily-allocated pages, and automatically extending stacks. The x86 supports
address translation using segmentation (see Appendix B), but xv6 uses segments only
for the common trick of implementing per-cpu variables such as proc that are at a
fixed address but have different values on different CPUs (see seginit). Implementations of per-CPU (or per-thread) storage on non-segment architectures would dedicate
a register to holding a pointer to the per-CPU data area, but the x86 has so few general registers that the extra effort required to use segmentation is worthwhile.
On machines with lots of memory it might make sense to use the x86’s 4 Mbyte
‘‘super pages.’’ Small pages make sense when physical memory is small, to allow allocation and page-out to disk with fine granularity. For example, if a program uses only
8 Kbyte of memory, giving it a 4 Mbyte physical page is wasteful. Larger pages make
sense on machines with lots of RAM, and may reduce overhead for page-table manipulation. Xv6 uses super pages in one place: the initial page table (1306). The array initialization sets two of the 1024 PDEs, at indices zero and 512 (KERNBASE>>PDXSHIFT),
leaving the other PDEs zero. Xv6 sets the PTE_PS bit in these two PDEs to mark them
as super pages. The kernel also tells the paging hardware to allow super pages by setting the CR_PSE bit (Page Size Extension) in %cr4.
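The declaration described above is essentially the entrypgdir in xv6’s main.c; the relevant constants are inlined here so the fragment stands alone. Entries 0 and 512 both map physical addresses [0, 4 Mbyte): the identity entry is needed while paging is being turned on, and the KERNBASE entry is where the kernel runs afterwards.

```c
#include <stdint.h>

typedef uint32_t pde_t;

#define NPDENTRIES 1024        // PDEs per page directory
#define KERNBASE   0x80000000u // kernel mapped at high addresses
#define PDXSHIFT   22          // offset of the PDX field in a virtual address
#define PTE_P      0x001       // present
#define PTE_W      0x002       // writeable
#define PTE_PS     0x080       // 4 Mbyte super page

// Two super-page PDEs, at indices 0 and KERNBASE>>PDXSHIFT (512);
// the other 1022 entries are left zero.
pde_t entrypgdir[NPDENTRIES] = {
  [0]                    = 0 | PTE_P | PTE_W | PTE_PS,
  [KERNBASE >> PDXSHIFT] = 0 | PTE_P | PTE_W | PTE_PS,
};
```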
Xv6 should determine the actual RAM configuration, instead of assuming 240
MB. On the x86, there are at least three common algorithms: the first is to probe the
physical address space looking for regions that behave like memory, preserving the values written to them; the second is to read the number of kilobytes of memory out of a
known 16-bit location in the PC’s non-volatile RAM; and the third is to look in BIOS
memory for a memory layout table left as part of the multiprocessor tables. Reading
the memory layout table is complicated.
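The second algorithm can be sketched as follows. CMOS registers 0x30/0x31 traditionally hold the low and high bytes of the kilobytes of memory above 1 Mbyte; the port I/O is replaced here by an in-memory stand-in so the fragment runs anywhere, and the function name and stubs are hypothetical.

```c
#include <stdint.h>

uint8_t cmos[256];        // stand-in for the CMOS contents
static uint8_t selected;  // register selected by the last outb

// Stubs standing in for real port I/O to ports 0x70/0x71.
static void outb(uint16_t port, uint8_t val) { (void)port; selected = val; }
static uint8_t inb(uint16_t port) { (void)port; return cmos[selected]; }

// Return the bytes of memory above 1 Mbyte as recorded in CMOS
// registers 0x30 (low byte) and 0x31 (high byte), in kilobytes.
uint32_t cmos_ext_mem_bytes(void) {
  outb(0x70, 0x30);
  uint32_t lo = inb(0x71);
  outb(0x70, 0x31);
  uint32_t hi = inb(0x71);
  return ((hi << 8) | lo) * 1024;
}
```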
Memory allocation was a hot topic a long time ago, the basic problems being efficient use of limited memory and preparing for unknown future requests; see Knuth.
Today people care more about speed than space-efficiency. In addition, a more elaborate kernel would likely allocate many different sizes of small blocks, rather than (as in
xv6) just 4096-byte blocks; a real kernel allocator would need to handle small allocations as well as large ones.
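xv6’s fixed-size allocator is worth seeing in miniature: free pages are threaded into a list through their own first bytes, so the allocator needs no bookkeeping memory of its own. This condensed sketch follows kalloc.c but omits the spinlock the real code holds.

```c
#include <stddef.h>

#define PGSIZE 4096

// A free page's first bytes are reused as the list link,
// as in xv6's kalloc.c.
struct run { struct run *next; };

static struct run *freelist;

// Return a page to the front of the free list.
void kfree(void *page) {
  struct run *r = (struct run *)page;
  r->next = freelist;
  freelist = r;
}

// Pop a page off the free list, or return NULL if none remain.
void *kalloc(void) {
  struct run *r = freelist;
  if (r)
    freelist = r->next;
  return r;
}
```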
Exercises
1. Look at real operating systems to see how they size memory.
2. If xv6 had not used super pages, what would be the right declaration for entrypgdir?
3. Write a user program that grows its address space by 1 byte by calling
sbrk(1). Run the program and investigate the page table for the program before the
call to sbrk and after the call to sbrk. How much space has the kernel allocated?
What does the pte for the new memory contain?
4. Modify xv6 so that the pages for the kernel are shared among processes, which
reduces memory consumption.
5. Unix implementations of exec traditionally include special handling for shell
scripts. If the file to execute begins with the text #!, then the first line is taken to be a
program to run to interpret the file. For example, if exec is called to run myprog
arg1 and myprog’s first line is #!/interp, then exec runs /interp with command
line /interp myprog arg1. Implement support for this convention in xv6.
6. Delete the check if(ph.vaddr + ph.memsz < ph.vaddr) in exec.c, and construct a user program that exploits the absence of that check.
7. How would you improve xv6’s memory layout if xv6 were running on a 64-bit
processor?
Chapter 3
Traps, interrupts, and drivers
When running a process, a CPU executes the normal processor loop: read an instruction, advance the program counter, execute the instruction, repeat. But there are
events on which control from a user program must transfer back to the kernel instead
of executing the next instruction. These events include a device signaling that it wants
attention, a user program doing something illegal (e.g., referencing a virtual address for
which there is no page table entry), or a user program asking the kernel for a service
with a system call. There are three main challenges in handling these events: 1) the
kernel must arrange that a processor switches from user mode to kernel mode (and
back); 2) the kernel and devices must coordinate their parallel activities; and 3) the
kernel must understand the interface of the devices. Addressing these three challenges requires detailed understanding of hardware and careful programming, and can result in
opaque kernel code. This chapter explains how xv6 addresses these three challenges.
System calls, exceptions, and interrupts
There are three cases when control must be transferred from a user program to the
kernel. First, a system call: when a user program asks for an operating system service,
as we saw at the end of the last chapter. Second, an exception: when a program performs an illegal action, such as dividing by zero or accessing memory through a page-table entry that is not present. Third, an interrupt: when a device generates a signal to indicate that it needs attention from the operating system. For example, a clock chip may generate an interrupt every 100 msec
to allow the kernel to implement time sharing. As another example, when the disk has
read a block from disk, it generates an interrupt to alert the operating system that the
block is ready to be retrieved.
The kernel handles all interrupts, rather than processes handling them, because in
most cases only the kernel has the required privilege and state. For example, in order
to time-slice among processes in response to clock interrupts, the kernel must be involved, if only to force uncooperative processes to yield the processor.
In all three cases, the operating system design must arrange for the following to
happen. The system must save the processor’s registers for future transparent resume.
The system must be set up for execution in the kernel. The system must choose a place
for the kernel to start executing. The kernel must be able to retrieve information about
the event, e.g., system call arguments. It must all be done securely; the system must
maintain isolation of user processes and the kernel.
To achieve this goal the operating system must be aware of the details of how the
hardware handles system calls, exceptions, and interrupts. In most processors these
three events are handled by a single hardware mechanism. For example, on the x86, a
program invokes a system call by generating an interrupt using the int instruction.
Similarly, exceptions generate an interrupt too. Thus, if the operating system has a
plan for interrupt handling, then the operating system can handle system calls and exceptions too.
The basic plan is as follows. An interrupt stops the normal processor loop and
starts executing a new sequence called an interrupt handler. Before starting the interrupt handler, the processor saves its registers, so that the operating system can restore them when it returns from the interrupt. A challenge in the transition to and
from the interrupt handler is that the processor should switch from user mode to kernel mode, and back.
A word on terminology: Although the official x86 term is interrupt, xv6 refers to
all of these as traps, largely because it was the term used by the PDP11/40 and therefore is the conventional Unix term. This chapter uses the terms trap and interrupt interchangeably, but it is important to remember that traps are caused by the current
process running on a processor (e.g., the process makes a system call and as a result
generates a trap), and interrupts are caused by devices and may not be related to the
currently running process. For example, a disk may generate an interrupt when it is
done retrieving a block for one process, but at the time of the interrupt some other
process may be running. This property of interrupts makes thinking about interrupts
more difficult than thinking about traps, because interrupts happen concurrently with
other activities. Both rely, however, on the same hardware mechanism to transfer control between user and kernel mode securely, which we will discuss next.
X86 protection
The x86 has 4 protection levels, numbered 0 (most privilege) to 3 (least privilege).
In practice, most operating systems use only 2 levels: 0 and 3, which are then called
kernel mode and user mode, respectively. The current privilege level with which the
x86 executes instructions is stored in the %cs register, in the field CPL.
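Concretely, CPL occupies the low two bits of the selector held in %cs, so kernel mode (0) and user mode (3) can be told apart with a mask. The selector values used below (0x08 for kernel code, 0x1B for user code) are illustrative examples, not a claim about any particular GDT layout.

```c
#include <stdint.h>

// The bottom two bits of a segment selector give the privilege level
// at which the processor is executing.
int cpl(uint16_t cs) { return cs & 3; }
```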
On the x86, interrupt handlers are defined in the interrupt descriptor table (IDT).
The IDT has 256 entries, each giving the %cs and %eip to be used when handling the
corresponding interrupt.
To make a system call on the x86, a program invokes the int n instruction, where
n specifies the index into the IDT. The int instruction performs the following steps:
• Fetch the n’th descriptor from the IDT, where n is the argument of int.
• Check that CPL in %cs is ≤ DPL, the privilege level in the descriptor.

Figure 3-2. The trapframe on the kernel stack
%gs, and the general-purpose registers (3305-3310). The result of this effort is that the kernel stack now contains a struct trapframe (0602) containing the processor register...