Operating systems: part 1

Review of macros

Single-line macros: %define or %assign for arithmetic “variable-like” macros.

Multi-line macros: %macro/%endmacro

Conditions: %if/%elif/%else/%endif

Repetition: %rep/%endrep

A sample macro: building a WHILE/ENDWHILE loop:

WHILE rax, ge, 0
    ... ; loop body
ENDWHILE

Because the WHILE and ENDWHILE need to communicate with each other (ENDWHILE needs to know the label to jump to), we will have to use a context scope:

%macro WHILE 3
    %push while
    %$while:
    cmp %1, %3
    j%-2 %$endwhile
%endmacro

%macro ENDWHILE 0
    jmp %$while
    %$endwhile:
    %pop
%endmacro

A smarter ENDWHILE would check to make sure we are actually in a WHILE before popping the context stack.

Basics of building an operating system

We’ll spend the last two weeks of class hopefully working up to the point where we can boot our own “operating system” on a raw virtual machine. We’ll use QEMU for the virtual machine (it’s installed on the server, and it can communicate with GDB if we need to debug our code).

Boot process

There are two processes by which the system “boots” (i.e., transfers control from the hard-coded functionality on the system chips to the code defined on some storage medium), both of which involve running some code on a storage device (hard-drive, flash-drive, DVD, etc.):

MBR: The traditional way, in which the first 512 bytes of the booting storage device are dedicated to the master boot record, which should contain code for finding and running the rest of the operating system. This system was very limited: each drive could have only four partitions, and the 512-byte size limit on boot records meant that the boot code had to be as simple as possible. On the other hand, because the system effectively just loads some code from straight off the disk and starts running it, this is the easiest for us to write.

(MBR is sometimes used as the name of the disk-formatting scheme as well, as the two were indistinguishable.)

EFI: The newer, more flexible method, in which a special partition is created on the disk to contain information about all other “bootable” parts of the disk. One of these is selected as the default, and the code for starting it resides in the EFI partition. This makes managing the boot process much simpler and more flexible for end users, but is more complex for us. EFI also requires that disks be formatted using a different scheme, GPT.

On the other hand, while an MBR bootloader has access to a small, ill-defined set of functionality via the BIOS, EFI is standardized to provide a lot more information about the system and access to its resources. A modern OS (one written after EFI) could take advantage of this to query the system for the installed hardware and configure things before the OS kernel itself even starts booting, thus simplifying the kernel design.

We’ll stick with MBR, as that is easier for us.

MBR

An MBR-formatted disk has the first 512 bytes of the disk devoted to the master boot record. The MBR contains both the partition table, defining how the disk is divided up into up to four partitions, and the boot code, which is executed on system startup. Each partition, in turn, if it is marked as bootable, can have its own boot record, containing its own boot code. The typical behavior of the MBR is just to find the first bootable partition and then load and run its boot code, but it’s possible to do fancier things like display a menu of bootable partitions and such.

For our purposes, the boot code in the MBR will be the “operating system”; i.e., we’ll write the code we want to run directly into the boot code, rather than using a generic boot loader and writing our code into a partition’s boot record. This means that our “operating system” will be the only operating system allowed on the disk.

The MBR is limited to 512 bytes, but these 512 bytes include, just before the end, the partition table, defining what partitions are on the disk. We’ll leave this table blank (filled with 0s), as it won’t really matter for us, but you should be aware that it’s there. So technically we only have 440 bytes available in the MBR for our code.

Bootloader	0-439 (440 bytes)
Disk ID	440-443 (4 bytes)
reserved, must be 0	444-445 (2 bytes)
1st partition entry	446-461 (16 bytes)
2nd partition entry	462-477 (16 bytes)
3rd partition entry	478-493 (16 bytes)
4th partition entry	494-509 (16 bytes)
Signature, must be 0xaa55	510-511 (2 bytes)
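
To make the layout concrete, here is a small Python sketch that assembles a 512-byte sector following the table above. The helper name build_mbr and its parameters are made up for illustration; it only checks the arithmetic of the layout and is not part of the toolchain we use in class.

```python
# Hypothetical helper: lay out a 512-byte MBR following the table above.
BOOT_CODE_SIZE = 440   # bootloader occupies bytes 0-439
SECTOR_SIZE = 512


def build_mbr(boot_code: bytes, disk_id: int = 0) -> bytes:
    """Return a 512-byte sector with a blank (all-zero) partition table."""
    if len(boot_code) > BOOT_CODE_SIZE:
        raise ValueError("boot code does not fit in 440 bytes")
    sector = bytearray(SECTOR_SIZE)                    # everything starts as 0
    sector[0:len(boot_code)] = boot_code               # bytes 0-439
    sector[440:444] = disk_id.to_bytes(4, "little")    # bytes 440-443
    # Bytes 444-445 (reserved) and 446-509 (partition entries) stay 0.
    sector[510:512] = (0xAA55).to_bytes(2, "little")   # stored as 55 AA
    return bytes(sector)


# EB FE is the machine code for a jump-to-self, i.e. an infinite loop.
mbr = build_mbr(b"\xeb\xfe")
print(len(mbr), mbr[510:512].hex())   # prints: 512 55aa
```

Note that the signature word 0xaa55 is stored little-endian, so byte 510 is 0x55 and byte 511 is 0xaa.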

16-bit real mode

When the system starts up, it is running in 16-bit “real mode” for compatibility with older software. Although eventually we will (hopefully) transition to 64-bit mode so we can do the things we expect, for now, it will be easier for us to adapt to working in 16-bit mode. Also, while in 16-bit mode, we have access to the BIOS, a built-in set of utility operations which allow us to perform input/output relatively easily. After we switch to 64-bit mode, the BIOS is unavailable, so communicating with the user becomes much more difficult and complex.

In 16-bit mode, although we have access to the 32-bit registers, the offsets used in memory addresses are 16 bits. I.e., on their own they can only reach 64KB of memory! That’s clearly not what we want, so 16-bit mode makes heavy use of segmentation. Segmented memory is enabled on startup; we don’t have to do anything to turn it on.

Memory locations are of the form SEGMENT:ADDRESS and the effective address is computed as SEGMENT * 0x10 + ADDRESS. E.g.,

mov word [es:si], ax

will move the word currently in ax into the memory location es * 0x10 + si. es is one of the segment registers; in a memory operand, the segment part must be one of the segment registers (a constant segment can only be written directly in a far jmp or call). Most instructions will use a default segment: e.g., mov defaults to ds, the data segment (unless the address is based on bp, in which case the default is ss), so

mov word [si], ax

is actually equivalent to

mov word [ds:si], ax

Every memory access involves a segment register: either one named explicitly, or the default for that instruction.

If both SEGMENT and ADDRESS are limited to 16 bits, how much memory can we access? 64KB × 0x10 = 1MB. Note that addresses in a 1MB address space require 20 bits to be represented; the original memory controller for x86 only had 20 lines. It is possible to form addresses slightly higher than 1MB (the largest is 0xFFFF:0xFFFF = 0x10FFEF); however, the original behavior was to wrap these addresses around to the bottom of memory, so this is still what many PCs do by default. To access memory above 1MB, the A20 line must be enabled.
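
The address calculation and the wrap-around can be checked with a few lines of Python. This is only a model of the arithmetic; the function name physical_address and the a20_enabled flag are invented for this sketch.

```python
def physical_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Model of the real-mode calculation SEGMENT * 0x10 + ADDRESS."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    linear = segment * 0x10 + offset
    if not a20_enabled:
        linear &= 0xFFFFF   # only address lines 0-19 exist: wrap at 1MB
    return linear


print(hex(physical_address(0x07C0, 0x0000)))                    # 0x7c00
print(hex(physical_address(0xFFFF, 0xFFFF, a20_enabled=True)))  # 0x10ffef
print(hex(physical_address(0xFFFF, 0xFFFF)))                    # 0xffef
```

With A20 disabled, the highest possible segment:address pair lands back near the bottom of memory rather than just above the 1MB mark.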

Access to data/code in the current segment is usually called near, while access to data/code in a segment other than the current one is called far. The latter requires loading the relevant segment register first, and thus is slower. E.g., some older versions of C made a distinction between “near” pointers (pointers to data in the current segment) and “far” pointers (a pointer to a different segment). Note that the two kinds of pointers have completely different representations! A near pointer is just an unsigned 16-bit value, but a far pointer must store both the segment and the address, and thus must be 32 bits. (Even worse, consider what happens when p == q where p and q are pointers to different, but possibly overlapping, segments.)

Segment registers

The segment registers are

CS	Code segment (used by `jmp`)
DS	Data segment (used by `mov`)
SS	Stack segment (used by `push`)
ES	Extra segment (used by string ops)
FS	General-purpose segment
GS	General-purpose segment

The string operations which use both si and di implicitly use both the data and extra segments: ds:si and es:di.

To write a value into a segment register, we have to first move it into a general-purpose register, and then into the segment register:

mov ax, 0x1000
mov ds, ax

This means that a “far” memory access requires three instructions instead of just one for a “near” access:

mov ax, 0x1000
mov fs, ax
mov dword [fs:addr], ebx

Segment registers can also be pushed onto and popped from the stack. The cs register cannot be directly modified, as it controls where the currently-executing program is located in memory; i.e., the instruction pointer is effectively cs:ip. It is changed implicitly by a “far jmp” (a jump of the form jmp SEGMENT:ADDRESS), a far call, or a far ret. Because the segment registers have such a big effect on memory access, they should all be treated as callee-preserved: a function that needs to change one should push it first and pop it again before returning.

Some things to note:

Segments can overlap! This happens because the calculation for effective address is 0x10 * segment + offset.
The values in the segment registers are independent of each other; a change to one has no effect on the other.
In order for the stack to work, the stack segment register ss must be set up in such a way that the stack doesn’t overlap with the other segments.
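
A short Python sketch of the overlap problem: two different SEGMENT:ADDRESS pairs can name the same byte, which is why comparing far pointers field by field is unreliable. The helper names linear and normalize are made up here, and the sketch assumes addresses below 1MB.

```python
def linear(segment: int, offset: int) -> int:
    """Effective address of a SEGMENT:ADDRESS pair."""
    return segment * 0x10 + offset


def normalize(segment: int, offset: int) -> tuple:
    """Rewrite a far pointer so that its offset is in the range 0-15."""
    address = linear(segment, offset)
    if address > 0xFFFFF:
        raise ValueError("sketch only handles addresses below 1MB")
    return address >> 4, address & 0xF


p = (0x07C0, 0x0000)   # 07C0:0000
q = (0x0000, 0x7C00)   # 0000:7C00

print(p == q)                           # False: the representations differ
print(linear(*p) == linear(*q))         # True: same physical byte
print(normalize(*p) == normalize(*q))   # True: both become 07C0:0000
```

Normalizing both pointers before comparing them is one way to make equality meaningful, at the cost of extra arithmetic on every comparison.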

32-bit mode

In 32-bit mode, without paging enabled, the segment part of each address is no longer just multiplied by 0x10, but rather it is treated as an index into a table of segment descriptors, called the global descriptor table. The GDT stores, for each segment, its starting address and length, as well as a few other bits of information.

Of course, in 32-bit mode, addresses are already 32 bits, so the easiest thing to do is to fill the GDT with descriptors for segments that start at address 0 and cover the entire address space, and then point every segment register at one of them, thus making every logical address map directly to the same physical address. (Entry 0 of the GDT is reserved as the “null descriptor” and cannot be used for memory access, so these descriptors go in entries 1 and up.) This is called “32-bit flat mode”.

Memory map

The 1MB of memory that is traditionally available to us in 16-bit mode can be mapped out as:

start	end	size	type	description
Low Memory (the first MiB)
0x00000000	0x000003FF	1 KiB	RAM - partially unusable	Real Mode IVT (Interrupt Vector Table)
0x00000400	0x000004FF	256 bytes	RAM - partially unusable	BDA (BIOS data area)
0x00000500	0x00007BFF	almost 30 KiB	RAM (guaranteed free for use)	Conventional memory
0x00007C00 (typical location)	0x00007DFF	512 bytes	RAM - partially unusable	Your OS BootSector
0x00007E00	0x0007FFFF	480.5 KiB	RAM (guaranteed free for use)	Conventional memory
0x00080000	0x0009FFFF	128 KiB	RAM - partially unusable	EBDA (Extended BIOS Data Area)
0x000A0000	0x000FFFFF	384 KiB	various (unusable)	Video memory, ROM Area

The BIOS Data Area and Extended BIOS Data Area are essentially “RAM for the BIOS”. As the BIOS has functions and global variables of its own, it needs a place to put its stack and data segments. A few bytes of the BDA are standardized and useful:

0x040E – Address of the EBDA, stored as a segment, i.e., shifted right by 4 bits (so we know where not to touch)
0x041E – 32 bytes of keyboard input buffer
0x0475 – Number of hard drives

Input/Output

x86 systems support several methods for performing input/output:

Ports: the CPU has a set of ports, with dedicated instructions for reading from (in) and writing to (out) a specific port. Ports can be connected to specific I/O devices. Often, a single device will have multiple ports: a port for sending controls (i.e., instructions to start/stop operation), a port for sending data, a port for reading status, and sometimes a separate port for receiving data.

Controlling the keyboard is done via a port (not receiving keypress data; that happens via interrupts). Example: to turn on the Num-Lock light, we can do:
```
 wait1:
     in al,0x64             ; Read kbd status register
     test al,2              ; Status of input buffer (0 = empty)
     jz ok1                 ; Jump if empty
     jmp wait1              ; Otherwise loop

 ok1:
     mov al,0xed            ; Command = set LEDs
     out 0x60,al            ; Send command

 wait2:
     in al,0x64             ; Read status
     test al,2              ; Empty input?
     jz ok2                 ; Jump if empty
     jmp wait2                 

 ok2:
     mov al,0x02            ; Command data = turn on numlock (bit 1)
     out 0x60,al            ; Send command data
```
The loops wait1 and wait2 are necessary because the keyboard controller may be doing other things, so we need to wait for it to be ready (input buffer empty) before sending it commands. This pattern is fairly common when using port I/O: because the devices we are communicating with are external to the CPU, they are not necessarily available all the time, and hence we may have to wait for them to be ready.

Memory-mapped I/O: some I/O devices are connected to the memory bus, intercepting memory requests for specific addresses. Reads from these addresses return not the values stored on the memory chips, but the values provided by the device. Similarly, writes to these addresses are not necessarily stored in memory, but are sent to the device.

Video RAM is typically memory-mapped; there is a specific range of (physical) addresses which is mapped to the characters or pixels currently displayed on the screen. A write to one of these addresses instantly updates the screen. For example, later we’ll use this (essentially) to write text to the screen:
```
 mov byte [0xb8000], 'H'
```
In color text modes, video memory is mapped starting at address 0xb8000 and holds the characters displayed; in graphics modes, video memory is instead mapped to the pixels on the display (the standard VGA graphics modes use the window starting at 0xa0000).

Interrupts: An interrupt is a system-wide “hook” which can be triggered by our code, or by an I/O event (though usually not both on the same interrupt). Like a syscall, an interrupt is essentially a call to code located somewhere else, but all other aspects of the system (register values, stack, etc.) remain the same; thus, registers are used to communicate with the interrupt code (called its handler).

Interrupts

At its most basic level, the interrupt system is just an array of 256 pointers to functions (dwords, because they include both the segment and offset), stored starting at memory address 0x0. When interrupt number n occurs, the system looks up the entry at index n in the table and calls its function. Some of these functions are provided for us, by the BIOS, and are essential to the operation of the system. Others are just do-nothing stub functions, intended to be replaced by pointers to our functions, so that we can control what happens when the specified interrupt occurs.

Interrupts can be divided into three groups:

Exceptions are generated by the CPU in response to exceptional situations (segment violation, floating-point exception, etc.).
Interrupt requests (IRQs or hardware interrupts) are generated by hardware devices, to inform the CPU and the code running on it that something has happened. The CPU has a number of lines which, when triggered, cause specific interrupt numbers to be fired. I/O devices are wired to specific lines, so that an event on an I/O device fires an interrupt.

Some hardware interrupts fire a specific interrupt with no further information – if further information is available, we may have to (e.g.) query it from a port. Others will write information about the interrupt to a specific memory location and then fire the interrupt, so that the information is available directly to the interrupt-handler. This second method is called “message-based interrupts”.
Software interrupts are triggered by our code, intentionally, to request some service from the operating system or hardware. These are equivalent to syscalls; indeed, the syscall instruction replaces the old interrupt 0x80 used to communicate with the OS.

Often software interrupts use the values in the registers ax through dx to communicate parameters to the interrupt. It’s common for ah to specify a subfunction, allowing many different operations to be mapped to the same interrupt number. For example, all display-related functions are mapped to interrupt 0x10, with the subfunction in ah specifying the exact operation: change display mode, write text, move cursor, etc. Interrupt 0x13 contains disk-related functions, and so forth.

Software interrupts are also how we communicate with the BIOS, in 16-bit mode. BIOS wraps the low-level I/O devices (keyboard, display) in a set of convenience operations, so that we can print to the screen and read input from the keyboard in an easier way. On the other hand, these facilities are only available in 16-bit real mode; once we switch to either 32- or 64-bit mode, they become unavailable. It’s possible to switch back to 16-bit mode to call a BIOS routine, and then restore the 64-bit mode after it’s done, but this is so slow and cumbersome that it’s not usually worth the effort.

For a few operations, the real-mode interrupt is the only way to access the desired functionality, so we have to switch to real-mode and back. For example, the interrupt to switch video modes, 0x10, is only available via real mode, and is the only way to switch video modes.

The interrupts are mapped to the code that handles them through the interrupt vector table. The IVT in 16-bit mode is stored in the first 256×4 bytes of memory, starting at physical address 0. Entry n is a dword holding the address of the handler for interrupt n, a procedure which the processor runs automatically when interrupt n occurs. The addresses are always stored as SEG:ADDR pairs, with the offset in the low word and the segment in the high; the effective address is computed as 0x10 * SEG + ADDR. Some interrupts do double-duty: fired both when an event occurs, but also usable by our code to call a BIOS function. Many interrupts have subfunctions, specified by the value in ah or other registers.
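
The table lookup can be modelled in a few lines of Python. The function names and the sample entry bytes below are made up for illustration; real entries depend on the BIOS.

```python
import struct


def ivt_entry_address(n: int) -> int:
    """Physical address of the IVT entry for interrupt n (4 bytes each)."""
    if not 0 <= n <= 255:
        raise ValueError("interrupt numbers run from 0 to 255")
    return 4 * n


def decode_ivt_entry(raw: bytes) -> tuple:
    """Split a 4-byte entry into (segment, offset, effective address)."""
    offset, segment = struct.unpack("<HH", raw)   # offset is the low word
    return segment, offset, 0x10 * segment + offset


# Hypothetical entry contents, as they would appear in memory.
raw = bytes([0x53, 0xFF, 0x00, 0xF0])
seg, off, phys = decode_ivt_entry(raw)
print(hex(ivt_entry_address(0x10)), f"{seg:04X}:{off:04X}", hex(phys))
# prints: 0x40 F000:FF53 0xfff53
```

So the handler pointer for interrupt 0x10 lives at physical address 0x40, and installing our own handler amounts to overwriting those four bytes.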

In 32- and 64-bit modes, the IVT is called the interrupt descriptor table. Each entry is more than just an address, and the location of the table is not fixed at address 0x0, but specified via the idtr register (loaded with the lidt instruction).

For a reference to every interrupt ever (and a website straight outta 1997), see Interrupt Jump Table.

A simple bootloader

We’ll begin by writing a simple bootloader program which will just display Hello, world! on the screen and then go into an infinite loop, effectively hanging the (virtual!) machine.

Structure of a bootloader

A bootloader must be exactly 512 bytes long, and must end with the word value 0xAA55, which tells the BIOS that this is a valid (bootable) record. The BIOS will load the entire 512 bytes into memory, starting at address 0x7c00.

Assembling the bootloader

We will have to assemble the bootloader manually, not using asm, as we do not want to generate an object file. Instead, we want a raw .bin file containing nothing but the assembled machine code. We also have to inform the assembler of the starting address of our code, via the org directive (remember that the system will load the 512 bytes of the bootloader into memory at address 0x7c00):

;; 
;; hello-boot.s
;;
bits 16
org 0x7c00

; Boot code begins here
; ...

; Hang system
forever:    jmp forever

; Pad remainder with 0 bytes
times 510 - ($ - $$)    db 0

; Write boot signature at end
dw 0xaa55

Note that while we can create a .s file which results in a binary larger than 512 bytes, and we can create a disk image containing this binary, the system will only load the first 512 bytes into memory; if we want to load any additional code from disk, we have to do it manually.

The bits directive tells YASM to generate 16-bit code. This is not strictly necessary, as it will default to generating 16-bit code when outputting a bin file, but helps to make it obvious what we’re doing.

To assemble this we run

yasm hello-boot.s -f bin -o boot.bin

Writing to the screen: memory-mapped IO

To write Hello, world! to the screen, we’ll embed the string in our code, somewhere where it won’t be interpreted as instructions, and then copy it to the memory-mapped display address. There are no sections in this program, so any data has to be placed within the program, and the program set up so that execution never reaches the data.

In text mode, the display memory starts at address 0xB8000. Each character cell consists of two bytes: the first is the character, and the second is its attributes (foreground and background color). When we copy the string to the screen, we have to copy the characters to the character bytes, and skip over the attribute bytes.
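
The cell layout can be sketched in Python as follows. The names cell_address and render are hypothetical, and the attribute value 0x07 (light grey on black) is an assumed default rather than something the code in these notes sets.

```python
VIDEO_BASE = 0xB8000   # start of text-mode display memory
COLUMNS = 80           # 80x25 text mode


def cell_address(row: int, col: int) -> int:
    """Address of the character byte for the cell at (row, col)."""
    return VIDEO_BASE + 2 * (row * COLUMNS + col)


def render(text: str, attribute: int = 0x07) -> bytes:
    """Bytes to store in display memory: character, attribute, character, ..."""
    out = bytearray()
    for ch in text.encode("ascii"):
        out += bytes([ch, attribute])
    return bytes(out)


print(hex(cell_address(0, 1)))   # 0xb8002
print(hex(cell_address(1, 0)))   # 0xb80a0
print(render("Hi").hex())        # 48076907
```

Each character advances the address by 2, which is why the copy loop has to step the screen index twice as fast as the string index.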

In addition, in 16-bit mode, the allowed registers that can be used as indexes in memory operands are very restricted: only bx, bp, si, and di can be used, and no scale is allowed.

; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10

; Make sure ds = 0, as org 0x7c00 assumes (the BIOS does not guarantee it)
mov ax, 0
mov ds, ax

; Setup fs segment to access video RAM
mov ax, screen_addr / 16
mov fs, ax

; Print text
mov cx, strlen     ; Loop counter
mov bx, 0          ; String index
mov si, 0          ; Memory index
print:
    mov al, byte [bx + string]   ; Load current char. (al, so bx is not clobbered)
    mov byte [fs:si], al         ; Print to screen
    inc bx
    add si, 2                    ; Skip over the attribute byte
loop print

; Infinite loop
forever: jmp forever

; Data: placed after the infinite loop, so it is never executed
string:         db      "Hello, world!"
strlen:         equ     $-string
screen_addr:    equ     0xb8000

; Padding bytes and signature...

To be on the safe side, we specifically set the video mode to text mode. Most BIOSes will do this for us, but a few will not.

Note that because the bin file format does not support sections, there is no .data section: we simply place the string directly in the executable itself, placing it after the infinite loop so that it will not be executed.

Assembly, disk image, run in QEMU

To assemble this we run

yasm hello-boot.s -f bin -o boot.bin

This creates the pure binary file boot.bin. From this, we will have to create a disk image so that the emulator can boot it.

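dd if=/dev/zero of=boot.dsk bs=512 count=2880
dd if=boot.bin of=boot.dsk conv=notrunc

The first command creates a blank disk image the size of a traditional 1.44MB floppy disk (2880 sectors of 512 bytes); the second copies our boot sector over the start of the image without truncating the rest. We’ll have to repeat the second step every time we modify our bootloader’s code.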
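
For reference, the same disk-image construction can be written as a short Python sketch. The function name make_disk_image is invented, and the demo boot sector is just a zero-filled stand-in for the real boot.bin.

```python
SECTOR_SIZE = 512
FLOPPY_SECTORS = 2880   # 2880 * 512 bytes = 1.44MB floppy


def make_disk_image(boot_sector: bytes, sectors: int = FLOPPY_SECTORS) -> bytes:
    """Return a zero-filled disk image that starts with boot_sector."""
    if len(boot_sector) != SECTOR_SIZE:
        raise ValueError("boot sector must be exactly 512 bytes")
    if boot_sector[510:512] != b"\x55\xaa":
        raise ValueError("boot sector is missing the 0xaa55 signature")
    return boot_sector + bytes(SECTOR_SIZE * (sectors - 1))


# Stand-in for the contents of boot.bin: all zeros plus the signature.
demo_sector = bytes(510) + b"\x55\xaa"
image = make_disk_image(demo_sector)
print(len(image))   # prints: 1474560
```

To use it on the real file, one would read boot.bin as bytes, pass it to make_disk_image, and write the result to boot.dsk.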

Finally, we can start the emulator with the specified disk:

qemu-system-i386 -curses -drive format=raw,file=boot.dsk

(Note that this is actually loading our disk as a hard drive disk image, rather than as a floppy disk. It doesn’t make a difference right now, as they both have the same MBR structure, but later, when we want to read more from the disk than just the bootloader, we’ll need to know where we were booted from.)

To quit (because the emulator is intercepting all our key presses!) press Esc followed by 2, to switch to the QEMU console, and then type quit.

Writing to the screen, method 2: BIOS

Instead of writing directly to video memory, we can also use BIOS calls to write one character at a time. Most of the BIOS calls for dealing with video are via interrupt 0x10, the same interrupt we used above to set the video mode.

All of the interrupts we want to use will use ah for the subfunction, bh as the page number (for us this should always be 0), as well as cx and dx for various things. We’ll have to do a bit more work to make this fit with our registers.

To write a single character to the screen, we invoke interrupt 0x10, subfunction ah = 0x0a. al should be the character to display. Note that this doesn’t move the cursor! To set the cursor position, we have to use subfunction ah = 0x02, with the cursor row/column in dh:dl.

To write the character in al on the screen, at the current cursor location:

mov ah, 0x0a
mov al, character...
mov cx, 1 
int 0x10

We have to keep track of the cursor position ourselves, but fortunately bx (the current index into the string) can do double-duty as the cursor position.

mov ah, 0x02    ; Subfunction = set cursor pos.
mov dh, 0       ; Cursor row = 0
mov dl, bl      ; Cursor col = string index
int 0x10

Shuffling a few registers around gives us

bits 16
org 0x7c00

; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10

; Make sure ds = 0, as org 0x7c00 assumes (the BIOS does not guarantee it)
mov ax, 0
mov ds, ax

; Print text
mov si, 0          ; Memory index/cursor position
print:
    ; Print character
    mov ah, 0x0a    ; Subfunction = write char
    mov al, byte [si + string]
    mov bh, 0       ; Page = 0
    mov cx, 1       ; Write count = 1
    int 0x10

    ; Move cursor
    inc si
    mov ah, 0x02    ; Subfunction = set cursor pos.
    mov bh, 0       ; Page = 0
    mov dx, si      ; Cursor col (dl) = si
    mov dh, 0       ; Cursor row = 0
    int 0x10

    cmp si, strlen
    jne print

; Infinite loop
forever: jmp forever

; Data: placed after the infinite loop, so it is never executed
string:         db      "Hello, world!"
strlen:         equ     $-string
screen_addr:    equ     0xb8000

; Pad remainder with 0 bytes
times 510 - ($ - $$)    db 0

; Write boot signature at end
dw 0xaa55

You may note a difference between this and the previous example: the cursor in this example is left at the end of the string printed, while in the direct-memory-write example it is left at the beginning, because writing directly to memory has no effect on the cursor position.

Connecting a debugger

It’s possible to connect a debugger (GDB) to the virtual machine, so we can debug our code. To do this, you’ll have to choose a port which no one else on the server is using. I’d suggest picking a random port between 9000 and 10000.

Start the emulator with

qemu-system-i386 -S -gdb tcp::9XXX -curses -drive format=raw,file=boot.dsk

(Replace XXX with your chosen port.) This will start the emulator stopped, not running, waiting for a connection from GDB. Then, open another SSH session to the server, run GDB (gdb) and type

target remote localhost:9XXX

using the same port number as above. From there, you can set breakpoints (but only at numerical addresses), use continue to resume the boot process, etc.