Review of macros
Single-line macros: %define, or %assign for arithmetic “variable-like” macros.
Multi-line macros: %macro/%endmacro
Conditions: %if/%elif/%else/%endif
Repetition: %rep/%endrep
A sample macro: building a WHILE/ENDWHILE loop:
WHILE rax, ge, 0
... ; loop body
ENDWHILE
Because the WHILE and ENDWHILE need to communicate with each other (ENDWHILE needs to know the label to jump to), we will have to use a context scope:
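%macro WHILE 3
%push while
%$while:
cmp %1, %3 ; compare the two operands
j%-2 %$endwhile ; leave the loop if the condition is false (inverted condition code)
%endmacro
%macro ENDWHILE 0
jmp %$while ; back to the test at the top of the loop
%$endwhile:
%pop
%endmacro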
A smarter ENDWHILE would check (for example, with %ifctx while) to make sure we are actually in a WHILE before popping the context stack.
Basics of building an operating system
We’ll spend the last two weeks of class hopefully working up to the point where we can boot our own “operating system” on a raw virtual machine. We’ll use QEMU for the virtual machine (it’s installed on the server, and it can communicate with GDB if we need to debug our code).
Boot process
There are two processes by which the system “boots” (i.e., transfers control from the hard-coded functionality on the system chips to the code defined on some storage medium), both of which involve running some code on a storage device (hard-drive, flash-drive, DVD, etc.):
MBR: The traditional way, in which the first 512 bytes of the booting storage device are dedicated to the master boot record, which should contain code for finding and running the rest of the operating system. This system was very limited: each drive could have only four partitions, and the 512-byte size limit on boot records meant that the boot code had to be as simple as possible. On the other hand, because the system effectively just loads some code from straight off the disk and starts running it, this is the easiest for us to write.
(MBR is sometimes used as the name of the disk-formatting scheme as well, as the two were indistinguishable.)
EFI: The newer, more flexible method, in which a special partition is created on the disk to contain information about all other “bootable” parts of the disk. One of these is selected as the default, and the code for starting it resides in the EFI partition. This makes managing the boot process much simpler and more flexible for end users, but is more complex for us. EFI also requires that disks be formatted using a different scheme, GPT.
On the other hand, while an MBR bootloader has access to a small, ill-defined set of functionality via the BIOS, EFI is standardized to provide a lot more information about the system and access to its resources. A modern OS (one written after EFI) could take advantage of this to query the system for the installed hardware and configure things before the OS kernel itself even starts booting, thus simplifying the kernel design.
We’ll stick with MBR, as that is easier for us.
MBR
An MBR-formatted disk has the first 512 bytes of the disk devoted to the master boot record. The MBR contains both the partition table, defining how the disk is divided up into up to four partitions, and the boot code, which is executed on system startup. Each partition, in turn, if it is marked as bootable, can have its own boot record, containing its own boot code. The typical behavior of the MBR is just to find the first bootable partition and then load and run its boot code, but it’s possible to do fancier things like display a menu of bootable partitions and such.
For our purposes, the boot code in the MBR will be the “operating system”; i.e., we’ll write the code we want to run directly into the boot code, rather than using a generic boot loader and writing our code into a partition’s boot record. This means that our “operating system” will be the only operating system allowed on the disk.
The MBR is limited to 512 bytes, but these 512 bytes include, just before the end, the partition table, defining what partitions are on the disk. We’ll leave this table blank (filled with 0s), as it won’t really matter for us, but you should be aware that it’s there. Together with the disk ID, the reserved bytes, and the signature, this means we technically only have 440 bytes available in the MBR for our code.
Field | Byte offsets (size) |
---|---|
Bootloader | 0-439 (440 bytes) |
Disk ID | 440-443 (4 bytes) |
reserved, must be 0 | 444-445 (2 bytes) |
1st partition entry | 446-461 (16 bytes) |
2nd partition entry | 462-477 (16 bytes) |
3rd partition entry | 478-493 (16 bytes) |
4th partition entry | 494-509 (16 bytes) |
Signature, must be 0xaa55 | 510-511 (2 bytes) |
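As a rough illustration of this layout, the following Python sketch unpacks a 512-byte sector into the fields from the table. It is only a sketch: the function name parse_mbr and the dictionary keys are made up for this example, and the blank sector it parses is built in memory rather than read from a real disk.

```python
import struct

SECTOR_SIZE = 512

def parse_mbr(sector: bytes) -> dict:
    """Split a 512-byte MBR into the fields of the layout table."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly 512 bytes")
    # Little-endian, no padding: 440 bytes of boot code, 4-byte disk ID,
    # 2 reserved bytes, four 16-byte partition entries, 2-byte signature.
    fields = struct.unpack("<440sIH16s16s16s16sH", sector)
    return {
        "boot_code": fields[0],
        "disk_id": fields[1],
        "reserved": fields[2],
        "partitions": list(fields[3:7]),
        "signature": fields[7],
    }

# A blank MBR like ours: everything zero except the signature.
blank = bytes(510) + struct.pack("<H", 0xAA55)
mbr = parse_mbr(blank)
print(len(mbr["boot_code"]), hex(mbr["signature"]))  # 440 0xaa55
```

Note that the signature word 0xaa55 is stored little-endian, so on disk byte 510 holds 0x55 and byte 511 holds 0xaa.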
16-bit real mode
When the system starts up, it is running in 16-bit “real mode” for compatibility with older software. Although eventually we will (hopefully) transition to 64-bit mode so we can do the things we expect, for now, it will be easier for us to adapt to working in 16-bit mode. Also, while in 16-bit mode, we have access to the BIOS, a built-in set of utility operations which allow us to perform input/output relatively easily. After we switch to 64-bit mode, the BIOS is unavailable, so communicating with the user becomes much more difficult and complex.
In 16-bit mode, although we have access to the 32-bit registers, all memory offsets are 16 bits. I.e., with offsets alone we can only access 64KB of memory! That’s clearly not what we want, so 16-bit mode makes heavy use of segmentation. Segmented memory is enabled on startup; we don’t have to do anything to turn it on.
Memory locations are of the form SEGMENT:ADDRESS and the effective address is computed as SEGMENT * 0x10 + ADDRESS. E.g.,
mov word [es:si], ax
will move the word currently in ax into the memory location es * 0x10 + si.
es is one of the segment registers. In a memory operand, the segment part of an address always comes from a segment register; a constant segment can appear only in a far jmp or far call. Most instructions will use a default segment: e.g., mov defaults to ds, the data segment (unless the address is based on bp, which defaults to ss), so
mov word [si], ax
is actually equivalent to
mov word [ds:si], ax
Every memory access therefore involves a segment register, either named explicitly or implied by default.
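The address calculation can be checked with a few lines of Python. This is just a sketch of the arithmetic, not of anything the CPU exposes; the helper name linear_address is an assumption of this example, and the sample values (0xB800 for text-mode video memory, 0x7C00 for the boot sector) are the ones used later in these notes.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode address calculation: SEGMENT * 0x10 + ADDRESS."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return segment * 0x10 + offset

# Text-mode video memory, reached through segment 0xB800.
video = linear_address(0xB800, 0x0000)

# The boot sector's load address, written two different ways.
boot_a = linear_address(0x0000, 0x7C00)
boot_b = linear_address(0x07C0, 0x0000)

print(hex(video), hex(boot_a), hex(boot_b))  # 0xb8000 0x7c00 0x7c00
```

The last two results are equal, which shows that different SEGMENT:ADDRESS pairs can name the same byte of memory.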
If both SEGMENT and ADDRESS are limited to 16 bits, how much memory can we access? 64KB × 0x10 = 1MB (strictly, the largest address we can form is 0xFFFF × 0x10 + 0xFFFF = 0x10FFEF, slightly more than 1MB). Note that addresses in a 1MB address space require 20 bits to be represented; the original memory controller for x86 only had 20 address lines. It is possible to form addresses higher than 1MB; however, the original behavior was to wrap these addresses around to the bottom of memory, and this is still what many PCs do by default. To access more memory, the A20 line (the 21st address line) must be enabled.
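To make the wrap-around concrete, here is a small Python model of the 20-line behavior. It is a simplified sketch: the function name physical_address and the a20_enabled flag are inventions of this example, and real hardware has more ways of controlling A20 than a single boolean.

```python
def physical_address(segment: int, offset: int, a20_enabled: bool = False) -> int:
    """Model of real-mode addressing with and without the A20 line."""
    linear = segment * 0x10 + offset
    if a20_enabled:
        return linear            # bit 20 is passed through
    return linear % (1 << 20)    # only 20 address lines: wrap at 1MB

# The largest address a SEGMENT:ADDRESS pair can form.
highest = 0xFFFF * 0x10 + 0xFFFF

wrapped = physical_address(0xFFFF, 0x0010)                      # A20 off
unwrapped = physical_address(0xFFFF, 0x0010, a20_enabled=True)  # A20 on

print(hex(highest), hex(wrapped), hex(unwrapped))  # 0x10ffef 0x0 0x100000
```

With A20 disabled, FFFF:0010 lands back on address 0, which is where the interrupt vector table lives.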
Access to data/code in the current segment is usually called near, while
access to data/code in a segment other than the current one is called far.
The latter requires loading the relevant segment register first, and thus is
slower. E.g., some older versions of C made a distinction between “near”
pointers (pointers to data in the current segment) and “far” pointers (a
pointer to a different segment). Note that the two kinds of pointers have
completely different representations! A near pointer is just an unsigned 16-bit
value, but a far pointer must store both the segment and the address, and
thus must be 32 bits. (Even worse, consider what happens when p == q where p and q are pointers to different, but possibly overlapping, segments.)
Segment registers
The segment registers are:
Register | Purpose |
---|---|
CS | Code segment (used by jmp) |
DS | Data segment (used by mov) |
SS | Stack segment (used by push) |
ES | Extra segment (used by string ops) |
FS, GS | General-purpose segments |
The string operations which use both si and di implicitly use both the data and extra segments: ds:si and es:di.
To write a value into a segment register, we have to first move it into a general-purpose register, and then into the segment register:
mov ax, 0x1000 ; Segment 0x1000 starts at address 0x10000
mov ds, ax
This means that a “far” memory access requires three instructions instead of just one for a “near” access:
mov ax, 0x1000
mov fs, ax
mov dword [fs:addr], ebx
Segment registers can also be pushed/popped onto/from the stack. The cs register cannot be directly modified, as it controls where the currently-executing program is located in memory; i.e., the instruction pointer is effectively cs:ip. It is changed implicitly by a “far jmp” (a jump of the form jmp SEGMENT:ADDRESS), a far call, or a far ret. Because the segment registers have such a big effect on memory access, they should all be treated as callee-preserved: a function that needs to change one should push it first and pop it again before returning.
Some things to note:
Segments can overlap! This happens because the calculation for effective address is 0x10 * segment + offset, so consecutive segment values start only 16 bytes apart.
The values in the segment registers are independent of each other; a change to one has no effect on the others.
In order for the stack to work, the stack segment register ss must be set up in such a way that the stack doesn’t overlap with memory used through the other segments.
32-bit mode
In 32-bit mode, without paging enabled, the segment part of each address is
no longer just multiplied by 0x10
, but rather it is treated as an index into
a table of segment descriptors, called the global descriptor table. The GDT
stores, for each segment, its starting address and length, as well as a few
other bits of information.
Of course, in 32-bit mode, addresses are already 32 bits, so the easiest thing to do is to fill the GDT with descriptors for segments that start at address 0 and are as large as the address space, and then point every segment register at one of them (entry 0 of the GDT is a mandatory null descriptor, so the usable descriptors start at GDT[1]). Every logical address then maps directly to the same physical address. This is called “32-bit flat mode”.
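The table lookup can be modelled in a few lines of Python. This is a deliberately simplified sketch: a real descriptor packs the base, the limit, and several flag bits into 8 bytes, and a real segment register holds a selector rather than a bare index. The names Descriptor and translate are assumptions of this example. On real hardware entry 0 is a reserved null descriptor, so the sketch places its segments at index 1.

```python
from typing import List, NamedTuple, Optional

class Descriptor(NamedTuple):
    base: int   # linear address where the segment starts
    limit: int  # largest valid offset within the segment

def translate(gdt: List[Optional[Descriptor]], index: int, offset: int) -> int:
    """Look up a segment in the table and add the offset to its base."""
    descriptor = gdt[index]
    if descriptor is None:
        raise ValueError("null descriptor")
    if not 0 <= offset <= descriptor.limit:
        raise ValueError("offset outside segment limit")
    return descriptor.base + offset

# Flat mode: one segment covering the whole 32-bit address space.
flat_gdt = [None, Descriptor(base=0, limit=0xFFFFFFFF)]
# For contrast, a 64KB segment starting at address 0x10000.
small_gdt = [None, Descriptor(base=0x10000, limit=0xFFFF)]

print(hex(translate(flat_gdt, 1, 0x7C00)))   # 0x7c00
print(hex(translate(small_gdt, 1, 0x0020)))  # 0x10020
```

In the flat table the offset comes back unchanged, which is the whole point: segmentation is still active, but it no longer alters addresses.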
Memory map
The 1MB of memory that is traditionally available to us in 16-bit mode can be mapped out as:
start | end | size | type | description |
---|---|---|---|---|
Low Memory (the first MiB) | | | | |
0x00000000 | 0x000003FF | 1 KiB | RAM - partially unusable | Real Mode IVT (Interrupt Vector Table) |
0x00000400 | 0x000004FF | 256 bytes | RAM - partially unusable | BDA (BIOS data area) |
0x00000500 | 0x00007BFF | almost 30 KiB | RAM (guaranteed free for use) | Conventional memory |
0x00007C00 (typical location) | 0x00007DFF | 512 bytes | RAM - partially unusable | Your OS BootSector |
0x00007E00 | 0x0007FFFF | 480.5 KiB | RAM (guaranteed free for use) | Conventional memory |
0x00080000 | 0x0009FFFF | 128 KiB | RAM - partially unusable | EBDA (Extended BIOS Data Area) |
0x000A0000 | 0x000FFFFF | 384 KiB | various (unusable) | Video memory, ROM Area |
The BIOS Data Area and Extended BIOS Data Area are essentially “RAM for the BIOS”. As the BIOS has functions and global variables of its own, it needs a place to put its stack and data segments. A few bytes of the BDA are standardized and useful:
0x040E – Segment of the EBDA, i.e., its address divided by 16 (so we know where not to touch)
0x041E – 32 bytes of keyboard input buffer
0x0475 – Number of hard drives
Input/Output
x86 systems support several methods for performing input/output:
Ports: the CPU has a set of ports, with dedicated instructions for reading from (in) and writing to (out) a specific port. Ports can be connected to specific I/O devices. Often, a single device will have multiple ports: a port for sending controls (i.e., instructions to start/stop operation), a port for sending data, a port for reading status, and sometimes a separate port for receiving data.
Controlling the keyboard is done via a port (not receiving keypress data; that happens via interrupts). Example: to turn on the Num-Lock light, we can do:
wait1: in al, 0x64 ; Read kbd status register
test al, 2 ; Status of input buffer (0 = empty)
jz ok1 ; Jump if empty
jmp wait1 ; Otherwise loop
ok1: mov al, 0xed ; Command = set LEDs
out 0x60, al ; Send command
wait2: in al, 0x64 ; Read status
test al, 2 ; Empty input?
jz ok2 ; Jump if empty
jmp wait2 ; Otherwise loop
ok2: mov al, 0x02 ; Command data = turn on Num Lock (bit 1; bit 0 is Scroll Lock)
out 0x60, al ; Send command data
The loops wait1 and wait2 are necessary because the keyboard controller may be doing other things, so we need to wait for it to be ready (input buffer empty) before sending it commands. This pattern is fairly common when using port I/O: because the devices we are communicating with are external to the CPU, they are not necessarily available all the time, and hence we may have to wait for them to be ready.
Memory-mapped I/O: some I/O devices are connected to the memory bus, intercepting memory requests for specific addresses. Reads from these addresses return not the values stored on the memory chips, but the values provided by the device. Similarly, writes to these addresses are not necessarily stored in memory, but are sent to the device.
Video RAM is typically memory-mapped; there is a specific range of (physical) addresses which is mapped to the characters or pixels currently displayed on the screen. A write to one of these addresses instantly updates the screen. For example, later we’ll use this (essentially) to write text to the screen:
mov byte [0xb8000], 'H'
Video memory is mapped starting at address 0xb8000. In text modes, this is mapped to the characters displayed; in graphics modes, it is mapped to the pixels on the display.
Interrupts: An interrupt is a system-wide “hook” which can be triggered by our code, or by an I/O event (though usually not both on the same interrupt). Like syscalls, an interrupt is essentially a jmp to code running somewhere else, but all other aspects of the system (register values, stack, etc.) remain the same; thus, registers are used to communicate with the interrupt code (called its handler).
Interrupts
At its most basic level, the interrupt system is just an array of 256 pointers to functions (dwords, because they include both the segment and offset), stored starting at memory address 0x0. When interrupt number n occurs, the system looks up the entry at index n in the table and calls its function. Some of these functions are provided for us, by the BIOS, and are essential to the operation of the system. Others are just do-nothing stub functions, intended to be replaced by pointers to our functions, so that we can control what happens when the specified interrupt occurs.
Interrupts can be divided into three groups:
Exceptions are generated by the CPU in response to exceptional situations (segment violation, floating-point exception, etc.).
Interrupt requests (IRQs or hardware interrupts) are generated by hardware devices, to inform the CPU and the code running on it that something has happened. The CPU has a number of lines which, when triggered, cause specific interrupt numbers to be fired. I/O devices are wired to specific lines, so that an event on an I/O device fires an interrupt.
Some hardware interrupts fire a specific interrupt with no further information – if further information is available, we may have to (e.g.) query it from a port. Others will write information about the interrupt to a specific memory location and then fire the interrupt, so that the information is available directly to the interrupt-handler. This second method is called “message-based interrupts”.
Software interrupts are triggered by our code, intentionally, to request some service from the operating system or hardware. These are equivalent to syscalls; indeed, the syscall instruction replaces the old interrupt 0x80 used to communicate with the OS.
Often software interrupts use the values in the registers ax, bx, cx, and dx to communicate parameters to the interrupt. It’s common for ax (usually its high byte, ah) to specify a subfunction, allowing many different operations to be mapped to the same interrupt number. For example, all display-related functions are mapped to interrupt 0x10, with the subfunction in ah specifying the exact operation: change display mode, write text, move cursor, etc. Interrupt 0x13 contains disk-related functions, and so forth.
Software interrupts are also how we communicate with the BIOS, in 16-bit mode. BIOS wraps the low-level I/O devices (keyboard, display) in a set of convenience operations, so that we can print to the screen and read input from the keyboard in an easier way. On the other hand, these facilities are only available in 16-bit real mode; once we switch to either 32- or 64-bit mode, they become unavailable. It’s possible to switch back to 16-bit mode to call a BIOS routine, and then restore the 64-bit mode after it’s done, but this is so slow and cumbersome that it’s not usually worth the effort.
For a few operations, the real-mode interrupt is the only way to access the desired functionality, so we have to switch to real-mode and back. For example, the interrupt to switch video modes, 0x10, is only available via real mode, and is the only way to switch video modes.
The interrupts are mapped to the code that handles them through the interrupt vector table. The IVT in 16-bit mode is stored in the first 256×4 bytes of memory, starting at physical address 0. Each dword is the address of an interrupt handler, a procedure which the processor runs automatically when the corresponding interrupt occurs: entry n handles interrupt n. The addresses are always stored as SEG:ADDR pairs, with the offset in the low word and the segment in the high; the effective address is computed as 0x10 * SEG + ADDR. Some interrupts do double-duty: they are fired when an event occurs, but are also usable by our code to call a BIOS function. Many interrupts have subfunctions, specified by the value in ah or other registers.
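The table layout can be illustrated with a short Python sketch that decodes one IVT entry from a copy of low memory. Everything concrete here is hypothetical: the function name handler_address is made up, the memory is a zero-filled bytearray rather than a dump from a real machine, and the handler location C000:1234 was chosen only for the example.

```python
import struct

def handler_address(memory: bytes, n: int) -> tuple:
    """Return (segment, offset, linear address) of the handler for interrupt n."""
    if not 0 <= n <= 255:
        raise ValueError("interrupt numbers run from 0 to 255")
    # Each entry is 4 bytes: offset in the low word, segment in the high word.
    offset, segment = struct.unpack_from("<HH", memory, n * 4)
    return segment, offset, segment * 0x10 + offset

# Simulated first 1KB of memory, with a made-up handler for interrupt 0x10.
memory = bytearray(256 * 4)
struct.pack_into("<HH", memory, 0x10 * 4, 0x1234, 0xC000)

segment, offset, linear = handler_address(memory, 0x10)
print(hex(segment), hex(offset), hex(linear))  # 0xc000 0x1234 0xc1234
```

Installing our own handler is the reverse operation: write our routine's offset and segment into the 4 bytes at address n * 4.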
In 32- and 64-bit modes, the IVT is called the interrupt descriptor table. Each entry is more than just an address, and the location of the table is not fixed at address 0x0, but specified via the idtr register (loaded with the lidt instruction).
For a reference to every interrupt ever (and a website straight outta 1997), see Interrupt Jump Table.
A simple bootloader
We’ll begin by writing a simple bootloader program which will just display Hello, world! on the screen and then go into an infinite loop, effectively hanging the (virtual!) machine.
Structure of a bootloader
A bootloader must be exactly 512 bytes long, and must end with the word value 0xAA55, which tells the BIOS that this is a valid (bootable) record. The BIOS will load the entire 512 bytes into memory, starting at address 0x7c00.
Assembling the bootloader
We will have to assemble the bootloader manually, not using asm, as we do not want to generate an object file. Instead, we want a raw .bin file containing nothing but the assembled machine code. We also have to inform the assembler of the starting address of our code, via the org directive (remember that the system will load the 512 bytes of the bootloader into memory at address 0x7c00):
;;
;; hello-boot.s
;;
bits 16
org 0x7c00
; Boot code begins here
; ...
; Hang system ("loop" is an instruction mnemonic, so we use a different label)
forever: jmp forever
; Pad remainder with 0 bytes
times 510 - ($ - $$) db 0
; Write boot signature at end
dw 0xaa55
Note that while we can create a .s file which results in a binary larger than 512 bytes, and we can create a disk image containing this binary, the system will only load the first 512 bytes into memory; if we want to load any additional code from disk, we have to do it manually.
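The effect of the padding and signature lines in the listing above can be reproduced in Python, which may help when checking an assembled boot.bin by hand. This is only a sketch of the arithmetic: the function name make_boot_sector is an assumption, and the two code bytes used are the encoding of a short jump to itself (the same kind of hang as the infinite loop in the listing).

```python
def make_boot_sector(code: bytes) -> bytes:
    """Pad code to 510 bytes and append the 0xAA55 boot signature."""
    if len(code) > 510:
        raise ValueError("boot code must leave room for the 2-byte signature")
    padding = bytes(510 - len(code))     # like: times 510 - ($ - $$) db 0
    signature = (0xAA55).to_bytes(2, "little")  # like: dw 0xaa55
    return code + padding + signature

# EB FE is a two-byte short jmp that targets itself.
hang = bytes([0xEB, 0xFE])
sector = make_boot_sector(hang)

print(len(sector), sector[-2:].hex())  # 512 55aa
```

Because dw stores the word little-endian, the last two bytes of the file are 0x55 followed by 0xaa.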
The bits directive tells YASM to generate 16-bit code. This is not strictly necessary, as it will default to generating 16-bit code when outputting a bin file, but helps to make it obvious what we’re doing.
To assemble this we run
yasm hello-boot.s -f bin -o boot.bin
Writing to the screen: memory-mapped IO
To write Hello, world! to the screen, we’ll embed the string in our code, somewhere where it won’t be interpreted as instructions, and then copy it to the memory-mapped display address. There are no sections in this program, so any data has to be placed within the program, and the program set up so that execution jumps over (or never reaches) the data.
In text mode, the display memory starts at address 0xB8000. Each character cell consists of two bytes:
the first is the character, and the second is its attributes (foreground and
background color). When we copy the string to the screen, we have to copy the
characters to the character bytes, and skip over the attribute bytes.
In addition, in 16-bit mode, the registers that can be used as indexes in memory operands are very restricted: only bx, bp, si, and di can be used (and bp defaults to the stack segment), and no scale is allowed.
; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10
; Setup fs segment to access video RAM
mov ax, screen_addr / 16
mov fs, ax
; Print text
mov cx, strlen ; Loop counter
mov bx, 0 ; String index
mov si, 0 ; Memory index
print:
mov al, byte [bx + string] ; Load current char. (into al, so bx is not clobbered)
mov byte [fs:si], al ; Print to screen
inc bx
add si, 2 ; Skip over the attribute byte
loop print
; Infinite loop
forever: jmp forever
; Never executed (it follows the infinite loop), so safe to store data here
string: db "Hello, world!"
strlen: equ $-string
screen_addr: equ 0xb8000
; Padding bytes and signature...
To be on the safe side, we specifically set the video mode to text mode. Most BIOSes will do this for us, but a few will not.
Note that because the bin file format does not support sections, there is no .data section: we simply place the string directly in the executable itself, placing it after the infinite loop so that it will not be executed.
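The two-bytes-per-cell layout can be modelled in Python with a bytearray standing in for video memory. This is a simulation only (nothing here touches a real display), and the names cell_offset and write_text are made up for the example. Like the assembly above, it writes only the character bytes and leaves the attribute bytes alone.

```python
COLUMNS = 80
ROWS = 25

def cell_offset(row: int, col: int) -> int:
    """Byte offset of a character cell from the start of video memory."""
    return (row * COLUMNS + col) * 2

def write_text(vram: bytearray, row: int, col: int, text: str) -> None:
    """Store each character in its cell, skipping the attribute bytes."""
    start = cell_offset(row, col)
    for i, ch in enumerate(text.encode("ascii")):
        vram[start + 2 * i] = ch

# Stand-in for the 80x25 text buffer at 0xB8000.
vram = bytearray(COLUMNS * ROWS * 2)
write_text(vram, 0, 0, "Hello, world!")

print(chr(vram[0]), chr(vram[2]), cell_offset(1, 0))  # H e 160
```

The offset of the second row is 160 rather than 80, which is why the assembly loop advances si by 2 for every character.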
Assembly, disk image, run in QEMU
To assemble this we run
yasm hello-boot.s -f bin -o boot.bin
This creates the pure binary file boot.bin. From this, we will have to create a disk image so that the emulator can boot it.
dd if=boot.bin of=boot.dsk bs=512 count=2880
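This disk image emulates a traditional 1.44MB floppy disk (2880 sectors of 512 bytes). Note that dd stops copying when it reaches the end of boot.bin, so the image only reaches the full floppy size if the input is at least that large. We’ll have to repeat this step every time we modify our bootloader’s code.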
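For reference, the size arithmetic behind a full floppy image can be sketched in Python. This is an illustrative alternative, not a replacement for the dd step: the function name make_floppy_image is an assumption, and the boot sector here is a zero-filled placeholder with only the signature set.

```python
SECTOR_SIZE = 512
FLOPPY_SECTORS = 2880  # 2880 * 512 bytes = 1,474,560 bytes

def make_floppy_image(boot_sector: bytes) -> bytes:
    """Place the boot sector first and zero-fill the remaining sectors."""
    if len(boot_sector) != SECTOR_SIZE:
        raise ValueError("the boot sector must be exactly 512 bytes")
    return boot_sector + bytes(SECTOR_SIZE * (FLOPPY_SECTORS - 1))

# Placeholder boot sector: all zeros plus the 0xAA55 signature.
boot_sector = bytes(510) + bytes([0x55, 0xAA])
image = make_floppy_image(boot_sector)

print(len(image))  # 1474560
```

Writing the result to disk would be a matter of opening boot.dsk in binary mode and writing image to it.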
Finally, we can start the emulator with the specified disk:
qemu-system-i386 -curses -drive format=raw,file=boot.dsk
(Note that this is actually loading our disk as a hard drive disk image, rather than as a floppy disk. It doesn’t make a difference right now, as they both have the same MBR structure, but later, when we want to read more from the disk than just the bootloader, we’ll need to know where we were booted from.)
To quit (because the emulator is intercepting all our key presses!) press Esc followed by 2, to switch to the QEMU console, and then type quit.
Writing to the screen, method 2: BIOS
Instead of writing directly to video memory, we can also use BIOS calls to write one character at a time. Most of the BIOS calls for dealing with video are via interrupt 0x10, the same interrupt we used above to set the video mode.
All of the interrupts we want to use will use ah for the subfunction, bh as the page number (for us this should always be 0), as well as cx and dx for various things. We’ll have to do a bit more work to make this fit with our registers.
To write a single character to the screen, we invoke interrupt 0x10, subfunction ah = 0x0a. al should be the character to display. Note that this doesn’t move the cursor! To set the cursor position, we have to use subfunction ah = 0x02, with the cursor row/column in dh:dl.
To write the character in al on the screen, at the current cursor location:
mov ah, 0x0a
mov al, character...
mov cx, 1
int 0x10
We have to keep track of the cursor position ourselves, but fortunately bx (the current index into the string) can do double-duty as the cursor position.
mov ah, 0x02 ; Subfunction = set cursor pos.
mov dh, 0 ; Cursor row = 0
mov dl, bl ; Cursor col = string index
int 0x10
Shuffling a few registers around gives us
bits 16
org 0x7c00
; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10
; Print text
mov si, 0 ; Memory index/cursor position
print:
; Print character
mov ah, 0x0a ; Subfunction = write char
mov al, byte [si + string]
mov bh, 0 ; Page = 0
mov cx, 1 ; Write count = 1
int 0x10
; Move cursor
inc si
mov ah, 0x02 ; Subfunction = set cursor pos.
mov bh, 0 ; Page = 0
mov dx, si ; Cursor col (dl) = low byte of si
mov dh, 0 ; Cursor row = 0
int 0x10
cmp si, strlen
jne print
; Infinite loop
forever: jmp forever
; Never executed (it follows the infinite loop), so safe to store data here
string: db "Hello, world!"
strlen: equ $-string
screen_addr: equ 0xb8000
; Pad remainder with 0 bytes
times 510 - ($ - $$) db 0
; Write boot signature at end
dw 0xaa55
You may note a difference between this and the previous example: the cursor in this example is left at the end of the string printed, while in the direct-memory-write example it is left at the beginning, because writing directly to memory has no effect on the cursor position.
Connecting a debugger
It’s possible to connect a debugger (GDB) to the virtual machine, so we can debug our code. To do this, you’ll have to choose a port which no one else on the server is using. I’d suggest picking a random port between 9000 and 10000.
Start the emulator with
qemu-system-i386 -S -gdb tcp::9XXX -curses -drive format=raw,file=boot.dsk
(Replace XXX with your chosen port.)
This will start the emulator stopped, not running, waiting for a connection from GDB. Then, open another SSH session to the server, run GDB (gdb) and type
target remote localhost:9XXX
using the same port number as above. From there, you can set breakpoints (but only at numerical addresses), use c (continue) to continue the boot process, etc.