Operating systems: part 2

Review of last time

We built a simple bootloader which used BIOS services to print the text “Hello, world!” to the screen:

; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10

; Print text
mov si, 0          ; Memory index/cursor position
print:
    ; Print character
    mov ah, 0x0a    ; Subfunction = write char
    mov al, byte [si + string]
    mov bh, 0       ; Page = 0
    mov cx, 1       ; Write count = 1
    int 0x10

    ; Move cursor
    inc si
    mov ah, 0x02    ; Subfunction = set cursor pos.
    mov bh, 0       ; Page = 0
    mov dx, si      ; Cursor col (dl) = low byte of si
    mov dh, 0       ; Cursor row = 0
    int 0x10

    cmp si, strlen
    jne print

; Infinite loop
forever: jmp forever

; Data: unreachable as code, since it follows the infinite loop
string:         db      "Hello, world!"
strlen:         equ     $-string
screen_addr:    equ     0xb8000

Today, we’re going to try to get ourselves into 32-bit protected mode (at startup, the system is in 16-bit real mode for compatibility with old software). This involves a fair bit of code, so first we’ll need to write a two-stage bootloader. The first stage, which is loaded automatically and can be at most 440 bytes, exists only to (manually) load the second stage from disk and then start running it. The second stage will have no size restrictions.

Two-Stage Bootloader

Because the BIOS only loads the first 512 bytes of the disk into memory automatically, we will have to load the remainder ourselves. This will involve invoking interrupt 0x13, which is used for disk-related operations. We have to perform two steps:

Reset the disk (subfunction ah = 0x0)
Perform an extended read to load n blocks into memory (subfunction ah = 0x42)

For the extended read, dl should hold the drive number (0x80 for the first hard disk), and ds:si should contain the address of a structure describing what we want to load and where to load it:

struct disk_addr_pkt {
    unsigned char      sz;       // Size of packet = 0x10
    unsigned char      _res;     // Reserved, must be 0
    unsigned short     blk_cnt;  // How many blocks to transfer?
    unsigned short     buf_off;  // Address to load into: offset...
    unsigned short     buf_seg;  // ...and segment
    unsigned long long blk_num;  // Starting block number (64 bits)
};

“Blocks” are not bytes; 1 block = 512 bytes. (The disk may think of “blocks” differently, but so much code is written using this assumption that the BIOS does the translation for us.)

The disk address packet must be aligned in memory to a multiple of 2 (i.e., on a word boundary). The buffer address is a real-mode segment:offset pair, stored offset first; since our segment is 0, we can write it as a single 32-bit value. The “size” of the packet is stored inside the packet, because there are two versions of the packet structure: the 16-byte version we are using above, and a longer version which adds a 64-bit flat buffer address.
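
As a cross-check of the layout, here is a minimal C sketch of the 16-byte packet. The names dap16, blocks_for and make_dap are hypothetical, and the fixed-width types are an assumption made so that the field sizes are explicit; none of this is part of the bootloader itself. Note that blocks_for rounds up to whole blocks, rather than always adding one block as the assembly listing below does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; mirrors the 0x10-byte packet for int 0x13, ah = 0x42. */
struct dap16 {
    uint8_t  sz;       /* Size of packet = 0x10 */
    uint8_t  reserved; /* Must be 0 */
    uint16_t blk_cnt;  /* Number of 512-byte blocks to transfer */
    uint16_t buf_off;  /* Destination buffer: offset... */
    uint16_t buf_seg;  /* ...and segment */
    uint64_t blk_num;  /* Starting block number */
};

/* The fields are naturally aligned, so no packing attribute is needed. */
_Static_assert(sizeof(struct dap16) == 0x10, "packet must be 16 bytes");
_Static_assert(offsetof(struct dap16, blk_num) == 8, "block number at offset 8");

/* Number of 512-byte blocks needed to hold `bytes` bytes, rounding up. */
static uint16_t blocks_for(uint32_t bytes) {
    assert(bytes <= 0xffffu * 512u);  /* result must fit in 16 bits */
    return (uint16_t)((bytes + 511u) / 512u);
}

/* Build a packet that loads `bytes` bytes, starting at block `first_blk`,
   into the real-mode address seg:off. */
static struct dap16 make_dap(uint16_t seg, uint16_t off,
                             uint64_t first_blk, uint32_t bytes) {
    struct dap16 p;
    p.sz       = 0x10;
    p.reserved = 0;
    p.blk_cnt  = blocks_for(bytes);
    p.buf_off  = off;
    p.buf_seg  = seg;
    p.blk_num  = first_blk;
    return p;
}
```

For example, make_dap(0, 0x7e00, 1, 1300) describes a three-block read into the memory immediately after the boot sector.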

;;;
;;; two-stage.s
;;; Illustrates a two-stage loader, where the first stage invokes the BIOS
;;; to load the second stage.
;;;

bits 16
org 0x7c00

start:
origin:     equ         0x7c00
blk_count:  equ         (end - loaded_code) / 512 + 1

; -----------------------------------------------------------------------------
; First stage loader

; Reset disk
mov ah, 0x0         ; Subfunction reset
mov dl, 0x80        ; Disk number
int 0x13

; Load blocks 
mov ah, 0x42        ; Int 0x13, subfunction Extended Read
mov dl, 0x80        ; Drive num
mov si, disk_packet ; Packet address
int 0x13

jmp loaded_code

; ----------------------------------------------------------------------------
; Begin "pseudo-data" section

string:         db      "Hello, world!"
strlen:         equ     $-string
screen_addr:    equ     0xb8000

align 2 
disk_packet:    db      0x10            ; Packet size
                db      0               ; Reserved
                dw      blk_count       ; Block count
                dd      loaded_code     ; Addr. to load (offset, then segment = 0)
                dq      1               ; Starting block (64 bits)

; Pad remainder with 0 bytes
times 510 - ($ - $$)    db 0

; Write boot signature at end
dw 0xaa55

; -----------------------------------------------------------------------------
; Begin second-stage loader

loaded_code:

; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10

; Print text
mov si, 0          ; Memory index/cursor position
print:
    ; Print character
    mov ah, 0x0a    ; Subfunction = write char
    mov al, byte [si + string]
    mov bh, 0       ; Page = 0
    mov cx, 1       ; Write count = 1
    int 0x10

    ; Move cursor
    inc si
    mov ah, 0x02    ; Subfunction = set cursor pos.
    mov bh, 0       ; Page = 0
    mov dx, si      ; Cursor col (dl) = low byte of si
    mov dh, 0       ; Cursor row = 0
    int 0x10

    cmp si, strlen
    jne print

; Infinite loop
forever: jmp forever

end:

; Pad so there's a good number of blocks used in the disk
times 1024 * 1024  db 0

Entering 32-bit protected mode

Now that we have a lot more room for code (and data), we can work on switching the system to 32-bit protected mode. The basic steps for entering 32-bit protected mode are

Disable interrupts. We don’t want an interrupt to fire while we are changing the system mode, as the interrupt handler won’t work correctly.
Enable the A20 line, to allow for the larger address space. (Remember that in 16-bit mode, we can only access a 20-bit address space.)
Load the Global Descriptor Table (GDT) with segment descriptors. In 32-bit mode, the values in the segment registers are no longer used directly as segment addresses; instead they select entries in a table, the GDT, where each entry contains information about that segment.
Switch to 32-bit mode, by setting the low bit of register CR0.

See http://www.osdever.net/tutorials/view/the-world-of-protected-mode for a tutorial on entering 32-bit protected mode.

Disabling interrupts

There are several situations where we want to ensure that interrupts do not interrupt our code. The simplest way to get this is to disable them entirely. Whether or not interrupts are enabled is controlled by the interrupt flag IF, which can be cleared with the cli instruction, and set (re-enabled) with sti. So to turn off interrupts temporarily we do

cli
; Remainder of code here

Note that this leaves the non-maskable interrupts still enabled. NMIs are interrupts which are so important they should never be disabled. They are typically fired by RAM errors and other unrecoverable hardware errors. Disabling NMIs is more complex:

in al, 0x70
or al, 0x80  ; Set bit 7
out 0x70, al

Re-enabling NMIs is

in al, 0x70
and al, 0x7f ; Clear bit 7
out 0x70, al

Fortunately, it’s usually OK to leave NMIs enabled.

Enable A20 line

In 16-bit mode, addresses are 20 bits wide. Bits are numbered starting at 0, so each address has bits 0 through 19. The A20 line is the “hidden” 21st address line, carrying bit 20. Old software expects there to be only lines 0-19, so the A20 line starts out disabled and has to be explicitly enabled to gain access to the extended memory.
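
To make the effect of the missing line concrete, here is a small C sketch of how a real-mode segment:offset pair turns into an address, with and without A20. The function name real_mode_addr and its a20_enabled parameter are illustrative assumptions; the arithmetic (segment times 16, plus offset) is the usual real-mode rule.

```c
#include <assert.h>
#include <stdint.h>

/* Address reached by seg:off in real mode. With A20 disabled, bit 20 of the
   sum is dropped, so addresses just past 1MB wrap around to low memory. */
static uint32_t real_mode_addr(uint16_t seg, uint16_t off, int a20_enabled) {
    uint32_t addr = ((uint32_t)seg << 4) + off;
    assert(addr <= 0x10ffefu);  /* largest possible value: 0xffff:0xffff */
    if (!a20_enabled)
        addr &= 0xfffffu;       /* keep only address bits 0-19 */
    return addr;
}
```

With A20 disabled, 0xffff:0x7e0e and 0x0000:0x7dfe refer to the same byte; with A20 enabled, the first refers to 0x107dfe instead. This is the property that the A20 check exploits.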

There are several ways to enable A20, all of them weird. The system designers didn’t have an easy way to add an extension like this, so they had to find some other part of the system which had an unused port available. In order from least to most sketchy, these methods are

Check to see whether it’s already enabled. Some systems start with A20 enabled. This can be done by comparing the values at two addresses which map to the same physical address if A20 is disabled, but to different addresses if it is enabled. E.g.,

 mov ax, 0
 mov es, ax         ; Extra segment = 0
 not ax
 mov ds, ax         ; Data segment = 0xffff

 mov di, 0x7dfe     ; 0x0000:0x7dfe = the boot signature at the
                    ; end of the first-stage bootloader
 mov si, 0x7e0e     ; 0xffff:0x7e0e = the same address + 1MB

 inc byte [es:di]        ; Change the low copy first...
 mov al, byte [ds:si]    ; ...then read through the high address
 cmp al, byte [es:di]
 jne a20_enabled         ; Different values, so different memory

 ; A20 not enabled, so enable it...

 a20_enabled:
 ...

BIOS function: Interrupt 0x15 has a subfunction which can be used to enable the A20 line. Set ax = 0x2401 and trigger interrupt 0x15. Int 0x15 has a few other A20-related functions: subfunction 0x2403 can be used to check whether the BIOS supports this operation, and if it does, 0x2402 can be used to check the current status of the A20 line.
Keyboard controller: The original, and craziest, way uses a spare output pin on the keyboard controller. Disable interrupts, then send keyboard controller commands 0xad (disable keyboard) and 0xd0 (read output port), read the resulting byte, send command 0xd1 (write output port), write the byte back with bit 1 set, send command 0xae (re-enable keyboard), and re-enable interrupts.
Fast A20: Fast A20 support is available on newer computers, and uses bit 1 on port 0x92:
```
 in al, 0x92
 or al, 2       ; Set bit 1
 out 0x92, al
```
Checking bit 1 will also tell you if the A20 is already enabled. The downside to this method is that there is no way to check in advance whether the system supports it! And if it isn’t supported, writing to the port may do something completely different, like clearing the screen, or crashing the system. Furthermore, despite the name, the “Fast” A20 may in fact take a while to take effect, so you should follow it with a loop which checks the A20 status and doesn’t continue until it’s enabled.

QEMU supports all of these methods, so we’ll probably go with the Fast A20 method, although a real operating system would need to use all the methods in succession, as no one method is a sure thing.

Install GDT

The GDT is an array of segment descriptors, where each segment descriptor should have the following form:

struct seg_desc {
    unsigned short limit;           // Segment size (low 16 bits)
    unsigned short base_low;        // Low 16 bits of segment base address
    unsigned char  base_mid;        // Middle 8 bits of seg. base
    unsigned char  type : 5;        // Segment type, attributes
    unsigned char  priv : 2;        // Privilege level
    unsigned char  present : 1;     // Is segment present?
    unsigned char  limit_high : 4;  // High 4 bits of seg. size
    unsigned char  attr : 3;        // More attributes
    unsigned char  granularity : 1; // Affects segment size
    unsigned char  base_high;       // High 8 bits of segment base
};

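Note that segment bases have 32 bits, spread out over the fields base_low, base_mid, and base_high. Segment limits (sizes) have 20 bits, meaning that the largest segment is 1MB when the limit is counted in bytes, or 4GB when the granularity bit is set and the limit is counted in 4KB units. The entire structure is 64 bits, or one qword.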

Entry 0 in the table is reserved for the null segment; if you try to use the null segment, a processor exception will occur, so we load the first entry of the table with 0:

gdt:
    dq 0    ; Null segment

We will define four entries after this in the segment table:

A segment for code, starting at 0x7c00 and extending to TODO
A segment for data, starting at TODO and extending to TODO.
A segment for the stack, overlapping the previous. We’ll set the flags so that the stack grows backwards, from the end of the segment.
A segment for direct access to video memory.

Once we have the GDT set up in memory, we need to load it into the CPU, using the lgdt instruction. The lgdt instruction takes a GDT descriptor, which tells the CPU both the address at which the GDT exists, and also how large it is. The low 16 bits are the size of the GDT minus 1 (in bytes, not in entries!), while the high 32 bits are its linear address in memory.
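
A sketch of that descriptor in C, assuming the 32-bit form of the lgdt operand (a 16-bit limit followed by a 32-bit base address). The names gdt_ptr and make_gdt_ptr are hypothetical, and the packed attribute is a GCC/Clang extension used here so that the compiler does not insert padding after the limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 6-byte value that lgdt reads from memory. */
struct gdt_ptr {
    uint16_t limit;  /* Size of the GDT in bytes, minus 1 */
    uint32_t base;   /* Linear address of the first entry */
} __attribute__((packed));

_Static_assert(sizeof(struct gdt_ptr) == 6, "lgdt operand is 48 bits");

/* Describe a GDT of n_entries 8-byte entries located at gdt_addr. */
static struct gdt_ptr make_gdt_ptr(uint32_t gdt_addr, unsigned n_entries) {
    struct gdt_ptr p;
    assert(n_entries >= 1 && n_entries <= 8192);  /* limit is 16 bits */
    p.limit = (uint16_t)(n_entries * 8 - 1);
    p.base  = gdt_addr;
    return p;
}
```

For the null entry plus four segments, the limit works out to 5 * 8 - 1 = 39.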

Enabling protected mode

This is as simple as setting bit 0 of control register 0 (CR0). We cannot modify the control registers directly, so we have to load it into eax, set the bit there, and then load it back:

mov eax, cr0
or eax, 1
mov cr0, eax

After switching modes, it’s important to clear the pipeline; any instructions still in the pipeline are real-mode, and won’t make sense in the new mode. To do this, all we have to do is issue a far jump, a jmp with an explicit segment. The target can even be in the same segment we are already in; the segment just has to be given explicitly.

jmp 0x08:in_protected_mode

[bits 32] 
in_protected_mode:
...

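(We use segment 0x8, because the code segment is index 1 in the GDT, and each GDT entry is 8 bytes wide. The low three bits of the segment part of the address hold a table indicator and a privilege level, and the remaining bits hold the index; with the low bits left at 0, the value is simply the byte offset of the entry from the beginning of the table.)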
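
The value loaded into a segment register can be computed as in the following C sketch. The function name make_selector is hypothetical; the bit layout (index in bits 15-3, table indicator in bit 2, requested privilege level in bits 1-0) is the standard x86 one.

```c
#include <assert.h>
#include <stdint.h>

/* Build the value loaded into a segment register in protected mode:
   bits 15-3 = table index, bit 2 = table (0 = GDT, 1 = LDT),
   bits 1-0 = requested privilege level. */
static uint16_t make_selector(unsigned index, unsigned use_ldt, unsigned rpl) {
    assert(index < 8192 && use_ldt < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (use_ldt << 2) | rpl);
}
```

GDT entry 1 at privilege level 0 gives 0x08, the value used in the far jump, and entry 2 gives 0x10.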

Each segment entry in the table is called a descriptor (the value loaded into a segment register to choose one is called a selector) and has a rather complex structure, with a total size of 64 bits:

63-56	55-52	51-48	47-40	39-32	31-16	15-0
Base 24:31	Flags	Limit 16:19	Access	Base 16:23	Base 0:15	Limit 0:15

The most important fields are the base and limit, which specify the (linear) base address and size of this segment. Both of these fields are split up: the low 16 bits are stored first, and then later the middle 8, and then the high 8 of the base. The total base is 32 bits, while the total limit is 20 bits. (If the G flag, described next, is set, then the limit is not in bytes, but in 4KB pages.)

The Flags bits are GS00, where the G field specifies the units for limit: G = 0 means that limit is in bytes, while G = 1 means that limit is in units of 4KB (pages). The S bit (called D/B in Intel’s manuals, and unrelated to the S bit of the access byte below) should be 0 for 16-bit protected mode, and 1 for 32-bit protected mode. (It’s possible to mix 16- and 32-bit segments, which can be useful when interoperating with 16-bit code.)

The access byte is broken up as

7	6,5	4	3	2	1	0
Pr	Priv	S	Ex	DC	RW	Ac

Pr: Present, is the segment actually available for use. This should be 1 for any descriptors which are actually in use.
Priv: Privilege, stores the “ring level” of this segment. More-privileged segments cannot be accessed by code running at less-privileged levels. 0 is the most privileged level (kernel), while 3 is the least privileged (user-level code).
S: Segment type. This should be set for code/data segments, and cleared for any system segments.
Ex: Executable. If this bit is not set, the CPU will refuse to execute code located in this segment. Should be set for code segments and cleared (for security) on data/stack segments. Note that code segments forbid writing, so to load a program into memory, you have to first set the segment to non-executable, then load it, then set it to executable only after the load is complete.
DC: “Direction/Conforming”. Has a different meaning depending on whether this is a code segment (Ex = 1) or data segment (Ex = 0):
- For data segments, specifies whether the base of the segment is the beginning (0) or end, and correspondingly, whether the limit should be interpreted as positive or negative.
- For code segments, 0 means that code in this segment can only be run from code at the same privilege level, while 1 (“conforming”) means that it can also be run from less-privileged code.
RW: Has a different meaning for code and data segments:
- For code segments, set this bit to allow reading from the segment. Write access is never allowed for code segments.
- For data segments, set this bit to allow writing to the segment (reading from data segments is always allowed).
Ac: Accessed. Set by the CPU to 1 whenever the segment is accessed.
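
Putting the bit layout together, the following C sketch packs a base, limit, access byte and flags nibble into one 64-bit table entry. The function name make_descriptor is hypothetical, and the example values in the note below (access bytes 0x9a and 0x92, flags 0xc) are one common choice for “flat” 32-bit segments, not necessarily the segments we will end up defining.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the fields of one GDT entry into a 64-bit value, following the
   bit positions in the table above. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                uint8_t access, uint8_t flags) {
    uint64_t d = 0;
    assert(limit <= 0xfffffu);  /* limit is a 20-bit field */
    assert(flags <= 0xfu);      /* flags is a 4-bit field */
    d |= (uint64_t)(limit & 0xffffu);              /* bits 0-15:  limit 0:15  */
    d |= (uint64_t)(base & 0xffffu) << 16;         /* bits 16-31: base 0:15   */
    d |= (uint64_t)((base >> 16) & 0xffu) << 32;   /* bits 32-39: base 16:23  */
    d |= (uint64_t)access << 40;                   /* bits 40-47: access byte */
    d |= (uint64_t)((limit >> 16) & 0xfu) << 48;   /* bits 48-51: limit 16:19 */
    d |= (uint64_t)flags << 52;                    /* bits 52-55: flags       */
    d |= (uint64_t)((base >> 24) & 0xffu) << 56;   /* bits 56-63: base 24:31  */
    return d;
}
```

For instance, make_descriptor(0, 0xfffff, 0x9a, 0xc) describes a present, ring-0, readable code segment covering all 4GB (Pr = 1, S = 1, Ex = 1, RW = 1, G = 1, 32-bit), and the same call with access byte 0x92 describes the matching writable data segment. The resulting values can be placed in the table with dq.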

Configuring segment registers

We can now set up the segment registers with offsets into the GDT. The far jump has already loaded cs; the remaining segment registers (ds, es, fs, gs, ss) are loaded with ordinary mov instructions. The address of the GDT is stored in the gdtr register, loaded with lgdt. The value of gdtr stores both the size (minus 1) of the GDT, and its linear address, in the format size:address, where size is 16 bits and address is 32. The address is linear, meaning that if paging is enabled, it will be translated through the page table to get a physical address. (Thus, x86 in 32-bit mode supports segmentation on top of paging.)

Printing to the screen

We now only have one choice as to how to print to the screen: the BIOS interrupts won’t work in 32-bit mode, so we have to write directly to video memory (hence, the extra segment pointing to video RAM). Later, we’ll see that it’s possible to temporarily switch back to real-mode, provided we are careful about how we set things up.
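
As a sketch of what writing directly to video memory involves: in 80x25 text mode, each screen cell is a 16-bit word whose low byte is the character and whose high byte is the colour attribute. The C function below (vga_puts is a hypothetical name) takes the start of video memory as a parameter; on the real machine this would be address 0xb8000, but passing it in lets us try the code against an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VGA_COLS = 80, VGA_ROWS = 25 };

/* Write the string `s` starting at (row, col), stopping at the end of the
   screen. Returns the number of characters written. */
static size_t vga_puts(volatile uint16_t *vram, unsigned row, unsigned col,
                       const char *s, uint8_t attr) {
    size_t pos, n = 0;
    assert(row < VGA_ROWS && col < VGA_COLS);
    pos = (size_t)row * VGA_COLS + col;
    while (s[n] != '\0' && pos < (size_t)VGA_ROWS * VGA_COLS) {
        /* Low byte = character code, high byte = colour attribute. */
        vram[pos++] = (uint16_t)(((unsigned)attr << 8) | (uint8_t)s[n]);
        n++;
    }
    return n;
}
```

Attribute 0x07 is light grey on black, the usual default. Unlike the BIOS routine, there is no cursor to move here: the position on screen is determined entirely by the index into video memory.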

Setting up 32-bit interrupts

After the switch to 32-bit mode, interrupts are still disabled, because the interrupt handlers in the IVT cannot run in 32-bit mode. Instead, before we can re-enable interrupts, we have to create an interrupt descriptor table, the 32-bit analogue to the IVT. Unlike the IVT, which is partly (mostly) setup by the system itself, the IDT is totally under our control, and only needs to contain handlers for hardware interrupts and exceptions; there’s no need for it to handle software interrupts (like BIOS calls) unless we want to. (Handling software interrupts is how a )

Unlike the IVT, which is hard-coded to be located at address 0x0 with a limit (size) of 0x3ff, the location and size of the IDT are chosen by us and stored in the idtr register.

Each entry of the IDT is called a gate, telling where the interrupt service routines (handlers) are located. We have to set up the GDT first, because the addresses used in the IDT will be translated through the GDT.

The idtr register defines where the IDT is located in (linear) memory, and its size in bytes, minus 1. The low 16 bits contain the size, while the high 32 bits define the base address. The first entry of the table is for interrupt 0. The idtr register is loaded by using the lidt instruction, which takes the address of a value in memory with the above format (size:address). (If paging is enabled, the address stored in idtr is translated through the page table as well!)

In 32-bit mode, each gate is 64-bits wide, structured as

struct idt_gate {
    unsigned short offset_low;  // Low 16 bits of offset
    unsigned short segment;     // Code segment selector (into GDT)
    unsigned char  reserved;    // Reserved, must be 0
    unsigned char  type_attr;   // Type and attribute flags
    unsigned short offset_high; // High 16-bits of offset
};

The handler is located in the segment indicated by segment in the GDT. The offset into this segment is specified by the combination of offset_high:offset_low. The type_attr member is broken up as

7	6,5	4	3,2,1,0
P	DPL	S	Gate type

P should be set to 0 for gates that are unused (unused interrupts)
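DPL sets the least-privileged level from which the gate may be invoked by a software interrupt (the int instruction). The processor will prevent unprivileged (user) code from invoking gates with a more-privileged DPL. Hardware interrupts and exceptions ignore this field. (The privilege level at which the handler itself runs comes from its code segment.)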
S should be set to 0 for interrupt- and trap-type gates.
Type is one of the following bit patterns:

0101	32-bit task gate
0110	16-bit interrupt gate
0111	16-bit trap gate
1110	32-bit interrupt gate
1111	32-bit trap gate

Task gates ignore the handler given in offset, and instead just cause an immediate task switch.

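Interrupt gates are typically used for hardware interrupts. For an interrupt gate, the CPU will automatically disable interrupts (clear IF) before calling the handler; the iret at the end of the handler restores the saved flags, which re-enables them.

Trap gates work the same way, except that the CPU leaves IF unchanged, so further interrupts can arrive while the handler runs. They are typically used for CPU exceptions and for software interrupts (services called by user code).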
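
To tie the gate layout together, here is a C sketch that fills in one gate. It reuses the idt_gate structure from above with fixed-width types; make_type_attr and make_gate are hypothetical helper names, and the handler address in the note is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct idt_gate {
    uint16_t offset_low;   /* Low 16 bits of offset */
    uint16_t segment;      /* Code segment selector (into GDT) */
    uint8_t  reserved;     /* Reserved, must be 0 */
    uint8_t  type_attr;    /* Type and attribute flags */
    uint16_t offset_high;  /* High 16 bits of offset */
};

_Static_assert(sizeof(struct idt_gate) == 8, "each gate is 64 bits");

/* type_attr = P (bit 7) | DPL (bits 6-5) | S (bit 4, left at 0) | type (bits 3-0) */
static uint8_t make_type_attr(unsigned present, unsigned dpl, unsigned gate_type) {
    assert(present < 2 && dpl < 4 && gate_type < 16);
    return (uint8_t)((present << 7) | (dpl << 5) | gate_type);
}

/* Build a gate for a handler at offset `handler` within segment `selector`. */
static struct idt_gate make_gate(uint32_t handler, uint16_t selector,
                                 uint8_t type_attr) {
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xffffu);
    g.segment     = selector;
    g.reserved    = 0;
    g.type_attr   = type_attr;
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

A present, ring-0, 32-bit interrupt gate has type_attr = make_type_attr(1, 0, 0xe) = 0x8e. A gate for a handler at, say, offset 0x00107e40 in code segment 0x08 would then be make_gate(0x00107e40, 0x08, 0x8e).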

The IDT is installed by calling the lidt instruction, passing it the address in memory of the size:address value described above.

(64-bit mode uses an extension of the IDT which allows for 64-bit addresses.)

Setting up paging

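Segmentation is, as mentioned, a somewhat outdated memory management method. We’d prefer to use paging, for its additional flexibility. In order to use paging, we’ll have to set up the page table, and then enable it. Remember also that segmentation is applied before paging: segment translation produces a linear address, which the page table then maps to a physical address. Segmentation cannot be switched off in protected mode, so to effectively disable it we’ll use “flat” segments with base 0 that cover the entire address space.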

Switching back to real mode

There are some services which are only available in BIOS, thus it will be very useful to have routines which can switch back to real mode, and then return to protected mode after they are done. There are several options for doing this:

We can actually switch back to real mode. We may need to setup our memory layout for the first 1MB so that everything we need is there.
Switch to “unreal mode”. Unreal mode is a variant of real mode with a few flags set which allow it to access more than its normal memory limit. This means we don’t need to rearrange things quite as much.
Switch to virtual 8086 mode. In virtual 8086 mode the processor pretends to be in 16-bit mode, using a set of segments/interrupts we configure. This was intended to allow 16-bit programs to run unchanged in 32-bit mode, but still allow the OS to have control over them, but we can use it to run BIOS functions without too much work.

Virtual 8086 mode is interesting, but it’s really intended for 32/64-bit operating systems to run 16-bit programs; it’s not intended for the OS itself to use to talk to BIOS, so it’s rather complex to use it in that way. Instead, we’ll use the first method, switching to real mode and then back.

The basic steps are:

Disable (32-bit) interrupts (cli). It can be useful to also disable the non-maskable interrupts (those not controlled by the IF flag), but this is optional. We only have to do this if we installed a 32-bit IDT; if we left the interrupts disabled after switching to protected mode, they will obviously still be disabled.
Disable paging if in use. We don’t have to throw away all our paging, we just have to make sure that the code we will be running in real mode makes sense: that it is running on a page that maps directly to physical addresses (no translation), and that the GDT (32-bit segment table) and IDT (32-bit interrupt table) are also in pages which do no translation.
If the GDT contains only 32-bit segments, create a new GDT that is 16-bit compatible (16-bit code and data segments with limits of at most 64KB).
Far-jump to 16-bit protected mode (this is 16-bit mode but with protection still enabled). We do a far jump, again, to flush the pipeline.
Setup segment registers (according to the 16-bit segment table).
Set up the IDT for real mode. Unless you’ve changed something, the original system IVT should still be located at address 0x0, with limit (size) 0x3ff. These can be loaded using the lidt instruction.
Disable protected mode (PE bit in CR0)
Far-jump to real-mode (again, to flush pipeline)
Reload segment registers
Setup stack pointer
Re-enable interrupts (sti and NMIs, if disabled)

Whew! Of course, to get back into 32-bit protected mode, we have to reload our GDT, IDT, and 32-bit segments, set the PE bit, and then far-jump to 32-bit mode. A 32-16-32 transition is slow, which makes it something we want to avoid if possible. The only things we absolutely need it for are switching video modes and a few other BIOS functions. Most BIOS services can be replicated by our own code in 32-bit mode, and later in 64-bit mode.

Interrupt handlers and the PIC

There are 256 possible interrupts that can be fired. Which of these do we need to write handlers for, and how do we write them?

Writing an interrupt handler is fairly easy: save any registers the handler uses and restore them before returning, and end it with the iret instruction rather than the normal ret instruction.

The part of the system which interprets hardware interrupts is called the Programmable Interrupt Controller (PIC). It essentially filters/remaps interrupts as they are received, before running the actual interrupt handler. To add to the complexity of the system, there are actually two PIC chips in the system; one was not enough, but rather than redesign the chip, they just chained two of them together. One is called the master PIC and the other the slave.

Communication with both PICs is done via a pair of ports: one for commands and one for data:

Master Cmd	0x20
Master Data	0x21
Slave Cmd	0xA0
Slave Data	0xA1

Note that between the two PICs, there are a total of 15 possible hardware interrupts. (Each PIC provides 8, and one is used for communication between the two PICs.) One of the most basic operations of the PIC is to determine the mapping from hardware interrupt numbers (0-15) to system interrupt numbers (i.e., entries in the IVT or IDT). (Internally, line 2 of the master is used for inter-PIC communication, so a device that was historically wired to hardware interrupt 2 is routed to interrupt 9 instead: if you receive interrupt 9, it may originally have been 2.)

The master PIC is responsible for interrupts 0-7, while the slave is responsible for 8-15. Each PIC has a vector offset which is added to the (hardware) interrupt number to get the (IVT or IDT) interrupt index seen by the CPU. (For the slave, the offset is added to the interrupt number minus 8.) This means that the first 8, and second 7 hardware interrupts can be mapped to different parts of the table. Both vector offsets must be multiples of 8.

In 16-bit mode, the default mapping is to map hardware interrupts 0-7 to system interrupts 8 - 0xf, and hardware 8-15 to 0x70 - 0x77. In 32-bit mode, the first 32 system interrupts are reserved for CPU exceptions, so at a minimum we have to remap hardware interrupts 0-7 to a different part of the table.
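
The mapping from hardware interrupt number to system interrupt number is simple enough to write down as a C sketch. The function name irq_to_vector is hypothetical; the offsets 0x20 and 0x28 used in the note below are a common choice for protected mode, not something we have fixed yet.

```c
#include <assert.h>

/* System interrupt number for hardware interrupt `irq` (0-15), given the
   vector offsets programmed into the master and slave PICs. */
static unsigned irq_to_vector(unsigned irq, unsigned master_offset,
                              unsigned slave_offset) {
    assert(irq < 16);
    assert(master_offset % 8 == 0 && slave_offset % 8 == 0);
    return irq < 8 ? master_offset + irq
                   : slave_offset + (irq - 8);
}
```

With the real-mode defaults (0x08 and 0x70), hardware interrupt 1 arrives as system interrupt 9, which collides with a reserved entry in 32-bit mode. With offsets 0x20 and 0x28, the sixteen hardware interrupts occupy 0x20 - 0x2f instead, just above the reserved range.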

In order to remap the PIC, you have to reinitialize it from scratch, essentially “rebooting” that component of the system. Thus, remapping the PIC is a rather complex procedure.

When we start protected mode, we have to reinitialize the PICs by sending the initialize command, 0x11. After this command, we send three initialization words, telling the PIC

Its vector offset
How the master/slave connection is setup
Some additional information

See here for the details.
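
A sketch of the initialization sequence in C, assuming the conventional 8259 values for the second and third words (0x04 and 0x02 for the master/slave wiring, 0x01 to select 8086 mode). The outb function here is only a stand-in that records each port write so the sequence can be inspected; in a real kernel it would be the out instruction. The names pic_remap, pic_log and port_write are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PIC1_CMD = 0x20, PIC1_DATA = 0x21, PIC2_CMD = 0xa0, PIC2_DATA = 0xa1 };

/* Stand-in for the `out` instruction: logs (port, value) pairs. */
struct port_write { uint16_t port; uint8_t value; };
static struct port_write pic_log[16];
static unsigned pic_log_len = 0;

static void outb(uint16_t port, uint8_t value) {
    assert(pic_log_len < 16);
    pic_log[pic_log_len].port  = port;
    pic_log[pic_log_len].value = value;
    pic_log_len++;
}

/* Reinitialize both PICs with new vector offsets. */
static void pic_remap(uint8_t master_offset, uint8_t slave_offset) {
    assert(master_offset % 8 == 0 && slave_offset % 8 == 0);
    outb(PIC1_CMD,  0x11);           /* Initialize command, master */
    outb(PIC2_CMD,  0x11);           /* Initialize command, slave */
    outb(PIC1_DATA, master_offset);  /* Word 1: vector offset */
    outb(PIC2_DATA, slave_offset);
    outb(PIC1_DATA, 0x04);           /* Word 2: slave attached to master line 2 (bit mask) */
    outb(PIC2_DATA, 0x02);           /*         slave's own identity: 2 */
    outb(PIC1_DATA, 0x01);           /* Word 3: 8086 mode */
    outb(PIC2_DATA, 0x01);
}
```

Calling pic_remap(0x20, 0x28) moves the hardware interrupts to 0x20 - 0x2f. A fuller version would probably also save the interrupt masks beforehand and restore them afterwards, and may need a short delay between writes on old hardware; both are left out of this sketch.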

When an interrupt routine ends, before it calls iret, it needs to signal to the PIC that it is finished. This is done by issuing command PIC_EOI, end-of-interrupt, 0x20. This command must be sent to the PIC (master, slave) which originated the interrupt. Because the slave is chained through the master, an interrupt from the slave must be acknowledged on both. So depending on which PIC it came from, we either do

mov al, 0x20     ; PIC_EOI
out 0x20, al     ; Signal master

or

mov al, 0x20     ; PIC_EOI
out 0xa0, al     ; Signal slave
out 0x20, al     ; Signal master as well
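
The choice between the two cases can be written as a C sketch, using the command ports from the table above (0x20 for the master, 0xa0 for the slave). The function name eoi_ports is hypothetical. It only decides which ports need the PIC_EOI write; the writes themselves would be done with out. The sketch assumes the usual convention that an interrupt from the slave is acknowledged on the slave and then on the master, since the slave's interrupts pass through the master.

```c
#include <assert.h>
#include <stdint.h>

enum { PIC_EOI = 0x20, PIC1_CMD = 0x20, PIC2_CMD = 0xa0 };

/* Lists the command ports that must receive PIC_EOI after handling hardware
   interrupt `irq` (0-15). Returns the number of ports stored in `ports`. */
static unsigned eoi_ports(unsigned irq, uint16_t ports[2]) {
    unsigned n = 0;
    assert(irq < 16);
    if (irq >= 8)
        ports[n++] = PIC2_CMD;  /* From the slave: acknowledge it first */
    ports[n++] = PIC1_CMD;      /* The master is involved either way */
    return n;
}
```

For hardware interrupt 1 this yields just port 0x20; for hardware interrupt 12 it yields 0xa0 followed by 0x20.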

Masking interrupts

The PIC has the ability to mask (temporarily ignore) certain interrupts. Each PIC has a mask register which is 8 bits wide. Each bit of the mask corresponds to one of the interrupt lines connected to that PIC. If a bit is set, then the PIC will ignore any signals on the corresponding line; if it is unset, the corresponding line functions normally.

Masking is done via the data port: read a byte from the data port to get the current mask, and then write the (modified) value back to the data port to set the mask.
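
The read-modify-write described above can be sketched in C as follows. The function names mask_port, mask_irq and unmask_irq are hypothetical; they compute which data port to use and what value to write back, leaving the actual in and out instructions to the caller.

```c
#include <assert.h>
#include <stdint.h>

enum { PIC1_DATA = 0x21, PIC2_DATA = 0xa1 };

/* Data port of the PIC that owns hardware interrupt `irq` (0-15). */
static uint16_t mask_port(unsigned irq) {
    assert(irq < 16);
    return irq < 8 ? PIC1_DATA : PIC2_DATA;
}

/* New mask value that ignores `irq`, given the current mask of its PIC. */
static uint8_t mask_irq(uint8_t current, unsigned irq) {
    assert(irq < 16);
    return (uint8_t)(current | (1u << (irq % 8)));
}

/* New mask value that lets `irq` through again. */
static uint8_t unmask_irq(uint8_t current, unsigned irq) {
    assert(irq < 16);
    return (uint8_t)(current & ~(1u << (irq % 8)));
}
```

To mask hardware interrupt 10, for example, we would read a byte from port mask_port(10) = 0xa1, pass it through mask_irq (which sets bit 2), and write the result back to the same port.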
