Review of last time

Getting into 32-bit protected mode

Getting into protected mode is the first step to getting into 64-bit “long” mode. The steps for doing so are:

Disable interrupts. We don’t want an interrupt to fire while we are changing the system mode, as the interrupt handler won’t work correctly. The cli instruction will do this.
Enable the A20 line, to allow for the larger address space. (Remember that in 16-bit mode, we can only access a 20-bit address space.) There are several ways of doing this, of varying levels of safety and complexity. We choose to use the “fast A20” port, which works on the emulator we’re using, so this is just
```
  in al, 0x92
  or al, 2       ; Set bit 1
  out 0x92, al
```
Load the Global Descriptor Table (GDT) with segment descriptors. In 32-bit mode, the values in the segment registers are not used directly as segment addresses; instead they are indexes into a table, the GDT, where each entry contains information about that segment.

This is the most complex part of the process, as the table entries have a lot of information about the segments in them.
Switch to 32-bit mode, by setting the low bit of register CR0. This is just
```
  mov eax, cr0
  or eax, 1
  mov cr0, eax
```
but afterwards we want to do a far jump (“long jump”) to flush the pipeline and load CS with the new code selector:
```
  jmp 0x8:protected_mode

  bits 32
  protected_mode:
  ...
```
Selector 0x8 refers to the code segment we configured in the GDT.
Install a 32-bit compatible interrupt descriptor table (IDT).
Re-enable interrupts via sti.

Disable interrupts

The normal interrupt handlers installed by the BIOS will only work in 16-bit mode. We don’t want them accidentally firing in the middle of the switching process, so we disable them. (Note that even if nothing “obvious” happens to fire an interrupt, such as keyboard input, interrupts can still be fired. E.g., there is a timer interrupt which fires about 18.2 times every second.)

Interrupts are controlled by the IF flag in the flags register, so to disable them we just clear this flag:

cli

At the end, after we’ve installed our own 32-bit interrupt handlers, we can re-enable them by setting the flag with

sti

Install GDT

Recall that in 32-bit mode, the segment registers do not contain actual addresses, but indexes into a table of segment descriptors. This table is called the global descriptor table and must be built, by us, to describe the segments available.

Theoretically this is just a matter of writing a GDT in our (pseudo) data section, copying it into memory in a suitable location, and then using the lgdt instruction to load it into the CPU. However, the table’s entries have quite a complex structure which will require some understanding. The basic decisions we need to make for each segment are

Starting address (base) and size (limit)
Type: code or data
For data, read-only or read-write?

The GDT is an array of segment descriptors, where each segment descriptor should have the following form:

struct seg_desc {
    unsigned short limit;           // Segment size (low 16 bits)
    unsigned short base_low;        // Low 16 bits of segment base address
    unsigned char  base_mid;        // Middle 8 bits of seg. base
    unsigned char  type : 5;        // Segment type, attributes
    unsigned char  priv : 2;        // Privilege level
    unsigned char  present : 1;     // Is segment present?
    unsigned char  limit_high : 4;  // High 4 bits of seg. size
    unsigned char  attr : 3;        // More attributes
    unsigned char  granularity : 1; // Affects segment size
    unsigned char  base_high;       // High 8 bits of segment base
};

63-56	55	54-52	51-48	47	46-45	44-40	39-32	31-16	15-0
Base 24:31	Gran	Attr	Limit 16:19	Present	Priv	Type	Base 16:23	Base 0:15	Limit 0:15

Note that segment bases have 32 bits, spread out over the fields base_low, base_mid, and base_high. Segment limits (sizes) have 20 bits, spread out over limit and limit_high, meaning that the largest segment is 1MB, but this is affected by the granularity bit. If the granularity bit is set, then the limit is interpreted not as a number of bytes but as a number of 4KB units, which allows for a max segment size of 2^20 * 4KB = 4GB, which is the total amount of memory addressable in 32-bit mode anyway. (Strictly, the CPU treats the limit as the highest valid offset, i.e. the size minus one, so the round values used below make each segment one unit larger than stated.)

The entire structure is 64 bits, or one qword.
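
The limit and granularity rules above can be checked with a few lines of arithmetic. The following C sketch is not part of the bootloader; the helper name seg_size_bytes is an assumption made for illustration, and it treats the limit field as the highest valid offset (so the size is the limit plus one unit).

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a segment, given its 20-bit limit field and the
 * granularity bit. seg_size_bytes is a hypothetical helper name.
 * The limit is the highest valid offset, counted in bytes (G = 0)
 * or in 4KB units (G = 1), so the size is one unit more. */
static uint64_t seg_size_bytes(uint32_t limit, int granularity)
{
    assert(limit <= 0xFFFFF);              /* only 20 bits are stored */
    uint64_t units = (uint64_t)limit + 1;
    return granularity ? units * 4096 : units;
}
```

With the largest possible limit, 0xFFFFF, this gives 1MB without the granularity bit and 4GB with it.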

Entry 0 in the table is reserved for the null segment; if you try to use the null segment, a processor exception will occur, so we load the first entry of the table with 0:

gdt:
    dq 0    ; Null segment

We will define four entries above this in the segment table:

A segment for code, starting at 0x7c00 and extending to 0x7FFFF (size 0x78400 bytes). This is the normal place where our first and second stages will be loaded.
A segment for data, starting at 0x100000 and extending to 0xEFFFFF (about 14MB or 0xE00000 bytes). This is located in the “extended memory”, above 1MB, and hence is only accessible after we have enabled the A20 line. Note that “global” data will still also be located inside the code segment, because it gets loaded with the rest of the bootloader.

As this segment is larger than the normal segment max size, it will have to use the granularity bit. Thus, the limit will be 0xE00000 / 4096 = 0xE00.
A segment for the stack, overlapping the previous. We’ll set things up so that the stack grows backwards, from the end of the segment.
A segment for direct access to video memory, starting at address 0xb8000 extending to 0xBFFFF with a size of 32KB (0x8000).

Once we have the GDT set up in memory, we need to load it into the CPU, using the lgdt instruction. Its operand, the GDT descriptor, tells the CPU both the address at which the GDT exists and how large it is. The low 16 bits give the size of the GDT (in bytes, not in entries!), while the high 32 bits give its linear address in memory.

gdt:
    dq 0                ; Segment 0

    ; Segment 1 -- Code
    dw 0x8400           ; Low 16 bits of limit (total 0x78400)
    dw 0x7c00           ; Low 16 bits of base
    db 0                ; Middle 8 bits of base
    db 0b10011010       ; Present (1 bit), Priv (2), S,   Ex, Dc, Rw, Ac
    db 0b01000111       ; Gran (1), 16/32 (1), 0s (2),   Limit high (4)
    db 0                ; High 8 bits of base

    ; Segment 2 -- Data
    dw 0x0E00
    dw 0x0000
    db 0x10
    db 0b10010010
    db 0b11000000
    db 0 

    ; Segment 3 -- Stack (identical to previous)
    dw 0x0E00
    dw 0x0000
    db 0x10
    db 0b10010010
    db 0b11000000
    db 0 

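    ; Segment 4 -- Video
    dw 0x8000           ; Low 16 bits of limit
    dw 0x8000           ; Low 16 bits of base (0xb8000)
    db 0x0b             ; Middle 8 bits of base
    db 0b10010010       ; Present, Priv 0, S = 1, read-write data
    db 0b01000000       ; Byte granularity, 32-bit, Limit high = 0
    db 0                ; High 8 bits of base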

gdt_limit:  equ     $ - gdt
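
The bytes in the listing above can be cross-checked by packing a descriptor from its base, limit, and attribute bits. This is a minimal C sketch for checking the layout, not code that runs in the bootloader; the helper name make_seg_desc and its parameter order are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a segment descriptor into its 64-bit form.
 *   base   : 32-bit segment base
 *   limit  : 20-bit segment limit
 *   access : bits 47-40 (Present, Priv, Type)
 *   flags  : bits 55-52 (Gran plus the three attribute bits)
 * make_seg_desc is a hypothetical helper, not from the notes. */
static uint64_t make_seg_desc(uint32_t base, uint32_t limit,
                              uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFF);   /* limit has only 20 bits */
    assert(flags <= 0xF);       /* four flag bits */

    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);               /* bits 15-0  */
    d |= (uint64_t)(base & 0xFFFF) << 16;          /* bits 31-16 */
    d |= (uint64_t)((base >> 16) & 0xFF) << 32;    /* bits 39-32 */
    d |= (uint64_t)access << 40;                   /* bits 47-40 */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;    /* bits 51-48 */
    d |= (uint64_t)flags << 52;                    /* bits 55-52 */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;    /* bits 63-56 */
    return d;
}
```

Calling make_seg_desc(0x7c00, 0x78400, 0x9A, 0x4) yields 0x00479A007C008400, which, read from the low byte upwards, is the same byte sequence as the code segment entry.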

We can load this directly into the GDTR register by using a combination of gdt and gdt_limit. The lgdt instruction expects the address of a GDTR structure, which consists of a word giving the GDT limit, followed by a dword giving the base (linear) address of the GDT itself. The limit is defined as the offset of the last valid byte, so it is the size of the GDT in bytes minus one. We can allocate this structure immediately after the GDT:

gdtr_struct:
    dw      gdt_limit - 1   ; Size of GDT in bytes, minus one
    dd      gdt             ; Linear address of GDT

and then load it with

mov ax, 0
mov ds, ax          ; Make address of gdtr linear
lgdt [gdtr_struct]  ; Operand is the memory holding limit and base
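
The 6-byte structure read by lgdt can be modelled in C to make its layout explicit. This is only a sketch: make_gdtr is a hypothetical helper, and it stores the limit as the table size minus one, which is how the CPU defines the field.

```c
#include <assert.h>
#include <stdint.h>

/* Serialize the operand of lgdt: a 16-bit limit followed by a 32-bit
 * linear base address, both little-endian. make_gdtr is a hypothetical
 * helper name used only for this illustration. */
static void make_gdtr(uint8_t out[6], uint32_t table_base,
                      uint32_t table_bytes)
{
    assert(table_bytes >= 1 && table_bytes <= 65536);
    uint16_t limit = (uint16_t)(table_bytes - 1);  /* last valid byte */

    out[0] = (uint8_t)(limit & 0xFF);
    out[1] = (uint8_t)(limit >> 8);
    out[2] = (uint8_t)(table_base & 0xFF);
    out[3] = (uint8_t)((table_base >> 8) & 0xFF);
    out[4] = (uint8_t)((table_base >> 16) & 0xFF);
    out[5] = (uint8_t)((table_base >> 24) & 0xFF);
}
```

For a table of five 8-byte descriptors (40 bytes), the limit stored is 39, or 0x27.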

If we were writing a kernel capable of running multiple processes, we’d want to set up some additional segments:

A task state segment (TSS) stores information about each task (process or thread), used by the CPU during task switches.
Local descriptor table entries; each task can have its own set of segments, which it sees as if they were the entire GDT. This can be used to give each task a separate set of segments (its own code, data, stack, etc.).

Finally, we can set up the segment registers. Note that the values in the segment registers are not strictly indexes into the GDT, but rather byte offsets. Thus, the segment numbers above should be multiplied by 8, the size of a single descriptor.

We reload the data, stack, and extra segment registers. CS is not loaded here: mov cannot write to CS, so it is set by the far jump that follows.

mov ax, 2 * 8
mov ds, ax          ; Data segment
mov ax, 3 * 8
mov ss, ax          ; Stack segment
mov ax, 4 * 8
mov es, ax          ; Video segment

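And then far-jump into the “new” code segment, which loads CS with the code selector (the offset after the colon is interpreted relative to that segment’s base):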

jmp 0x8:protected_mode

bits 32
protected_mode:
...
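
The relationship between a GDT entry number and the value placed in a segment register can be written out as a small function. This is an illustrative C sketch; make_selector is a hypothetical name. Besides the index, a selector carries a table-indicator bit (0 for the GDT) and a 2-bit requested privilege level in its low three bits, which is why the index ends up multiplied by 8.

```c
#include <assert.h>
#include <stdint.h>

/* Build a segment selector from a descriptor index.
 *   index : entry number in the descriptor table
 *   ti    : table indicator, 0 = GDT, 1 = LDT
 *   rpl   : requested privilege level, 0-3
 * make_selector is a hypothetical helper name. */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

With ti and rpl both 0, entry 1 gives 0x08, entry 2 gives 0x10, and entry 4 gives 0x20.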

Install 32-bit mode IDT

We cannot reenable interrupts until we have a set of 32-bit-compatible interrupt handlers, and a table pointing to them. We have to actually write the interrupt handlers (although many of these can just be do-nothing routines), and then fill in a table to point to them.

The interrupt descriptors have the form:

63-48	47-40	39-32	31-16	15-0
High offset	Type/Attr	Zero	Segment	Low offset

The location of each handler is specified as a segment selector (i.e., offset into GDT) and then an offset within the segment. The offset is 32 bits, but split up within the structure. All of our handlers will be within our code segment, so Segment will be 0x8, and then the addresses will just be relative to the start of that segment.

The Type/Attribute byte looks like

7	6-5	4	3-0
Present	Priv	S	Type

Present should be set to 1 for used interrupt descriptors.
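Priv gives the privilege level required to invoke the handler. It is encoded like the privilege levels in the segment descriptors: code that triggers the interrupt with an int instruction must be running at this privilege level or a more privileged one.
S should be set to 0 for interrupt and trap handlers.
Type will be one of 0b1110 (32-bit interrupt gate) or 0b1111 (32-bit trap gate). The two differ only in that an interrupt gate clears the IF flag on entry, so further maskable interrupts are held off while the handler runs, whereas a trap gate leaves IF unchanged. Interrupt gates are therefore the usual choice for hardware interrupts, and trap gates for software interrupts. The other gate types are for task-switch interrupts, or for 16-bit interrupts (the IDT can have both 32- and 16-bit handlers in it). Remember that software interrupts are routines that client code will call, whereas hardware interrupts are generated by the system itself.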
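
The descriptor layout and the Type/Attribute byte described above can be combined into a short packing routine. This is a C sketch for checking the bit positions, not bootloader code; the names gate_type_attr and make_gate are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Build the Type/Attribute byte: Present in bit 7, Priv in bits 6-5,
 * S (bit 4) left at 0, gate type in bits 3-0.
 * Both function names here are hypothetical. */
static uint8_t gate_type_attr(unsigned present, unsigned priv,
                              unsigned type)
{
    assert(present <= 1 && priv <= 3 && type <= 0xF);
    return (uint8_t)((present << 7) | (priv << 5) | type);
}

/* Pack a 32-bit interrupt descriptor into its 64-bit form. */
static uint64_t make_gate(uint32_t offset, uint16_t selector,
                          uint8_t type_attr)
{
    uint64_t g = 0;
    g |= (uint64_t)(offset & 0xFFFF);          /* bits 15-0: low offset   */
    g |= (uint64_t)selector << 16;             /* bits 31-16: segment     */
                                               /* bits 39-32: zero        */
    g |= (uint64_t)type_attr << 40;            /* bits 47-40: type/attr   */
    g |= (uint64_t)(offset >> 16) << 48;       /* bits 63-48: high offset */
    return g;
}
```

A present gate with privilege 0 and type 0b1110 has the attribute byte 0x8E; a handler at offset 0x12345 in segment 0x8 then packs to 0x00018E0000082345.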

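As with lgdt, the lidt instruction expects the address of an IDTR structure in memory, holding the limit (the size of the table in bytes, minus one) and the base address. The table is labelled idt here, because lidt itself is an instruction mnemonic:

idt:
    ...
idt_size:   equ     $ - idt

idtr_struct:
    dw idt_size - 1 ; Limit (size in bytes, minus one)
    dd idt          ; Base address

which we then load with

lidt [idtr_struct]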

Interrupt handlers and the PIC

There are 256 possible interrupts that can be fired. Which of these do we need to write handlers for, and how do we write them?

Writing an interrupt handler is fairly easy: just end it with the iret instruction rather than the normal ret instruction.

The part of the system which interprets hardware interrupts is called the Programmable Interrupt Controller (PIC). It essentially filters/remaps interrupts as they are received, before running the actual interrupt handler. To add to the complexity of the system, there are actually two PIC chips in the system; one was not enough, but rather than redesign the chip, they just chained two of them together. One is called the master PIC and the other the slave.

Communication with each PIC is done via a pair of ports: one for commands and one for data:

Master Cmd	0x20
Master Data	0x21
Slave Cmd	0xA0
Slave Data	0xA1

Note that between the two PICs, there are a total of 15 possible hardware interrupts. (Each PIC provides 8, and one is used for communication between the two PICs.) One of the most basic operations of the PIC is to determine the mapping from hardware interrupt numbers (0-15) to system interrupt numbers (i.e., entries in the IDT). (Internally, interrupt 2 is used for inter-PIC communication but the PIC normally remaps hardware interrupt 9 to 2, so if you receive interrupt 2, it was originally 9.)

The master PIC is responsible for interrupts 0-7, while the slave is responsible for 8-15. Each PIC has a vector offset which is added to the (hardware) interrupt number to get the (IDT) interrupt index seen by the CPU. This means that the first 8, and second 7 hardware interrupts can be mapped to different parts of the IDT. Both vector offsets must be multiples of 8.

In 16-bit mode, the default mapping is to map hardware interrupts 0-7 to system interrupts 8 - 0xf, and hardware 8-15 to 0x70 - 0x77. In 32-bit mode, the first 32 system interrupts are reserved for CPU exceptions, so at a minimum we have to remap hardware interrupts 0-7 to a different part of the table.
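
The vector-offset arithmetic above is simple enough to express directly. In this C sketch the function name irq_to_vector is hypothetical, and the remapped offsets 0x20 and 0x28 used in the note below are a common choice rather than something these notes prescribe.

```c
#include <assert.h>

/* Map a hardware interrupt number (0-15) to the interrupt index seen
 * by the CPU, given the vector offsets of the two PICs.
 * irq_to_vector is a hypothetical helper name. */
static unsigned irq_to_vector(unsigned irq, unsigned master_offset,
                              unsigned slave_offset)
{
    assert(irq < 16);
    assert(master_offset % 8 == 0 && slave_offset % 8 == 0);
    return irq < 8 ? master_offset + irq
                   : slave_offset + (irq - 8);
}
```

With the 16-bit defaults (8 and 0x70), hardware interrupt 0 lands on index 8 and hardware interrupt 15 on 0x77. With offsets 0x20 and 0x28, all sixteen land on 0x20 through 0x2f, just above the 32 reserved entries.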

In order to remap the PIC, you have to reinitialize it from scratch, essentially “rebooting” that component of the system. Thus, remapping the PIC is a rather complex procedure.

When we start protected mode, we have to reinitialize the PICs by sending the initialize command, 0x11, to each command port. After this command, we send three initialization words to the data port, telling the PIC

Its vector offset
How the master/slave connection is set up
Some additional information

See here for the details.
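
The order of the port writes in the initialization can be sketched as follows. To keep the example runnable as an ordinary program, it records the writes in an array instead of performing them; in the bootloader each entry would be an out instruction. The names port_write and pic_remap_sequence are hypothetical, and the values for the second and third initialization words (0x04 and 0x02 for the cascade on line 2, 0x01 for 8086 mode) are the ones conventionally used on PC-compatible hardware, stated here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One recorded write: value sent to an I/O port. Hypothetical type. */
struct port_write { uint16_t port; uint8_t value; };

/* Fill out[] (at least 8 entries) with the writes that reinitialize
 * both PICs with new vector offsets. Returns the number of writes. */
static size_t pic_remap_sequence(struct port_write *out,
                                 uint8_t master_offset,
                                 uint8_t slave_offset)
{
    assert(master_offset % 8 == 0 && slave_offset % 8 == 0);
    size_t n = 0;
    out[n++] = (struct port_write){0x20, 0x11};          /* init master     */
    out[n++] = (struct port_write){0xA0, 0x11};          /* init slave      */
    out[n++] = (struct port_write){0x21, master_offset}; /* vector offset   */
    out[n++] = (struct port_write){0xA1, slave_offset};  /* vector offset   */
    out[n++] = (struct port_write){0x21, 0x04};          /* slave on line 2 */
    out[n++] = (struct port_write){0xA1, 0x02};          /* cascade id 2    */
    out[n++] = (struct port_write){0x21, 0x01};          /* 8086 mode       */
    out[n++] = (struct port_write){0xA1, 0x01};          /* 8086 mode       */
    return n;
}
```

The command byte goes to each command port, and the three words that follow go to the corresponding data port.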

When an interrupt routine ends, before it executes iret, it needs to signal to the PIC that it is finished. This is done by issuing command PIC_EOI, end-of-interrupt, 0x20. The command must reach the PIC which originated the interrupt. For an interrupt from the master, we do

mov al, 0x20     ; PIC_EOI
out 0x20, al     ; Signal master

For an interrupt from the slave, which reaches the CPU through the master, both PICs must be signalled:

mov al, 0x20     ; PIC_EOI
out 0xA0, al     ; Signal slave
out 0x20, al     ; Signal master
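
The choice of which command ports receive the end-of-interrupt command can be written as a small decision function. This C sketch only computes the list of ports; eoi_ports is a hypothetical name. It assumes the usual cascade arrangement, in which an interrupt from the slave also passes through the master, so both chips need the command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill ports[] with the command ports that must receive PIC_EOI (0x20)
 * for the given hardware interrupt number, in the order to send them.
 * Returns how many ports were written. eoi_ports is a hypothetical name. */
static size_t eoi_ports(unsigned irq, uint16_t ports[2])
{
    assert(irq < 16);
    size_t n = 0;
    if (irq >= 8)
        ports[n++] = 0xA0;   /* slave command port */
    ports[n++] = 0x20;       /* master command port */
    return n;
}
```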

Masking interrupts

The PIC has the ability to mask (temporarily ignore) certain interrupts. Each PIC has a mask register which is 8 bits wide. Each bit of the mask corresponds to one of the interrupt lines connected to that PIC. If a bit is set, then the PIC will ignore any signals on the corresponding line; if it is unset, the corresponding line functions normally.

Masking is done via the data port: read a byte from the data port to get the current mask, and then write the (modified) value back to the data port to set the mask.
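
The read-modify-write on the mask can be illustrated with plain bit operations. This C sketch leaves out the actual port I/O; the three function names are hypothetical. The data port and the bit position both follow from the hardware interrupt number.

```c
#include <assert.h>
#include <stdint.h>

/* Data port of the PIC that owns a hardware interrupt (0-15). */
static uint16_t irq_mask_port(unsigned irq)
{
    assert(irq < 16);
    return irq < 8 ? 0x21 : 0xA1;
}

/* New mask value with the given interrupt line ignored. */
static uint8_t mask_irq(uint8_t current, unsigned irq)
{
    assert(irq < 16);
    return (uint8_t)(current | (1u << (irq % 8)));
}

/* New mask value with the given interrupt line enabled again. */
static uint8_t unmask_irq(uint8_t current, unsigned irq)
{
    assert(irq < 16);
    return (uint8_t)(current & ~(1u << (irq % 8)));
}
```

For example, masking hardware interrupt 1 on an empty mask gives 0x02, to be written to port 0x21.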

Multi-stage bootloader

;;;
;;; two-stage.s
;;; Illustrates a two-stage loader, where the first stage invokes the BIOS
;;; to load the second stage.
;;;

bits 16
org 0x7c00

start:
origin:     equ         0x7c00
blk_count:  equ         (end - loaded_code) / 512 + 1

; -----------------------------------------------------------------------------
; First stage loader

; Reset disk
mov ah, 0x0         ; Subfunction reset
mov dl, 0x80        ; Disk number
int 0x13

; Load blocks 
mov ah, 0x42        ; Int 0x13, subfunction Extended Read
mov dl, 0x80        ; Drive num
mov si, disk_packet ; Packet address
int 0x13

jmp loaded_code

; ----------------------------------------------------------------------------
; Begin "pseudo-data" section

string:         db      "Hello, world!"
strlen:         equ     $-string
screen_addr:    equ     0xb8000

align 2 
disk_packet:    db      0x10            ; Packet size
                db      0               ; Reserved
                dw      blk_count       ; Block count
                dd      loaded_code     ; Addr. to load
                dq      1               ; Starting block (64-bit LBA)

; Pad remainder with 0 bytes
times 510 - ($ - $$)    db 0

; Write boot signature at end
dw 0xaa55

; -----------------------------------------------------------------------------
; Begin second-stage loader

loaded_code:

; Set 80x25 text mode
mov ah, 0x0
mov al, 0x3
int 0x10

; Print text
mov si, 0          ; Memory index/cursor position
print:
    ; Print character
    mov ah, 0x0a    ; Subfunction = write char
    mov al, byte [si + string]
    mov bh, 0       ; Page = 0
    mov cx, 1       ; Write count = 1
    int 0x10

    ; Move cursor
    inc si
    mov ah, 0x02    ; Subfunction = set cursor pos.
    mov bh, 0       ; Page = 0
    mov dx, si      ; DL = cursor col (low byte of si)
    mov dh, 0       ; DH = cursor row = 0
    int 0x10

    cmp si, strlen
    jne print

; Infinite loop
forever: jmp forever

end:

; Pad so there's a good number of blocks used in the disk
times 1024 * 1024  db 0
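
The disk packet used by the first stage can be modelled in C to make its 16-byte layout explicit. This is a sketch for checking the layout only; make_disk_packet is a hypothetical name. It assumes the standard packet format for the Extended Read call, in which the load address is stored as a 16-bit offset followed by a 16-bit segment and the starting block is a 64-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Serialize the 16-byte packet passed to int 0x13, subfunction 0x42.
 * All multi-byte fields are little-endian.
 * make_disk_packet is a hypothetical helper name. */
static void make_disk_packet(uint8_t out[16], uint16_t blocks,
                             uint16_t offset, uint16_t segment,
                             uint64_t lba)
{
    out[0] = 0x10;                          /* packet size */
    out[1] = 0;                             /* reserved */
    out[2] = (uint8_t)(blocks & 0xFF);      /* block count */
    out[3] = (uint8_t)(blocks >> 8);
    out[4] = (uint8_t)(offset & 0xFF);      /* load address: offset */
    out[5] = (uint8_t)(offset >> 8);
    out[6] = (uint8_t)(segment & 0xFF);     /* load address: segment */
    out[7] = (uint8_t)(segment >> 8);
    for (int i = 0; i < 8; i++)             /* starting block */
        out[8 + i] = (uint8_t)(lba >> (8 * i));
}
```

In the listing, loaded_code sits directly after the 512-byte boot sector, at 0x7e00, so the packet carries offset 0x7e00, segment 0, and starting block 1.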