;;; 
;;; hello.s
;;; Prints "Hello, world!"
;;;

section .data

msg:            db      "Hello, world!", 10
MSGLEN:         equ     $-msg

section .text

;; Program code goes here

global _start
_start:

    mov     rax,    1               ; Syscall code in rax
    mov     rdi,    1               ; 1st arg, file desc. to write to
    mov     rsi,    msg             ; 2nd arg, addr. of message
    mov     rdx,    MSGLEN          ; 3rd arg, num. of chars to print
    syscall

    ;; Terminate process
    mov     rax,    60              ; Syscall code in rax
    mov     rdi,    0               ; First parameter in rdi
    syscall                         ; End process

This can be assembled and linked with

asm hello.s

or, manually, with

yasm -g dwarf2 -f elf64 hello.s -l hello.lst
ld -g -o hello hello.o

and then run via the usual

./hello

Today’s lecture will cover conversions between different number representations, the general-purpose registers available to us, and the additional special-purpose registers used by the system to maintain our program. We’ll hopefully end with our first in-class group project.

Operands

Each assembly instruction has a number of “operands”, inputs to the instruction. The biggest instructions have three operands, most have two or one, and some (like syscall) have none. Each operand can be one of the following things (with some restrictions which depend on the exact instruction).

A register name, like rax.
A constant, like 60 or msg. (msg, being a label, is still a constant: the assembler computes the address of the beginning of our string and writes the actual numeric value into the instruction.) Constant operands are called immediates in assembly-language terminology.

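Note that because the assembler can do arithmetic, an immediate can be something like 4 * MSGLEN + 1 (or msg + 1, the address one byte past the label), as the value of this is still known before the program runs. Labels are addresses, so the assembler will let you add to or subtract from them, but not multiply them. (What you can’t do is something like rax + 1, as the value of register rax is not known until the program is running.)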
A memory-direct lookup: [msg] gives the value in memory, at the address msg. I.e., this would give the first 8 bytes of our string, as a qword value.
A memory-indirect lookup: [rax] gives the value in memory, at the address stored in the rax register. There are several different forms of memory-indirect operands, which allow for accessing arrays and structures in natural ways. We’ll look at these later.

Normally when we use a memory operand we’ll give a size qualifier with it, like byte [msg] (first byte of the message) or qword [rax] (64-bits at address rax). The size is often optional, as it can be deduced from the other operands, but it’s good practice to put it in anyway, as it helps to catch size-mismatch mistakes.

Often, the memory operands are grouped together, as generally speaking, wherever memory-direct operands are allowed, memory-indirect are allowed also. Hence, we can describe the allowed operand types by a combination of R(egister), I(mmediate), and M(emory). If we say that a given operand is RM, that means it supports register and memory operands, but not immediates.
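As a rough illustration of the operand kinds described above, here is a small sketch that reuses msg and MSGLEN from hello.s. The register choices, and the use of the exit status to show a value, are assumptions made only for this example; it can be assembled and linked the same way as hello.s.

```nasm
section .data

msg:            db      "Hello, world!", 10
MSGLEN:         equ     $-msg

section .text

global _start
_start:

    mov     rdi,    0               ; immediate operand (a plain constant)
    mov     rbx,    msg             ; immediate operand (a label's address)
    mov     rcx,    MSGLEN - 1      ; immediate operand (assembly-time arithmetic)
    mov     rdx,    rcx             ; register operand
    mov     al,     byte [msg]      ; memory-direct: al = 'H'
    mov     dil,    byte [rbx]      ; memory-indirect: low byte of rdi = 'H'

    mov     rax,    60              ; exit, using the loaded byte as the status
    syscall
```

Running it and then typing echo $? in the shell should print 72, the character code of H.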

Most instructions have the following restrictions:

For two-operand instructions, both must be the same size. (There are specialized instructions for converting between different sized operands.) Some instructions only operate on specific sizes.
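When a memory operand names an address directly, as in [addr], most instructions can encode that address in at most 32 bits (sign-extended). Only mov to or from the accumulator (al, ax, eax, or rax) can encode a full 64-bit absolute address. Qword-sized memory operands such as qword [rax] are otherwise widely supported.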
Only mov supports 64-bit immediate values.
Both operands cannot be memory in a single instruction.
The destination operand cannot be an immediate (obviously).

Representations of Numbers

Although we’re all familiar with the decimal representation of numbers, it may be helpful to review how it works: Suppose we have a decimal number, say, 15386. The numerical value of this number is given by

$$6 (10^0) + 8 (10^1) + 3 (10^2) + 5 (10^3) + 1 (10^4)$$

In other words, to find the value of a number, we start at the first (far right) digit and multiply it by $10^0$, and then accumulate a sum, multiplying each digit by 10 to the power of its respective position. We use 10 because there are 10 possible values for each digit: 0, 1, 2, …, 8, 9.

Our other systems will work essentially the same way; the only thing that changes is how many possible values there are for each digit:

For binary there are only two choices, 0 or 1, so binary is base-2.
For hexadecimal there are 16 choices, 0, 1, 2, …, 8, 9, a, b, …, e, f. Hence, hexadecimal is base-16.
For octal, there are eight choices: 0, 1, 2, …, 6, 7, so it is base-8.

Converting to decimal

To translate a binary (base-2) number — say, 1001011b (the b suffix identifies it as binary) — into decimal, we compute

$$1(2^0) + 1(2^1) + 0(2^2) + 1(2^3) + 0(2^4) + 0(2^5) + 1(2^6) = 75$$

To translate a hexadecimal (base-16) number — say, 0x1a2b (the 0x prefix identifies it as hexadecimal) — we compute

$$11 (16^0) + 2 (16^1) + 10 (16^2) + 1 (16^3) = 6699$$

(where 11 is the value of b and 10 is the value of a).

Octal is similar, except that the base is 8. Octal is used relatively rarely, and we won’t cover it further.
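The assembler accepts these notations directly, so one way to check a hand conversion is to let it do the work. The following is only a sketch: it assumes yasm’s NASM-style literals (the b suffix and 0x prefix used above), and it uses the process exit status purely as a convenient way to look at a small value.

```nasm
section .text

global _start
_start:

    mov     rdi,    1001011b        ; binary literal, should equal 75
    mov     rbx,    0x1a2b          ; hexadecimal literal, should equal 6699
    sub     rbx,    6699            ; rbx = 0 if the two notations agree
    add     rdi,    rbx             ; rdi stays 75 when rbx is 0

    mov     rax,    60              ; exit with rdi as the status
    syscall
```

After running it, echo $? should print 75. (Exit statuses only hold values from 0 to 255, which is why 6699 is checked by subtraction rather than returned directly.)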

Converting from decimal

Converting a decimal value into binary or hexadecimal is a bit more tricky. We’ll look at the procedure in binary, where it’s not too difficult, and then generalize it to hexadecimal.

To translate a decimal value — say 1234 — into binary, we divide by 2, and then look at the remainder: 0. This becomes the low bit, and we use the value, divided by two and rounded down (i.e., 617), as input for the next stage:

xxxxxxxxxx0

Again, we divide 617 by 2 and look at the remainder. It’s 1, so we set the next higher bit to 1, and use 617/2 = 308 as input for the next cycle:

xxxxxxxxx10

Continuing this process we have:

Input	Remainder	Binary
1234	0	`__________0`
617	1	`_________10`
308	0	`________010`
154	0	`_______0010`
77	1	`______10010`
38	0	`_____010010`
19	1	`____1010010`
9	1	`___11010010`
4	0	`__011010010`
2	0	`_0011010010`
1	1	`10011010010`

1/2 = 0 (rounded down), and we stop the process when we reach 0. If we want, we can check our work by converting the binary value back to decimal.

A similar process works for converting decimal to hexadecimal, except that we divide by 16. E.g., to convert 1234 to hex:

Input	Remainder	Hexadecimal
1234	2	`0x___2`
77	13 = `d`	`0x__d2`
4	4	`0x4d2`

(Dividing by 16 each time obviously gets us to 0 much faster than dividing by 2.)
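The repeated-division procedure can be sketched in assembly as well. This is a hedged example rather than course-provided code: the names BASE, digit_chars, buffer, newline, buffer_end, and next_digit are made up for illustration, and it uses a few instructions (div, cmp, jne) that we have not covered yet.

```nasm
BASE:           equ     2           ; try 16 for hexadecimal

section .data

digit_chars:    db      "0123456789abcdef"
buffer:         times 64 db ' '     ; digits are filled in from the right
newline:        db      10
buffer_end:                         ; address just past the newline

section .text

global _start
_start:

    mov     rax,    1234            ; value to convert
    mov     rbx,    BASE            ; divisor
    mov     rsi,    newline         ; digits are stored just before the newline

next_digit:
    mov     rdx,    0               ; div divides rdx:rax, so clear the high half
    div     rbx                     ; rax = quotient, rdx = remainder
    mov     dl,     byte [digit_chars + rdx]    ; remainder -> digit character
    sub     rsi,    1               ; move one place to the left
    mov     byte [rsi], dl          ; store the digit
    cmp     rax,    0               ; stop when the input reaches 0
    jne     next_digit

    mov     rdx,    buffer_end
    sub     rdx,    rsi             ; length = digits plus the newline
    mov     rax,    1               ; SYS_WRITE
    mov     rdi,    1               ; standard output
    syscall                         ; rsi already points at the first digit

    mov     rax,    60
    mov     rdi,    0
    syscall
```

With BASE set to 2 this should print 10011010010, and with BASE set to 16 it should print 4d2, matching the two tables above.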

Binary arithmetic

Unsigned binary addition

Unsigned binary addition (i.e., both values positive) works in a way analogous to normal decimal addition, except that there are only two digits, so 1 + 1 = 10, and we “carry” the 1. For example:

    1001011
 +  1100101
────────────

1 + 1 = 10, so we put a 0 on the answer line and carry the “extra” leading 1 digit:

         1
    1001011
 +  1100101
────────────
          0

Again, 1 + 1 = 10 so we put a 0 and carry the 1:

        11
    1001011
 +  1100101
────────────
         00

Again, 1 + 1 = 10, so we put a 0 and carry the 1:

       111
    1001011
 +  1100101
────────────
        000

Add and carry again:

      1111
    1001011
 +  1100101
────────────
       0000

1 + 0 + 0 = 1, so there’s no need to carry:

      1111
    1001011
 +  1100101
────────────
      10000

0 + 1 = 1:

      1111
    1001011
 +  1100101
────────────
     110000

And finally 1 + 1 = 10, and the carried 1 drops into the answer:

      1111
    1001011
 +  1100101
────────────
   10110000

To check our work: the top value in decimal is 75, the bottom is 101, and the answer is 176, which is correct.

This example also illustrates a problem that sometimes occurs: the result of adding two n-digit values may have n+1 digits. E.g., if we add two bytes, the result may not fit in a byte! We’ll see how the CPU deals with this situation later.

Another example:

   111111
   01011011  = 91
 + 01110110  = 118
────────────
   11010001  = 209
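The first worked example can be checked on the CPU itself. A minimal sketch, assuming we do the addition in the byte-sized register al and report the sum through the exit status:

```nasm
section .text

global _start
_start:

    mov     al,     1001011b        ; 75
    add     al,     1100101b        ; + 101, so al = 10110000b = 176
    movzx   rdi,    al              ; widen the byte so it can be the exit status

    mov     rax,    60
    syscall
```

After running it, echo $? should print 176. If the true sum were larger than 255, al would hold only its low eight bits.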

Unsigned binary subtraction

Subtraction follows a similar pattern, but with “borrowing” instead of carrying. E.g.,


    110110
 -  100001
────────────

0 - 1 = -1, so we borrow a 1 from the next column (i.e., we are doing 10 - 1 = 1)


    110102
 -  100001
────────────
         1

We cheat a little bit and write the borrowed-into digit as “2” (binary 10 is decimal 2), and 2 - 1 = 1. The borrowed-from digit, which was 1, becomes 0.

In the next column the borrowed-from digit is now 0, so 0 - 0 = 0:


    110100
 -  100001
────────────
        01

1 - 0 = 1:


    110110
 -  100001
────────────
       101

0 - 0 = 0:


    110110
 -  100001
────────────
      0101

1 - 0 = 1:


    110110
 -  100001
────────────
     10101

And 1 - 1 = 0 (we could drop the leading 0 in the answer):


    110110
 -  100001
────────────
    010101

To check our work: the top is 54 and the bottom is 33, and the result is 21, which is correct.

Sometimes we may need to borrow more than once in a row:


    1  1  1  0  0  0
 -  1  0  0  0  1  1
─────────────────────

The first step is 2 - 1 = 1, with the extra 1 borrowed from the next column.

               -1 
    1  1  1  0  0  2
 -  1  0  0  0  1  1
─────────────────────
                   1

But then the next column is -1 + 0 - 1. So we borrow, again, from the next column, giving us -1 + 10 - 1 = 0:

            -1 -1 
    1  1  1  0  2  2
 -  1  0  0  0  1  1
─────────────────────
                0  1

We have to keep borrowing over until we find a digit that is set:

         -1 -1 -1 
    1  1  1  2  2  2
 -  1  0  0  0  1  1
─────────────────────
    0  1  0  1  0  1

What happens if the bottom value is larger than the top?

   1 
    011
  - 100 
────────
-1  111

Arithmetically, this says that 3 - 4 = 7, with an extra borrow. The answer that we get is effectively the result we would have if there were another column to borrow from (i.e., if we had done 1011 - 100, for which the correct result is 111 = 7).

We’re left trying to borrow a 1 that doesn’t exist. This is the opposite situation of adding two values where the result doesn’t fit; here we need an extra 1 which is not present. As we’ll see later, both these situations are treated similarly by the CPU, by setting a flag to indicate that a carry/borrow occurred in the most recent addition/subtraction operation.

Another example:

      -1-1  -1-1   
   0 1 1 1 0 1 1 0  = 118
 - 0 1 0 1 1 0 1 1  = 91 
───────────────────
   0 0 0 1 1 0 1 1  = 27
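The last example can be reproduced with the sub instruction. As with the sketches above, using al and the exit status is just a convenient assumption for looking at the result:

```nasm
section .text

global _start
_start:

    mov     al,     01110110b       ; 118
    sub     al,     01011011b       ; - 91, so al = 00011011b = 27
    movzx   rdi,    al              ; widen the byte so it can be the exit status

    mov     rax,    60
    syscall
```

After running it, echo $? should print 27. Swapping the two values would instead leave 229 (that is, 256 - 27) in al, which is the eight-bit analogue of the 3 - 4 = 7 example.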

Registers and memory

Registers occupy the highest level of the memory hierarchy; they are located on the CPU itself and are directly accessible by instructions, and thus are the fastest place to store values you are working with. On the other hand, because they must be physically close to the CPU, they can’t take up too much space; x86-64 has 16 64-bit general purpose registers, 16 128-bit floating-point/SIMD registers, and a number of special-purpose registers.

General purpose registers

The general purpose registers are arranged in such a way that the full 64-bits is partitioned into the low 32 bits of that, the low 16 bits of that, and the low (and sometimes high) 8 bits of that. E.g., for rax:

`rax` (64 bits)
	`eax` (32 bits)
		`ax` (16 bits)
		`ah` (8 bits)	`al`(8 bits)

Only registers rax, rbx, rcx, and rdx allow access to the high byte (of the low word, so that it’s actually kind of in the middle of the whole register), via ah, bh, ch, and dh. There are some restrictions on when/where these can be used (in 64-bit mode).
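To make the nesting concrete, here is a small sketch that loads a recognizable 64-bit pattern into rax and then reads back one of its pieces. The constant is arbitrary and chosen only so that each byte is easy to spot.

```nasm
section .text

global _start
_start:

    mov     rax,    0x1122334455667788
    ;; Now eax = 0x55667788, ax = 0x7788, ah = 0x77, and al = 0x88.
    movzx   edi,    ah              ; copy the second byte into edi, zero-extended

    mov     rax,    60              ; exit with that byte as the status
    syscall
```

After running it, echo $? should print 119, which is 0x77.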

64-bits	Low 32-bits	Low 16-bits	Low 8-bits	Comment
`rax`	`eax`	`ax`	`al`	`syscall` code and return; Accumulator
`rbx`	`ebx`	`bx`	`bl`	Base
`rcx`	`ecx`	`cx`	`cl`	Loop-Count; `syscall` temp register
`rdx`	`edx`	`dx`	`dl`	3rd `syscall` arg.; Dword accum.
`rsi`	`esi`	`si`	`sil`	2nd `syscall` arg.; Source index
`rdi`	`edi`	`di`	`dil`	1st `syscall` arg.; Dest. index
`rbp`	`ebp`	`bp`	`bpl`	Stack base pointer
`rsp`	`esp`	`sp`	`spl`	Stack pointer
`r8`	`r8d`	`r8w`	`r8b`	5th `syscall` arg.
`r9`	`r9d`	`r9w`	`r9b`	6th `syscall` arg.
`r10`	`r10d`	`r10w`	`r10b`	4th `syscall` arg.
`r11`	`r11d`	`r11w`	`r11b`	`syscall` temp. reg.
…	…	…	…
`r15`	`r15d`	`r15w`	`r15b`

When we get to functions, we’ll see that there is an additional categorization into which each register falls. When we call a function, either we, the caller, are responsible for saving the register if we need it (“caller-preserved registers”), or the function we call is responsible for saving the register if it needs it (“callee-preserved registers”).

rsp and rbp, although general-purpose, are used for managing the stack; rsp points to the top of the stack, while rbp (“base pointer”) traditionally points to the beginning of the current function’s stack frame (as we’ll see, this is not automatic). rsp should not be used for anything else, but rbp is not strictly off-limits. Note that rsp points to the element on the top of the stack, and not the empty space above it.

rax is called the accumulator and several instructions implicitly use it as their destination. Similarly, rbx is sometimes called the “base” register, rcx is called the “count register”, and rdx is called the “dword accumulator”; there are a few instructions that use them implicitly, but for the most part you can use them for any purpose.

By “use implicitly”, I mean there are instructions which either read their input, or write their output, to one of these registers without mentioning it. For example, to divide rax by rbx the instruction is

idiv rbx

This will read the dividend from rdx:rax (the high half in rdx, the low half in rax) and then write the quotient back into rax and the remainder into rdx, even though the instruction doesn’t mention them.
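A short sketch of the implicit register use, dividing 1234 by 16. The cqo instruction, which we have not covered yet, sign-extends rax into rdx so that rdx:rax holds a sensible dividend before the division.

```nasm
section .text

global _start
_start:

    mov     rax,    1234            ; dividend (low half)
    cqo                             ; sign-extend rax into rdx, giving rdx:rax
    mov     rbx,    16              ; divisor
    idiv    rbx                     ; rax = 77 (quotient), rdx = 2 (remainder)

    mov     rdi,    rax             ; exit with the quotient as the status
    mov     rax,    60
    syscall
```

After running it, echo $? should print 77. Changing mov rdi, rax to mov rdi, rdx should report the remainder, 2, instead.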

rsi and rdi are the Source and Destination Indexes used by certain string operations implicitly, but you can use them as general-purpose registers in other contexts.

When a double-qword (128-bit) value is needed, it is commonly stored in a combination of rax and rdx, with the high qword in rdx (this is written as rdx:rax). We’ll see this with multiplication and division.
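As a preview, an unsigned multiplication whose result does not fit in 64 bits shows the rdx:rax pairing. This is only a sketch; mul is covered properly later, and the operands are chosen so the product is exactly 2^65.

```nasm
section .text

global _start
_start:

    mov     rax,    0x8000000000000000  ; 2^63
    mov     rbx,    4
    mul     rbx                     ; rdx:rax = 2^65, so rdx = 2 and rax = 0

    mov     rdi,    rdx             ; exit with the high qword as the status
    mov     rax,    60
    syscall
```

After running it, echo $? should print 2.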

The SIMD/floating-point registers are named xmm0 through xmm15. They are separate from the older x87 floating-point registers st0-st7 (which share space with the MMX registers mm0-mm7). None of these can be accessed by normal instructions; they can only be used with special floating-point/SIMD instructions. (The x87 instructions start with f, e.g., fadd is floating-point addition, and many of the integer SIMD instructions start with p.)

Syscall register usage

Syscalls return their result (or, on failure, a negative error code) via rax, which means that the value of rax after the syscall returns is probably not what you set it to. Similarly, the two “temporary” registers rcx and r11 are overwritten by the syscall.

Second-byte registers

The registers rax, rbx, rcx, and rdx are the only registers which allow access to the second bytes, via ah, bh, ch, and dh. However, there are some restrictions on the use of these registers: they cannot be used with any instruction which uses the REX prefix, which all the “new” 64-bit instructions use. So any instruction which uses features added by x86-64 is unable to also use the *h registers. Examples of this restriction include:

mov ah, sil – The low-byte versions of rsi, rdi, rsp, rbp were added by x86-64 and hence require the REX prefix to use.
mov r8b, ah – Similarly, r8 through r15 were added by x86-64.
mov ah, byte [r8] – Using any of r8 through r15, even as an address, requires the REX prefix. (An address in one of the original registers, as in mov ah, byte [rax], does not.)
The instructions that convert values of different sizes cannot convert from *h to any of the new 64-bit registers.

We won’t use the *h registers much, so you probably won’t run into these restrictions, but they’re worth being aware of.

`syscall` register use

As we’ve seen, some registers are used specially by syscalls: rax is used for the syscall code, and then rdi, rsi, rdx, r10, r8 and r9 are used for the arguments to the syscall, in that order. (As we’ll see, this order is very similar to the order used for C-style function calls; the only difference is that rcx is used instead of r10.) There are no syscalls that take more than six arguments. If the syscall returns a value, it will be in rax. Negative values generally indicate an error.

The syscall itself is allowed to overwrite the values in the rcx and r11 registers, but it will preserve all other registers. You should bear this in mind if you are using rcx or r11 with syscalls: the values you put in the registers before a syscall may not be there after the syscall!
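One practical consequence: a loop counter that must survive a syscall should not live in rcx, even though rcx is the traditional “count” register. A sketch, assuming we keep the counter in rbx instead (the label again is made up for illustration, and dec/jnz are instructions we have not covered yet):

```nasm
section .data

msg:            db      "Hello, world!", 10
MSGLEN:         equ     $-msg

section .text

global _start
_start:

    mov     rbx,    3               ; loop counter; rbx is preserved by syscall

again:
    mov     rax,    1               ; rax must be reloaded: it holds the return value
    mov     rdi,    1
    mov     rsi,    msg
    mov     rdx,    MSGLEN
    syscall                         ; may overwrite rax, rcx, and r11
    dec     rbx                     ; count down
    jnz     again                   ; repeat until the counter reaches 0

    mov     rax,    60
    mov     rdi,    0
    syscall
```

This should print the message three times.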

This usage is not an official part of x86-64 assembly, but simply a convention written into the System-V Unix ABI specification which describes how programs running on x86-64 Unix systems can expect to interact with the OS. (Among other things, the specification also states that the address space of each process is 48-bits, and that each process’s .text section is mapped starting at 0x400000.)

The `mov` instruction

The most fundamental assembly instruction is mov, which moves data from one location (memory, register, immediate) to another (memory, register). It has the form

mov destination, source

where destination can be a register or memory, and source can be a register, memory, or immediate value. Both destination and source must be the same size, and both cannot be in memory in a single instruction. (Memory-to-memory transfers require two mov instructions.)
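For instance, copying one qword variable to another goes through a register. The variable names src and dst below are hypothetical, chosen for this sketch:

```nasm
section .data

src:            dq      42
dst:            dq      0

section .text

global _start
_start:

    ;; mov  qword [dst], qword [src]    ; rejected: both operands are memory
    mov     rax,    qword [src]     ; memory -> register
    mov     qword [dst], rax        ; register -> memory

    mov     rdi,    qword [dst]     ; read it back to use as the exit status
    mov     rax,    60
    syscall
```

After running it, echo $? should print 42.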

An important thing to remember is that, in 64-bit mode, mov is the only instruction which supports qword immediate operands. Other instructions accept immediates of at most 32 bits (sign-extended to 64 bits when the operation is qword-sized), so a larger constant has to be loaded into a register first. Thus, most qword operations on large immediate values begin with a mov. For example, you cannot add a full 64-bit constant to a register directly:

add rax, some_huge_constant

you have to mov the constant into a register, and then add it:

mov rbx, some_huge_constant
add rax, rbx

A special case is when the source/destination are dword (32-bit) values, e.g.,

mov eax, ebx

In this situation, and only this situation, the high double-word of rax is implicitly set to 0. This zeroing does not occur when setting ax, or ah/al. (This behavior applies not just to mov but to many other instructions as well. E.g., xor eax, ebx will zero the high dword of rax.)
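The difference between writing a dword and writing a word can be seen with a sketch like the following. It fills two registers with all 1 bits, overwrites their low parts, and then uses shr (a shift instruction we have not covered yet) to look at what is left in the high bits.

```nasm
section .text

global _start
_start:

    mov     rax,    -1              ; rax = 0xffffffffffffffff
    mov     eax,    5               ; rax = 0x0000000000000005 (high dword zeroed)

    mov     rbx,    -1              ; rbx = 0xffffffffffffffff
    mov     bx,     5               ; rbx = 0xffffffffffff0005 (high bits kept)

    shr     rax,    32              ; rax = 0: nothing survived above bit 31
    shr     rbx,    56              ; rbx = 0xff: the top byte is still all 1s
    add     rax,    rbx             ; combine the two observations

    mov     rdi,    rax             ; exit with the combined value as the status
    mov     rax,    60
    syscall
```

After running it, echo $? should print 255, all of which comes from rbx.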

Swapping registers

Swapping (exchanging) the values in two registers (or a register and a memory location) is a common enough operation that a dedicated instruction is provided for it:

xchg a, b

exchanges the values in locations a and b. Either can be a register or memory, but both cannot be memory at the same time, and neither can be an immediate value (for obvious reasons). This allows us to swap values without needing a third “temporary” register.

Like mov, xchg on the 32-bit registers (eax, ebx, etc.) will implicitly zero the high dword.
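A minimal sketch of a swap, with arbitrary values chosen for illustration:

```nasm
section .text

global _start
_start:

    mov     rax,    7
    mov     rbx,    35
    xchg    rax,    rbx             ; now rax = 35 and rbx = 7

    mov     rdi,    rax             ; exit with the new value of rax
    mov     rax,    60
    syscall
```

After running it, echo $? should print 35.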

Clearing registers

The easiest way to set a register to 0 is

mov reg, 0

A slightly more efficient way is to XOR the register with itself:

xor reg, reg

Remember that the result of XOR is 1 if and only if exactly one of its inputs is 1. If we XOR a value with itself, each pair of bits being XOR’d is either (0,0) (0 XOR 0 = 0) or (1,1) (1 XOR 1 = 0), so the result is to zero all the bits.

The opcode for xor is smaller than for mov with an immediate, and thus can be loaded into the CPU faster; it also allows the CPU to perform a number of data flow optimizations that are not otherwise possible. Eventually, your brain will automatically translate xor reg, reg into reg = 0, but I don’t particularly care which you use.
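For example, the exit sequence from hello.s could clear rdi this way. The initial value 123 is there only to show that it is wiped out.

```nasm
section .text

global _start
_start:

    mov     rdi,    123             ; some leftover value
    xor     rdi,    rdi             ; rdi = 0, whatever it held before

    mov     rax,    60              ; exit with status 0
    syscall
```

After running it, echo $? should print 0.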

Special purpose registers

The following registers exist for particular purposes, which are enforced by the CPU. Either you can’t put general data into them, or get it out of them. Typically specialized instructions (not mov) have to be used to access them.

rip The Instruction Pointer points to the next instruction to be executed (i.e., the instruction immediately after this one). The low 32-bits are accessible as eip, and the low word as ip, but since addresses are always 64-bits, this isn’t particularly useful. Branch instructions modify rip directly.

The rflags register: each bit of the flags register has a different meaning, and the various flags are set or unset depending on the results of certain operations. We will look at the flag register in depth when we learn about tests and conditional operations. Normally you don’t need to worry about accessing the flags register, as it is set and tested by the relevant operations automatically.

The flags register is organized as

Bit	0	2	4	6	7	10	11	21
Purpose	Carry (CF)	Parity (PF)	Adjust (AF)	Zero (ZF)	Sign (SF)	Direction (DF)	Overflow (OF)	Identification (ID)

(Unused bits are reserved.)

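CF is set if the previous addition/subtraction operation ended with a carried (or borrowed) 1.
PF is set if the low byte of the last result contains an even number of 1s.
AF is set if the last addition/subtraction carried (or borrowed) out of the low four bits; it exists to support BCD arithmetic. We’ll look at BCD later.
ZF is set if the result of the last operation was zero.
SF is set if the result of the last operation was negative (i.e., its highest bit was 1).
OF is set if the last signed arithmetic operation overflowed (wrapped around).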
DF determines the direction the repeating string operations move in (increment or decrement). We’ll see its use when we look at string operations.
ID indicates the presence of the cpuid instruction. All modern x86 CPUs support this instruction, so we can ignore this flag.

Normally we don’t need to worry about examining the flags register, as there are dedicated conditional instructions (e.g., branches, moves) that execute only if a specific flag is/is not set. If you want to set/clear a specific flag, the st*/cl* family of instructions does so. E.g., stc sets the carry flag to true, while clc clears it (sets it to false).
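As a small preview of how a flag gets used, the sketch below adds two bytes whose sum does not fit in a byte and then copies the carry flag into a register with setc, one of the conditional instructions we will meet later. The values 200 and 100 are arbitrary.

```nasm
section .text

global _start
_start:

    mov     al,     200
    add     al,     100             ; 300 does not fit in a byte: al = 44, CF = 1
    setc    bl                      ; bl = 1 if CF is set, 0 otherwise
    movzx   rdi,    bl              ; widen the byte so it can be the exit status

    mov     rax,    60
    syscall
```

After running it, echo $? should print 1. With a smaller sum, such as 100 + 100, it should print 0.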

The data segment registers ds, es, ss, fs and gs and the code segment register cs are not useful in (non-kernel) x86-64 code, but you should not use them for your own purposes, either. (Windows and Linux both use fs and gs to point to thread-local storage, but that’s not a standard thing, just a convention.) They control how your process’s memory is mapped to the system’s memory address space.
The control registers cr0 through cr15 are not accessible in user-mode code at all; they hold system-level settings, such as whether protected mode and paging are enabled, that take part in switching the CPU between 16-, 32-, and 64-bit operation.
There are a few additional registers that have to do with memory management, debugging breakpoints, internal performance parameters, etc. Most of these are not useful for us, and many are inaccessible by user code anyway.

The normal mov instruction usually cannot be used to manipulate these registers. Instead, specialized instructions exist for getting/setting their values.

In-class group project

To allow you to get your feet wet with assembly, here is our first group project:

Write an assembly program which prompts the user for their name by printing What is your name?, accepts up to 255 characters of input, and then prints out Hello, name, nice to meet you! followed by a newline.

You’ll have to use both the SYS_WRITE (= 1) and SYS_READ (= 0) syscalls. Use the following .data section:

section .data

prompt:       db      "What is your name?"
prompt_len:   equ     $-prompt

buffer:       times 255 db '!'

resp1:        db      "Hello, "
resp1_len:    equ     $-resp1
resp2:        db      ", nice to meet you!", 10
resp2_len:    equ     $-resp2

buffer is the input buffer to pass to the SYS_READ call; it consists of 255 ! characters. Note that SYS_READ will “return” the actual number of bytes read in rax, which you will then have to use when you print out the contents of the buffer. (If you get the length of the input wrong, you’ll see either the user’s name cut off, or with !!!!s added onto the end of it.)

The “fd” parameter to both SYS_READ and SYS_WRITE is a file descriptor, a number which identifies a file or stream. The standard file descriptors which are always available are

FD Number	Stream
0	Standard input
1	Standard output
2	Standard error (output)

So you’ll SYS_READ from FD #0, and SYS_WRITE to FD #1 (as we did before).

Don’t forget to end your program with a SYS_EXIT (= 60) syscall, to gracefully end your program!

Save your work in ~/cs241/group1/.