;;; 
;;; hello.s
;;; Prints "Hello, world!"
;;;

section .data

msg:            db      "Hello, world!", 10
MSGLEN:         equ     $-msg

section .text

;; Program code goes here

global _start
_start:

    mov     rax,    1               ; Syscall code in rax
    mov     rdi,    1               ; 1st arg, file desc. to write to
    mov     rsi,    msg             ; 2nd arg, addr. of message
    mov     rdx,    MSGLEN          ; 3rd arg, num. of chars to print
    syscall

    ;; Terminate process
    mov     rax,    60              ; Syscall code in rax
    mov     rdi,    0               ; First parameter in rdi
    syscall                         ; End process

This can be assembled and linked with

asm hello.s

or, manually, with

yasm -g dwarf2 -f elf64 hello.s -l hello.lst
ld -g -o hello hello.o

and then run via the usual

./hello

Today’s lecture will cover conversions between different number representations, the general-purpose registers available to us, and the additional special-purpose registers used by the system to maintain our program. We’ll hopefully end with our first in-class group project.

Operands

Each assembly instruction has a number of “operands”, inputs to the instruction. The biggest instructions have three operands, most have two or one, and some (like syscall) have none. Each operand can be a register, an immediate (constant) value, or a memory operand, either memory-direct or memory-indirect (with some restrictions which depend on the exact instruction).

Normally when we use a memory operand we’ll give a size qualifier with it, like byte [msg] (first byte of the message) or qword [rax] (64 bits at the address held in rax). The size is often optional, as it can be deduced from the other operands, but it’s good practice to put it in anyway, as it helps to catch size-mismatch mistakes.

Often, the memory operands are grouped together, as generally speaking, wherever memory-direct operands are allowed, memory-indirect are allowed also. Hence, we can describe the allowed operand types by a combination of R(egister), I(mmediate), and M(emory). If we say that a given operand is RM, that means it supports register and memory operands, but not immediates.

Most instructions have the following restrictions:

Representations of Numbers

Although we’re all familiar with the decimal representation of numbers, it may be helpful to review how it works: Suppose we have a decimal number, say, 15386. The numerical value of this number is given by

$$6 (10^0) + 8 (10^1) + 3 (10^2) + 5 (10^3) + 1 (10^4)$$

In other words, to find the value of a number, we start at the first (far right) digit and multiply it by \(10^0\), and then accumulate a sum, multiplying each digit by 10 to the power of its respective position. We use 10 because there are 10 possible values for each digit: 0, 1, 2, …, 8, 9.

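Our other systems work essentially the same way; the only thing that changes is how many possible values there are for each digit: 2 for binary (0 and 1), 8 for octal (0 through 7), and 16 for hexadecimal (0 through 9, then a through f for the values 10 through 15).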

Converting to decimal

To translate a binary (base-2) number — say, 1001011b (the b suffix identifies it as binary) — into decimal, we compute

$$1(2^0) + 1(2^1) + 0(2^2) + 1(2^3) + 0(2^4) + 0(2^5) + 1(2^6) = 75$$

To translate a hexadecimal (base-16) number — say, 0x1a2b (the 0x prefix identifies it as hexadecimal) — we compute

$$11 (16^0) + 2 (16^1) + 10 (16^2) + 1 (16^3) = 6699$$
(where 11 is the value of b and 10 is the value of a).

Octal is similar, except that the base is 8. Octal is used relatively rarely, and we won’t cover it further.
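
The positional sums above can be sketched in a few lines of Python. This is only an illustration: to_decimal is a hypothetical helper (not part of the course tools), and it assumes the digits are passed as a string without the b suffix or 0x prefix.

```python
def to_decimal(digits: str, base: int) -> int:
    """Sum digit * base**position, counting positions from the right."""
    total = 0
    for position, ch in enumerate(reversed(digits.lower())):
        value = int(ch, 16)  # maps '0'-'9' and 'a'-'f' to 0-15
        if value >= base:
            raise ValueError(f"digit {ch!r} is not valid in base {base}")
        total += value * base ** position
    return total


binary_value = to_decimal("1001011", 2)   # the binary example above
hex_value = to_decimal("1a2b", 16)        # the hexadecimal example above
print(binary_value, hex_value)            # prints: 75 6699
```

The same function handles decimal (base 10) and octal (base 8), since only the base changes.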

Converting from decimal

Converting a decimal value into binary or hexadecimal is a bit more tricky. We’ll look at the procedure in binary, where it’s not too difficult, and then generalize it to hexadecimal.

To translate a decimal value — say 1234 — into binary, we divide by 2, and then look at the remainder: 0. This becomes the low bit, and we use the value, divided by two and rounded down (i.e., 617), as input for the next stage:

xxxxxxxxxx0

Again, we divide 617 by 2 and look at the remainder. It’s 1, so we set the next higher bit to 1, and use 617/2 = 308 as input for the next cycle:

xxxxxxxxx10

Continuing this process we have:

Input   Remainder   Binary
1234    0           __________0
617     1           _________10
308     0           ________010
154     0           _______0010
77      1           ______10010
38      0           _____010010
19      1           ____1010010
9       1           ___11010010
4       0           __011010010
2       0           _0011010010
1       1           10011010010

1/2 rounds down to 0, and we stop the process when the input reaches 0. If we want, we can check our work by converting the binary value back to decimal.

A similar process works for converting decimal to hexadecimal, except that we divide by 16. E.g., to convert 1234 to hex:

Input   Remainder   Hexadecimal
1234    2           0x___2
77      13 = d      0x__d2
4       4           0x4d2

(Dividing by 16 each time obviously gets us to 0 much faster than dividing by 2.)
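
The repeated-division procedure can be written out as a short Python sketch. The name from_decimal is made up for this illustration; it follows the tables above step by step, placing each remainder to the left of the digits found so far.

```python
def from_decimal(value: int, base: int) -> str:
    """Convert a non-negative integer to a digit string by repeated division."""
    symbols = "0123456789abcdef"
    if value == 0:
        return "0"
    out = ""
    while value > 0:
        # The quotient is the input for the next stage;
        # the remainder is the next digit, moving right to left.
        value, remainder = divmod(value, base)
        out = symbols[remainder] + out
    return out


as_binary = from_decimal(1234, 2)
as_hex = from_decimal(1234, 16)
print(as_binary, as_hex)   # prints: 10011010010 4d2
```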

Binary arithmetic

Unsigned binary addition

Unsigned binary addition (i.e., both values positive) works in a way analogous to normal decimal addition, except that there are only two digits, so 1 + 1 = 10, and we “carry” the 1. For example:

    1001011
 +  1100101
────────────

1 + 1 = 10, so we put a 0 on the answer line and carry the “extra” leading 1 digit:

         1
    1001011
 +  1100101
────────────
          0

Again, 1 + 1 = 10 so we put a 0 and carry the 1:

        11
    1001011
 +  1100101
────────────
         00

Again, 1 + 1 = 10 (counting the carried 1), so we put a 0 and carry the 1:

       111
    1001011
 +  1100101
────────────
        000

Add and carry again:

      1111
    1001011
 +  1100101
────────────
       0000

1 + 0 + 0 = 1, so there’s no need to carry:

      1111
    1001011
 +  1100101
────────────
      10000

0 + 1 = 1:

      1111
    1001011
 +  1100101
────────────
     110000

And finally 1 + 1 = 10, and the carried 1 drops into the answer:

      1111
    1001011
 +  1100101
────────────
   10110000

To check our work: the top value in decimal is 75, the bottom is 101, and the answer is 176, which is correct.

This example also illustrates a problem that sometimes occurs: the result of adding two n-digit values may have n+1 digits. E.g., if we add two bytes, the result may not fit in a byte! We’ll see how the CPU deals with this situation later.

Another example:

   111111
   01011011  = 91
 + 01110110  = 118
────────────
   11010001  = 209
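
The column-by-column addition can be sketched in Python as follows. This is an illustrative model working on strings of bits, not how the CPU is programmed; add_binary is a hypothetical name.

```python
def add_binary(a: str, b: str) -> str:
    """Add two unsigned binary strings column by column, right to left."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    result = ""
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        total = int(bit_a) + int(bit_b) + carry
        result = str(total % 2) + result   # digit written on the answer line
        carry = total // 2                 # 1 if this column overflowed
    if carry:
        result = "1" + result              # the final carry adds an extra digit
    return result


first = add_binary("1001011", "1100101")     # 75 + 101
second = add_binary("01011011", "01110110")  # 91 + 118
print(first, second)   # prints: 10110000 11010001
```

Note that the first result has eight digits even though both inputs have seven, which is the n+1-digit situation described above.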

Unsigned binary subtraction

Subtraction follows a similar pattern, but with “borrowing” instead of carrying. E.g.,


    110110
 -  100001
────────────

0 - 1 = -1, so we borrow a 1 from the next column (i.e., we are doing 10 - 1 = 1)


    110102
 -  100001
────────────
         1

We cheat a little bit and write the borrowed-into digit as “2”, and 2-1 = 1. Similarly, the borrowed-from digit (10 = 2) becomes 0.

1 - 1 = 0 (the original 1, minus the 1 we lent to the previous column; the bottom digit is 0):


    110100
 -  100001
────────────
        01

1 - 0 = 1:


    110110
 -  100001
────────────
       101

0 - 0 = 0:


    110110
 -  100001
────────────
      0101

1 - 0 = 1:


    110110
 -  100001
────────────
     10101

And 1 - 1 = 0 (we could drop the leading 0 in the answer):


    110110
 -  100001
────────────
    010101

To check our work: the top is 54 and the bottom is 33, and the result is 21, which is correct.

Sometimes we may need to borrow more than once in a row:


    1  1  1  0  0  0
 -  1  0  0  0  1  1
─────────────────────

The first step is 2 - 1 = 1, with the extra 1 borrowed from the next column.

               -1 
    1  1  1  0  0  2
 -  1  0  0  0  1  1
─────────────────────
                   1    

But then the next column is -1 + 0 - 1. So we borrow, again, from the next column, giving us -1 + 10 - 1 = 0:

            -1 -1 
    1  1  1  0  2  2
 -  1  0  0  0  1  1
─────────────────────
                0  1    

We have to keep borrowing over until we find a digit that is set:

         -1 -1 -1 
    1  1  1  2  2  2
 -  1  0  0  0  1  1
─────────────────────
    0  1  0  1  0  1    

What happens if the bottom value is larger than the top?

   1 
    011
  - 100 
────────
-1  111

Arithmetically, this says that 3 - 4 = 7, with an extra borrow. The answer that we get is effectively the result we would have if there were another column to borrow from (i.e., had we done 1011 - 100, the correct result would be 7).

We’re left trying to borrow a 1 that doesn’t exist. This is the opposite situation of adding two values where the result doesn’t fit; here we need an extra 1 which is not present. As we’ll see later, both these situations are treated similarly by the CPU, by setting a flag to indicate that a carry/borrow occurred in the most recent addition/subtraction operation.

Another example:

      -1-1  -1-1   
   0 1 1 1 0 1 1 0  = 118
 - 0 1 0 1 1 0 1 1  = 91 
───────────────────
   0 0 0 1 1 0 1 1  = 27
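
Subtraction with borrowing can be sketched the same way. In this Python illustration (sub_binary is a hypothetical name), the function returns both the result digits and the final borrow, so the case where the bottom value is larger than the top is visible as a leftover borrow of 1.

```python
def sub_binary(a: str, b: str):
    """Subtract b from a column by column; return (digits, final_borrow)."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    borrow = 0
    result = ""
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        diff = int(bit_a) - int(bit_b) - borrow
        if diff < 0:
            diff += 2      # borrow a 1 (worth 2) from the next column
            borrow = 1
        else:
            borrow = 0
        result = str(diff) + result
    return result, borrow


print(sub_binary("110110", "100001"))   # prints: ('010101', 0)
print(sub_binary("111000", "100011"))   # prints: ('010101', 0)
print(sub_binary("011", "100"))         # prints: ('111', 1)
```

The last call is the 3 - 4 example: the digits come out as 111 (7), and the final borrow of 1 signals that the true result did not fit.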

Registers and memory

Registers occupy the highest level of the memory hierarchy; they are located on the CPU itself and are directly accessible by instructions, and thus are the fastest place to store values you are working with. On the other hand, because they must be physically close to the CPU, they can’t take up too much space; x86-64 has 16 64-bit general purpose registers, 16 128-bit floating-point/SIMD registers, and a number of special-purpose registers.

General purpose registers

Each general purpose register can be accessed as a full 64-bit value, or through separate names for its low 32 bits, the low 16 bits of that, and the low (and sometimes high) 8 bits of that. E.g., for rax:

rax (64 bits)
eax (32 bits, the low half of rax)
ax (16 bits, the low half of eax)
ah (8 bits, the high byte of ax) and al (8 bits, the low byte of ax)

Only registers rax, rbx, rcx, and rdx allow access to the high byte (of the low word, so that it’s actually kind of in the middle of the whole register), via ah, bh, ch, and dh. There are some restrictions on when/where these can be used (in 64-bit mode).
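
To make the nesting concrete, here is a small Python model of how the sub-register names select parts of rax. This is only a sketch using masks and shifts on an arbitrary example value; the variable names mirror the register names but are ordinary Python integers.

```python
rax = 0x1122334455667788          # an arbitrary 64-bit example value

eax = rax & 0xFFFFFFFF            # low 32 bits
ax = rax & 0xFFFF                 # low 16 bits
al = rax & 0xFF                   # low 8 bits
ah = (rax >> 8) & 0xFF            # bits 8-15: the high byte of ax

for name, value in [("rax", rax), ("eax", eax), ("ax", ax), ("ah", ah), ("al", al)]:
    print(f"{name:>3} = {value:#x}")
```

With this example value, eax is 0x55667788, ax is 0x7788, ah is 0x77, and al is 0x88.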

64 bits  Low 32 bits  Low 16 bits  Low 8 bits  Comment
rax      eax          ax           al          syscall code and return; Accumulator
rbx      ebx          bx           bl          Base
rcx      ecx          cx           cl          Loop count; syscall temp. reg.
rdx      edx          dx           dl          3rd syscall arg.; Dword accum.
rsi      esi          si           sil         2nd syscall arg.; Source index
rdi      edi          di           dil         1st syscall arg.; Dest. index
rbp      ebp          bp           bpl         Stack base pointer
rsp      esp          sp           spl         Stack pointer
r8       r8d          r8w          r8b         5th syscall arg.
r9       r9d          r9w          r9b         6th syscall arg.
r10      r10d         r10w         r10b        4th syscall arg.
r11      r11d         r11w         r11b        syscall temp. reg.
r12      r12d         r12w         r12b
r13      r13d         r13w         r13b
r14      r14d         r14w         r14b
r15      r15d         r15w         r15b

When we get to functions, we’ll see that there is an additional categorization into which each register falls: when we call a function, are we, the caller, responsible for saving the register if we need it (“caller-preserved registers”), or is the function we call responsible for saving the register if it needs it (“callee-preserved registers”)?

rsp and rbp, although general-purpose, are used for managing the stack; rsp points to the top of the stack, while rbp (“base pointer”) traditionally points to the beginning of the current function’s stack frame (as we’ll see, this is not automatic). rsp should not be used for anything else, but rbp is not strictly off-limits. Note that rsp points to the element on the top of the stack, and not the empty space above it.

rax is called the accumulator and several instructions implicitly use it as their destination. Similarly, rbx is sometimes called the “base” register, rcx is called the “count register”, and rdx is called the “dword accumulator”; there are a few instructions that use them implicitly, but for the most part you can use them for any purpose.

By “use implicitly”, I mean there are instructions which either read their input from, or write their output to, one of these registers without mentioning it. For example, to divide rax by rbx the instruction is

idiv rbx

This will read from rax (and rdx!) and then write the quotient back into rax and the remainder into rdx, even though the instruction doesn’t mention either register.

rsi and rdi are the Source and Destination Indexes used by certain string operations implicitly, but you can use them as general-purpose registers in other contexts.

When a double-qword (128-bit) value is needed, it is commonly stored in a combination of rax and rdx, with the high qword in rdx (this is written as rdx:rax). We’ll see this with multiplication and division.
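
As a rough model of the rdx:rax pairing, the following Python sketch combines the two qwords into one 128-bit dividend and divides it. To keep sign handling out of the way it models the unsigned case (as the div instruction does) rather than the signed idiv shown above; divide_rdx_rax is a hypothetical name, and the overflow check stands in for the fault the real instruction raises when the quotient does not fit in 64 bits.

```python
MASK64 = (1 << 64) - 1


def divide_rdx_rax(rdx: int, rax: int, divisor: int):
    """Unsigned model: divide the 128-bit value rdx:rax by divisor.

    Returns (quotient, remainder), i.e. the new values of (rax, rdx).
    """
    dividend = (rdx << 64) | rax           # high qword in rdx, low qword in rax
    quotient, remainder = divmod(dividend, divisor)
    if quotient > MASK64:
        raise OverflowError("quotient does not fit in rax")
    return quotient, remainder


print(divide_rdx_rax(0, 1234, 16))               # prints: (77, 2)
print(divide_rdx_rax(1, 0, 2) == (1 << 63, 0))   # prints: True
```

The first call shows why rdx matters even for small values: if rdx is not 0 beforehand, its contents become the high half of the dividend.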

The SIMD/floating-point registers are named xmm0 through xmm15, and can only be used with special floating-point/SIMD instructions. They are separate from the older x87 floating-point registers st0-st7 (it is the 64-bit MMX registers mm0-mm7 that share space with those). (Many of these special instructions start with f or p; e.g., fadd is x87 floating-point addition.) None of these registers can be accessed by normal instructions.

Syscall register usage

Syscalls return their result (or an error code) via rax, which means that the value of rax after the syscall returns is probably not what you set it to. Similarly, the two “temporary” registers rcx and r11 are overwritten by the syscall.

Second-byte registers

The registers rax, rbx, rcx, and rdx are the only registers which allow access to the second bytes, via ah, bh, ch, and dh. However, there are some restrictions on the use of these registers: they cannot be used with any instruction which uses the REX prefix, which all the “new” 64-bit instructions use. So any instruction which uses features added by x86-64 is unable to also use the *h registers. Examples of this restriction include:

We won’t use the *h registers much, so you probably won’t run into these restrictions, but they’re worth being aware of.

syscall register use

As we’ve seen, some registers are used specially by syscalls: rax is used for the syscall code, and then rdi, rsi, rdx, r10, r8 and r9 are used for the arguments to the syscall, in that order. (As we’ll see, this order is very similar to the order used for C-style function calls; the only difference is that rcx is used instead of r10.) There are no syscalls that take more than six arguments. If the syscall returns a value, it will be in rax. Negative values generally indicate an error.

The syscall itself is allowed to overwrite the values in the rcx and r11 registers, but it will preserve all other registers. You should bear this in mind if you are using rcx or r11 with syscalls: the values you put in the registers before a syscall may not be there after the syscall!

This usage is not an official part of x86-64 assembly, but simply a convention written into the System V Unix ABI specification, which describes how programs running on x86-64 Unix systems can expect to interact with the OS. (Among other things, the specification also states that the address space of each process is 48 bits, and that each process’s .text section is mapped starting at 0x400000.)

The mov instruction

The most fundamental assembly instruction is mov, which moves data from one location (memory, register, immediate) to another (memory, register). It has the form

mov destination, source

where destination can be a register or memory, and source can be a register, memory, or immediate value. Both destination and source must be the same size, and both cannot be in memory in a single instruction. (Memory-to-memory transfers require two mov instructions.)

An important thing to remember is that, in 64-bit mode, mov is the only instruction which supports qword immediate operands. Other instructions accept at most a 32-bit immediate (which is sign-extended to 64 bits), so they can operate on larger 64-bit values only if those are already in a register or in memory. Thus, most qword operations on large immediate values begin with a mov. For example, you cannot add a full 64-bit constant to a register directly:

add rax, some_huge_constant

you have to mov the constant into a register, and then add it:

mov rbx, some_huge_constant
add rax, rbx

A special case is when the source/destination are dword (32-bit) values, e.g.,

mov eax, ebx

In this situation, and only this situation, the high double-word of rax is implicitly set to 0. This zeroing does not occur when setting ax, or ah/al. (This behavior applies not just to mov but to many other instructions as well. E.g., xor eax, ebx will zero the high dword of rax.)
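
The difference between dword writes and smaller writes can be modelled in Python. This is a sketch of the rule just described, not of any real API: the three write_* functions are hypothetical names, each taking the old 64-bit register value and the value being written, and returning the new 64-bit register value.

```python
def write_dword(reg: int, value: int) -> int:
    """Writing the 32-bit name (e.g. eax): the high dword becomes 0."""
    return value & 0xFFFFFFFF


def write_word(reg: int, value: int) -> int:
    """Writing the 16-bit name (e.g. ax): the upper 48 bits are kept."""
    return (reg & 0xFFFFFFFFFFFF0000) | (value & 0xFFFF)


def write_low_byte(reg: int, value: int) -> int:
    """Writing the low 8-bit name (e.g. al): the upper 56 bits are kept."""
    return (reg & 0xFFFFFFFFFFFFFF00) | (value & 0xFF)


rax = 0x1122334455667788
after_dword = write_dword(rax, 0xAABBCCDD)
after_word = write_word(rax, 0xAABB)
after_byte = write_low_byte(rax, 0xAA)
print(f"{after_dword:#018x}")   # prints: 0x00000000aabbccdd
print(f"{after_word:#018x}")    # prints: 0x112233445566aabb
print(f"{after_byte:#018x}")    # prints: 0x11223344556677aa
```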

Swapping registers

Swapping (exchanging) the values in two registers (or a register and a memory location) is a common enough operation that a dedicated instruction is provided for it:

xchg a, b

exchanges the values in locations a and b. Either can be a register or memory, but both cannot be memory at the same time, and neither can be an immediate value (for obvious reasons). This allows us to swap values without needing a third “temporary” register.

Like mov, xchg on the 32-bit registers (eax, ebx, etc.) will implicitly zero the high dword.

Clearing registers

The easiest way to set a register to 0 is

mov reg, 0

A slightly more efficient way is to XOR the register with itself:

xor reg, reg

Remember that the result of XOR is 1 if and only if exactly one of its inputs is 1. If we XOR a value with itself, each pair of bits being XOR’d is either (0,0) (0 XOR 0 = 0) or (1,1) (1 XOR 1 = 0), so every bit of the result is 0.
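
A quick way to convince yourself of this is to try it. The Python sketch below prints the XOR truth table and then XORs a few example values with themselves; xor_self is a hypothetical helper modelling xor reg, reg on a 64-bit register.

```python
MASK64 = (1 << 64) - 1

# Truth table for a single pair of bits.
for a in (0, 1):
    for b in (0, 1):
        print(a, "XOR", b, "=", a ^ b)


def xor_self(reg: int) -> int:
    """Model of xor reg, reg on a 64-bit register."""
    return (reg ^ reg) & MASK64


cleared = [xor_self(v) for v in (0, 1, 0x1A2B, MASK64)]
print(cleared)   # prints: [0, 0, 0, 0]
```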

The opcode for xor is smaller than for mov with an immediate, and thus can be loaded into the CPU faster; it also allows the CPU to perform a number of data flow optimizations that are not otherwise possible. Eventually, your brain will automatically translate xor reg, reg into reg = 0, but I don’t particularly care which you use.

Special purpose registers

The following registers exist for particular purposes, which are enforced by the CPU: either you can’t put general data into them, or you can’t get it out of them.

The normal mov instruction usually cannot be used to manipulate these registers. Instead, specialized instructions exist for getting/setting their values.

In-class group project

To allow you to get your feet wet with assembly, here is our first group project:

Write an assembly program which prompts the user for their name by printing What is your name?, then accepts up to 255 characters of input, and then prints out Hello, name, nice to meet you! followed by a newline.

You’ll have to use both the SYS_WRITE (= 1) and SYS_READ (= 0) syscalls. Use the following .data section:

section .data

prompt:       db      "What is your name?"
prompt_len:   equ     $-prompt

buffer:       times 255 db '!'

resp1:        db      "Hello, "
resp1_len:    equ     $-resp1
resp2:        db      ", nice to meet you!", 10
resp2_len:    equ     $-resp2

buffer is the input buffer to pass to the SYS_READ call; it consists of 255 ! characters. Note that SYS_READ will “return” the actual number of bytes read in rax, which you will then have to use when you print out the contents of the buffer. (If you get the length of the input wrong, you’ll see either the user’s name cut off, or with !!!!s added onto the end of it.)

The “fd” parameter to both SYS_READ and SYS_WRITE is a file descriptor, a number which identifies a file or stream. The standard file descriptors which are always available are

FD Number   Stream
0           Standard input
1           Standard output
2           Standard error (output)

So you’ll SYS_READ from FD #0, and SYS_WRITE to FD #1 (as we did before).

Don’t forget to finish with a SYS_EXIT (= 60) syscall, so that your program ends gracefully!

Save your work in ~/cs241/group1/.