Today we’ll cover:

Instructions covered last time:

Review

Registers

Last time we looked at the full set of general-purpose registers available to us: rax, rbx, rcx, rdx, rdi, rsi, rbp, rsp, r8 through r15.

We saw how each of these can be accessed either at its full 64-bit (qword) width, or as the low dword, the low word, or the low byte:

rax (64 bits)
eax (32 bits)
ax (16 bits)
ah (8 bits)  al (8 bits)

We saw that many instructions (including mov and xor), when operating on the low dword portion of a register, will implicitly zero the high dword, and we looked at some strategies for preserving the high dword.

We looked at the flags register rflags, which is used to store information about the results of various operations. The most important flags for us will be

We saw that the mov instruction is our basic tool for moving data around, between registers, memory, and immediate (constant) values. We also saw the xchg instruction, which swaps operands, saving us from having to use a register as a temporary variable.

Negative values

All of the arithmetic operations described previously assume that the incoming value is positive. How are negative values handled? Historically, there have been four choices:

Multiplication is an expensive enough operation that we generally don’t bother trying to do it “inside” the number representation. Instead, we just make both operands positive, multiply, and then negate the result if needed.

Another example:

   11  1
   01110110  = 118
 + 11100101  = -27
────────────
   01011011  = 91

To negate a value (positive or negative), flip all of its bits and then add 1.

This works regardless of whether the input value is positive or negative. E.g.,

11100101  = -27
00011010        (flip all bits)
00011011  =  27 (add 1)
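
The flip-and-add-1 rule can be checked in C++ as well. This is only a sketch: negate8 is a made-up helper name, and it works on unsigned 8-bit patterns so that wraparound is well defined.

```cpp
#include <cstdint>

// Two's-complement negation of an 8-bit pattern: flip every bit, then add 1.
// (negate8 is a hypothetical helper, not part of any library.)
std::uint8_t negate8(std::uint8_t bits) {
    std::uint8_t flipped = static_cast<std::uint8_t>(~bits); // flip all bits
    return static_cast<std::uint8_t>(flipped + 1);           // add 1
}
```

Calling negate8(0xE5) (the pattern 11100101 from the example) gives 0x1B, and applying it again gives back 0xE5.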

Expanding data sizes

Suppose we have an 8-bit value and we wish to store it into a 16-bit location. If the value is unsigned then this is easy: we copy the value into the low 8 bits, and we fill the high 8 bits with 0s. But what if the value is signed (two’s-complement)? In this case, in order to get the equivalent value, we must sign-extend the number, filling the high bits with copies of the high bit in the original value. If the high bit was originally 1, then all of the high 8 bits must be 1; otherwise they must all be 0.

Many arithmetic operations that can “mix” values of different word sizes will come in two forms: an unsigned form that “zero-extends” (fills with 0s) and a signed form that sign-extends (copies the high-bit).
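
As a rough illustration of the two forms, here is a C++ sketch that widens an 8-bit pattern to 16 bits both ways. The names zero_extend and sign_extend are invented for this example.

```cpp
#include <cstdint>

// Zero-extension: the high 8 bits are filled with 0s (unsigned interpretation).
std::uint16_t zero_extend(std::uint8_t b) {
    return static_cast<std::uint16_t>(b);
}

// Sign-extension: the high 8 bits are filled with copies of bit 7.
std::uint16_t sign_extend(std::uint8_t b) {
    std::uint16_t high = (b & 0x80) ? 0xFF00 : 0x0000;
    return static_cast<std::uint16_t>(high | b);
}
```

For the pattern 0xE5 (-27 as a signed byte), zero_extend gives 0x00E5 (229) while sign_extend gives 0xFFE5, which is -27 as a 16-bit two's-complement value.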

In-memory representations

For single-byte values, the above representations are used. For multi-byte values, however, there are a few options. Consider a 16-bit value. When we place this into a memory address a, it can be done in two ways:

If big-endianness seems insane, consider that this is how a 16-bit value would be written from left-to-right, assuming memory addresses increase to the right:

high byte   low byte
addr        addr + 1

Little-endian is used by Intel systems, and thus we don’t need to worry about big-endian. Big-endian is used by a few microcontrollers (AVR32) and a few big-iron processors. If you’re writing a file format (or a network protocol) then you have to define a “standard” endianness and have to ensure that the correct translations are done in software, when necessary. But for us, if you want to access the high byte of a 16-bit value in memory, it can be found at address + 1.
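
To make the little-endian layout concrete, here is a small C++ sketch that places a 16-bit value into two bytes by hand, so the result does not depend on the machine it runs on. The helper names store_le16 and load_le16 are assumptions for this example; index 0 stands for address a and index 1 for a + 1.

```cpp
#include <array>
#include <cstdint>

// Lay out a 16-bit value in little-endian order:
// result[0] models address a, result[1] models address a + 1.
std::array<std::uint8_t, 2> store_le16(std::uint16_t value) {
    return { static_cast<std::uint8_t>(value & 0xFF),   // low byte at a
             static_cast<std::uint8_t>(value >> 8) };   // high byte at a + 1
}

// Reassemble the value from its two bytes.
std::uint16_t load_le16(const std::array<std::uint8_t, 2>& bytes) {
    return static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
}
```

Storing 0x1234 this way puts 0x34 at index 0 and 0x12 at index 1, matching the rule that the high byte is found at address + 1.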

Accessing memory

In 64-bit mode all addresses are 64 bits wide, hence the full register (rax, rbx, etc.) must be used to store an address. As we’ve seen, the label used to define a string in the .data section is actually the address of that string, hence we can load the address of a string my_text into rax with

mov rax, my_text 

You can think of rax in this usage as a pointer-typed variable, holding the address of something.

We can “dereference” a memory address by putting it in square brackets:

mov al, byte [my_text]

The byte qualifier is not strictly required here, but it’s good practice to add it. An easy mistake to make is

mov rax, [my_text] ; Read one *qword* from my_text

which reads not one byte but eight (a qword), whereas

mov rax, byte [my_text] ; Read one byte from my_text

will give an error when you try to assemble.

(Of course, if you need more than just the first byte, you might want to get them in chunks of eight, for speed…)

In reality, if we were processing a string, we would want to iterate through it, rather than just accessing the first byte. It would be more useful to put the address my_text into a register and then “dereference” that:

mov rsi, my_text
mov al, byte [rsi]

We can then inc rsi to increment rsi and access the next byte in the string. Because my_text is an immediate, we cannot increment it. (Again, the byte qualifier on [rsi] is not required, as it could be inferred from the size of al.)

Simple looping

Because doing anything interesting will require looping, we’ll introduce the loop instruction. loop takes a single operand, a label to jump to (internally, loop stores the offset of the label’s address from the current instruction’s address). loop decrements rcx and then, if rcx is not zero, jumps to the label; otherwise execution falls through to the next instruction.

Thus, the structure of a basic loop would look something like this:

    mov rcx, init       ; Initialize rcx > 0

.start_loop:

    ; ... Perform loop operation using rcx

    loop .start_loop

    ; ... Continue after end of loop

This is roughly equivalent to a C/C++-style do-while loop:

rcx = init;
do {

    // ... Perform loop operation

    --rcx;
} while(rcx != 0);

Note that because rcx is one of the registers syscall is allowed to clobber, if you do any syscalls inside the loop, you will need to save rcx before the call, and then restore it after, before loop.

As a demo of this, we can modify our “Hello, world” program to print “Hello, world!” backwards, by printing one character at a time, from the end to the beginning. (We’ll still use the write syscall, we’ll just tell it to print a single character instead of the entire string.)

It may be useful to consider how we would do this in C or C++. The C++ equivalent to our original assembly program would be something like this:

#include <iostream>
using namespace std;

int main()
{
    const char* msg = "Hello, world!";
    const int MSGLEN = 13;

    cout.write(msg, MSGLEN); // equiv. to write syscall
}

To write one character at a time, we need a loop that starts at the end of the string, and writes one character at a time, backwards, something like this:

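#include <iostream>
using namespace std;

int main()
{
    const char* msg = "Hello, world!";
    const int MSGLEN = 13;

    int c = MSGLEN;                        // plays the role of rcx
    do {

        const char* addr = msg + c - 1;    // address of the c-th character
        cout.write(addr, 1);

        --c;
    } while(c != 0);
}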

I’ve intentionally written the do-while loop in a way that mirrors the execution of the loop instruction, to make it easier to translate to assembly.

Our original program looked like this:

section .data

msg:            db      10, "Hello, world!"
MSGLEN:          equ     $-msg

section .text

;; Program code goes here

global _start
_start:






    mov     rax,    1               ; Syscall code in rax
    mov     rdi,    1               ; 1st arg, file desc. to write to
    mov     rsi,    msg             ; 2nd arg, addr. of message
    mov     rdx,    MSGLEN          ; 3rd arg, num. of chars to print
    syscall






    ;; Terminate process
    mov     rax,    60              ; Syscall code in rax
    mov     rdi,    0               ; First parameter in rdi
    syscall                         ; End process

I’ve removed the trailing 10 (\n) from the text, and moved it to the beginning, so it will still print at the “end”.

The first syscall will be inside the loop, so we can add:

section .data

msg:            db      10, "Hello, world!"
MSGLEN:          equ     $-msg

section .text

;; Program code goes here

global _start
_start:

    mov     rdi,    1               ; 1st arg, file desc. to write to
    mov     rdx,    1               ; 3rd arg, num. of chars to print



.begin_loop:




    mov     rax,    1               ; Syscall code in rax
    mov     rsi,    msg             ; 2nd arg, addr. of message
    syscall




    loop .begin_loop

    ;; Terminate process
    mov     rax,    60              ; Syscall code in rax
    mov     rdi,    0               ; First parameter in rdi
    syscall                         ; End process

Note that syscall preserves rdi and rdx, so we can set those once outside the loop. However, rax is used for the return value, so we should set it every time through the loop, and rsi is the address of the start of the string, which will change as we move through the string.

We need to initialize rcx to the length of the string:

mov rcx, MSGLEN

and then we set rsi (the address to write) to rcx + msg - 1.

mov rsi, rcx
add rsi, msg-1

(add a, b performs addition, a += b, and dec a decrements --a. Both are subject to the usual restrictions: no memory-to-memory operations, both operands of the same size, etc. Because msg is a constant, the msg-1 is performed at assembly-time.)

Finally, note that rcx is one of the registers that syscall is allowed to “clobber” (r11 is the other), so we have to save it into another, safe register before the syscall and then restore it afterwards:

mov r15, rcx
syscall
mov rcx, r15

That leaves us with:

global _start
_start:

    mov     rdi,    1               ; 1st arg, file desc. to write to
    mov     rdx,    1               ; 3rd arg, num. of chars to print

    mov rcx, MSGLEN                 ; loop counter = MSGLEN

.begin_loop:

    ; Print 1 char at [msg + rcx - 1]

    mov     rax,    1               ; Syscall code in rax

    mov rsi, rcx                    ; rsi = addr to print
    add rsi, msg
    dec rsi

    mov r15, rcx                    ; Save rcx before syscall
    syscall
    mov rcx, r15                    ; Restore rcx

    loop .begin_loop

    ;; Terminate process
    mov     rax,    60              ; Syscall code in rax
    mov     rdi,    0               ; First parameter in rdi
    syscall                         ; End process

Local labels

When writing loops or other labels that exist inside a function, it’s useful to write them as “local” labels, by starting them with a period. A local label is actually named after the most recent non-local label, so the full name of .begin_loop is actually _start.begin_loop. Labels can normally only be defined once per file, so without a local label, no other function we wrote could use the label begin_loop.

We’ll use local labels for all our loop and branch targets, and only use non-local labels for functions.

Negative rcx

In case you’re curious, let’s consider what happens if rcx is negative and we decrement it. E.g., if rcx = 11111111 (= -1), and we decrement:

   11111111
 - 00000001
────────────
   11111110  = -2

In other words, the result is exactly what you’d expect (but not particularly useful when used with loop).

Loop variants

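There are two variants of the loop instruction which test the zero flag (ZF) along with the value of rcx: loope (also written loopz) decrements rcx and jumps if rcx != 0 and ZF = 1, while loopne (also written loopnz) decrements rcx and jumps if rcx != 0 and ZF = 0.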

The zero flag is connected to the idea of (in)equality because, if we perform a subtraction:

sub a, b

and a == b, then the zero flag will be set, otherwise it will be unset.

Including files

Like C/C++, yasm has a simple mechanism for including the contents of one .s file into another:

%include "source.s"

copies the contents of source.s into the current assembly file. For example, we could start to centralize a lot of our syscall definitions in an include file:

;;;
;;; sysdefs.s
;;;
[section .data]

SYS_write   equ     1
SYS_exit    equ     60

SYS_stdin   equ     0
SYS_stdout  equ     1
...

__SECT__

The [section .data] and __SECT__ stuff is “magic” to temporarily switch to the data section and then switch back to whatever section we were in before.

Arithmetic operations

add dest, src       ; dest += src
sub dest, src       ; dest -= src

add and sub perform addition and subtraction between two operands of the same size. Internally, sub is just addition with the second operand negated, and the carry inverted at the end.

Like many operations, add and sub are dyadic: they take two operands, a destination and a source, with the destination serving as both the first input and the target for the output of the operation.

Flags

add and sub set/unset the OF, SF, ZF, AF, CF, and PF flags:

Note that all flags are set/cleared on all operations, but some flags only make sense on signed/unsigned operations. The add/sub instructions don’t know whether you are performing a signed or unsigned operation, so it’s up to you to make sure you check the correct flags for the type of operation you are performing.

Example: Let’s perform an add operation and see how the flags are set from it:

  111  11
   10110011   = 179 (unsigned)   = -77 (signed)
 + 01100110   = 102 (unsigned)   = 102 (signed)
────────────
1  00011001   =  25 (unsigned)   = 25  (signed)

Here are two tables summarizing the results of addition and subtraction on the OF, CF, ZF, and SF flags (U is the unsigned value, S is the signed value):

Addition

A                B                A + B            Flags
Hex  U    S      Hex  U    S      Hex  U    S      OF  SF  ZF  CF
7F   127  127    00   0    0      7F   127  127    0   0   0   0
FF   255  -1     7F   127  127    7E   126  126    0   0   0   1
00   0    0      00   0    0      00   0    0      0   0   1   0
FF   255  -1     01   1    1      00   0    0      0   0   1   1
FF   255  -1     00   0    0      FF   255  -1     0   1   0   0
FF   255  -1     FF   255  -1     FE   254  -2     0   1   0   1
FF   255  -1     80   128  -128   7F   127  127    1   0   0   1
80   128  -128   80   128  -128   00   0    0      1   0   1   1
7F   127  127    7F   127  127    FE   254  -2     1   1   0   0

Subtraction

A                B                A - B            Flags
Hex  U    S      Hex  U    S      Hex  U    S      OF  SF  ZF  CF
FF   255  -1     FE   254  -2     01   1    1      0   0   0   0
7E   126  126    FF   255  -1     7F   127  127    0   0   0   1
FF   255  -1     FF   255  -1     00   0    0      0   0   1   0
FF   255  -1     7F   127  127    80   128  -128   0   1   0   0
FE   254  -2     FF   255  -1     FF   255  -1     0   1   0   1
FE   254  -2     7F   127  127    7F   127  127    1   0   0   0
7F   127  127    FF   255  -1     80   128  -128   1   1   0   1
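
The rows above can be reproduced with a small model of byte-sized add and sub. This is a sketch under assumptions: Result8, add8 and sub8 are names invented here, and only the four flags from the tables are modeled (AF and PF are left out).

```cpp
#include <cstdint>

// Result of a modeled 8-bit operation plus the four flags from the tables.
struct Result8 { std::uint8_t value; bool of, sf, zf, cf; };

// Model of an 8-bit add.
Result8 add8(std::uint8_t a, std::uint8_t b) {
    unsigned wide = static_cast<unsigned>(a) + b;       // 9-bit sum
    Result8 r;
    r.value = static_cast<std::uint8_t>(wide);
    r.cf = wide > 0xFF;                                  // carry out of bit 7
    r.zf = (r.value == 0);
    r.sf = (r.value & 0x80) != 0;                        // copy of the high bit
    // Signed overflow: both inputs differ in sign from the result.
    r.of = ((a ^ r.value) & (b ^ r.value) & 0x80) != 0;
    return r;
}

// Model of an 8-bit sub; here CF acts as a borrow flag.
Result8 sub8(std::uint8_t a, std::uint8_t b) {
    Result8 r;
    r.value = static_cast<std::uint8_t>(a - b);
    r.cf = a < b;                                        // a borrow was needed
    r.zf = (r.value == 0);
    r.sf = (r.value & 0x80) != 0;
    // Signed overflow: inputs differ in sign, and the result's sign differs from a.
    r.of = ((a ^ b) & (a ^ r.value) & 0x80) != 0;
    return r;
}
```

For the worked example, add8(0xB3, 0x66) gives the value 0x19 with CF set and OF, SF and ZF clear: the unsigned result (281) did not fit in a byte, but the signed result (-77 + 102 = 25) is correct.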


Increment and decrement

inc dest    ; ++dest
dec dest    ; --dest

inc and dec increment/decrement their single operand, which can be either a register or a memory location. inc and dec do not modify the carry flag, as an add r, 1 or sub r, 1 instruction would. The flags OF, SF, ZF, AF, and PF are set/cleared as expected. When used on signed values, the behavior is still correct (incrementing a negative value brings it closer to 0, decrementing a negative value makes it more negative).

Addition/subtraction larger than 64 bits

The largest registers we have are 64 bits wide (qword). What if we want to perform an addition/subtraction on 128-bit operands (represented as, e.g., rdx:rax)? Let’s consider how we would perform word-sized addition, if the only addition we could do natively was byte-sized:

    1111111←   1111
   00101101 11001101
 + 00010010 10101011 
─────────────────────
   01000000 01111000  

Adding the low bytes produced an extra carry (CF = 1), which we then used to start the addition of the high bytes. We effectively need two kinds of addition:

This is how we perform larger-than-qword addition: there is another addition operation, add-with-carry (adc), which uses the state of the carry flag CF as the carry into the first (lowest) bit’s addition.

adc dest, src       ; dest = dest + src + CF

For subtraction there is sbb, subtract-with-borrow.

Thus, to add the double-qword rcx:rbx to rdx:rax, we would do

add rax, rbx 
adc rdx, rcx
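
The same add/adc pair can be modeled in C++ with two 64-bit halves. This is a sketch only: the U128 struct and add128 function are hypothetical names, and the carry is recovered by checking for unsigned wraparound, since C++ has no direct access to CF.

```cpp
#include <cstdint>

struct U128 { std::uint64_t hi, lo; };   // hi:lo, like rdx:rax

// Model of "add lo, lo" followed by "adc hi, hi".
U128 add128(U128 a, U128 b) {
    U128 r;
    r.lo = a.lo + b.lo;                           // add: low qwords
    std::uint64_t carry = (r.lo < a.lo) ? 1 : 0;  // CF: the low add wrapped around
    r.hi = a.hi + b.hi + carry;                   // adc: high qwords plus CF
    return r;
}
```

Adding 1 to a value whose low qword is all 1s yields a low qword of 0 and a high qword one larger, which is the carry moving from the low half to the high half.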

The corresponding subtraction of rcx:rbx from rdx:rax would use sub rax, rbx followed by sbb rdx, rcx.

Multiplication and division

Multiplication and division are more complex than addition/subtraction. We will cover them in more detail later, but for now:

To store a double-qword (128-bit) result, we use a combination of rax and rdx: rax stores the low qword while rdx stores the high qword. We write this combination as rdx:rax. (Using a similar notation, we could say that ax = ah:al.) Smaller one-operand multiplications use the correspondingly smaller register pair (edx:eax or dx:ax), except for the byte-sized form, which places its whole result in ax.

The unsigned/signed multiplication instructions are mul and imul, respectively:

Instruction        Equivalent
mul rm             rdx:rax = rax * rm, unsigned
imul rm            rdx:rax = rax * rm, signed
imul r, rm         r *= rm, signed
imul r, rm, imm    r = rm * imm, signed

(For some reason, the signed multiply comes in two- and three-argument variants, while the unsigned only takes one.)

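The CF and OF flags are set/cleared together: both are set if the full result does not fit in the low-order destination (rax for the one-operand forms, r for the others), and both are cleared otherwise. For mul this means the high half is nonzero; for imul it means the full result is not the sign-extension of the low-order part. In the two- and three-operand forms, a result that does not fit into the destination is truncated (high bits discarded). The values in the other flags are undefined.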
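
A scaled-down model may help here. The sketch below imitates the dword-sized one-operand mul, which computes edx:eax = eax * rm, using a 64-bit product in C++. Mul32 and mul32 are names made up for this example.

```cpp
#include <cstdint>

struct Mul32 { std::uint32_t hi, lo; bool cf_of; };  // hi:lo, like edx:eax

// Model of the one-operand unsigned multiply at dword size: edx:eax = eax * rm.
Mul32 mul32(std::uint32_t eax, std::uint32_t rm) {
    std::uint64_t full = static_cast<std::uint64_t>(eax) * rm;  // full 64-bit product
    Mul32 r;
    r.lo = static_cast<std::uint32_t>(full);        // low dword (eax)
    r.hi = static_cast<std::uint32_t>(full >> 32);  // high dword (edx)
    r.cf_of = (r.hi != 0);  // CF and OF are set together when the high half is nonzero
    return r;
}
```

Multiplying 0xFFFFFFFF by 2 gives hi = 1 and lo = 0xFFFFFFFE with the flags set, while 3 * 4 fits in the low half and leaves them clear.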

Division only has a single-operand form, where the operand contains the divisor; the dividend is in rdx:rax. div/idiv produce both the quotient (truncated toward zero) in rax and the remainder (i.e., modulo or %) in rdx. Unlike C++ where we have / for integer division and % for integer modulo, in assembly a single instruction gives us both results.

Instruction    Equivalent
div rm         rax = rdx:rax / rm and rdx = rdx:rax % rm, unsigned
idiv rm        rax = rdx:rax / rm and rdx = rdx:rax % rm, signed

An overflow in division is indicated not by setting the carry flag, but by a divide-error exception #DE, which is sent to our process as a signal SIGFPE. For now, this will immediately crash our program, but later we’ll see how to write a signal handler to deal with it in a more graceful manner. (Of course, we could also avoid the overflow by checking the operands before performing the division.)
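
One way to check the operands first is sketched below, again scaled down to the dword-sized unsigned div (dividend in edx:eax, quotient in eax, remainder in edx). DivResult and checked_div32 are hypothetical names; the ok field reports whether the real instruction would have completed without #DE.

```cpp
#include <cstdint>

struct DivResult { bool ok; std::uint32_t quotient, remainder; };  // like eax, edx

// Model of unsigned div at dword size with the dividend in hi:lo (edx:eax).
// ok is false in the cases where the CPU would raise #DE.
DivResult checked_div32(std::uint32_t hi, std::uint32_t lo, std::uint32_t divisor) {
    DivResult r = { false, 0, 0 };
    if (divisor == 0) return r;    // division by zero
    if (hi >= divisor) return r;   // quotient would not fit in 32 bits
    std::uint64_t dividend = (static_cast<std::uint64_t>(hi) << 32) | lo;
    r.ok = true;
    r.quotient = static_cast<std::uint32_t>(dividend / divisor);
    r.remainder = static_cast<std::uint32_t>(dividend % divisor);
    return r;
}
```

The test hi >= divisor is enough to detect overflow for the unsigned case: the quotient fits in the low half exactly when the high half of the dividend is smaller than the divisor.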

Simple functions

As we’ll see later, calling C functions from assembly, or making our assembly functions callable from C/C++ requires a few extra steps to set up the stack correctly. However, as long as we stay purely in “assembly-land”, we don’t need to worry about the extra complexity; we can essentially make functions work however we like. The only requirements are that we be able to return from a function and get back to where we were.

The two instructions that handle functions are call and ret. Both use the stack internally:

These work together as follows (addresses are just made up):

Address  Instruction       Address  Instruction
         _start:                    my_func:
0x100    call my_func      0x200    mov eax, ...
0x108    mov rbx, rax      0x208    ...
                           0x280    ret

While my_func is executing, the stack contains 0x108, the return address. When ret is executed, this address is popped off the stack and we resume execution at that point. (Later, we’ll see that this means if you’re using the stack for anything else, you have to make sure you’ve popped everything else off before you ret, so that at that point, the only thing on the stack is the return address.)

Although we can use any “calling convention” we like, in terms of passing arguments and returning results, you should try to stick with what will eventually become the convention for calling functions:

As an example, let’s write a function that prints a string (given as an address and a length) and adds a newline at the end. This will just wrap up the call to the write syscall we’ve been using:

section .data

newline:    db      10

section .text

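write_ln:

    ; rdi = address
    ; rsi = length

    ; Move the arguments into place before overwriting rdi and rsi
    mov rdx, rsi        ; 3rd arg: length
    mov rsi, rdi        ; 2nd arg: address
    mov rdi, 1          ; 1st arg: file desc. (stdout)
    mov rax, 1          ; Syscall code for write
    syscall

    mov rax, 1
    mov rdi, 1
    mov rsi, newline
    mov rdx, 1
    syscall

    ret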

(Because we can switch sections at any time, you could put this code in a file and then %include it to make the write_ln function available in any program you write.)

To use this, we load the registers with the appropriate information and then issue a call:

section .data

msg:    db      "Hello, world!"
MSGLEN: equ     $-msg

section .text

    mov rdi, msg
    mov rsi, MSGLEN
    call write_ln

    ; Normal exit syscall...

We could, of course, further tidy this up by writing a function to wrap up the write syscall:

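sys_write:

    ; rdi = address
    ; rsi = length

    ; Move the arguments into place before overwriting rdi and rsi
    mov rdx, rsi        ; 3rd arg: length
    mov rsi, rdi        ; 2nd arg: address
    mov rdi, 1          ; 1st arg: file desc. (stdout)
    mov rax, 1          ; Syscall code for write
    syscall

    ret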

and then our write_ln becomes just

write_ln:

    ; rdi = address
    ; rsi = length
    call sys_write

    mov rdi, newline
    mov rsi, 1
    call sys_write

    ret

Note that our functions “clobber” the registers rdi, rsi, rax and rdx (and, because they use syscall, rcx and r11 as well): when calling either write_ln or sys_write you cannot rely on the values of these registers being preserved through the call. Theoretically, which registers are clobbered and which are preserved should be part of our function calling convention, so that when we call a function, we know which registers are “safe”. When we start using the stack, we’ll see how we can preserve register values by pushing them onto the stack before the function call, and then popping them off after it returns.

“Function pointers”

The address passed to call can be a register, not just a label:

mov r11, my_function
call r11

This is equivalent to calling a function through a function pointer.
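
For comparison, here is roughly what the same idea looks like in C++. The function my_function below is a stand-in with an invented body, and call_indirect is a hypothetical helper; the pointer variable plays the role of r11.

```cpp
// Hypothetical function, standing in for my_function in the assembly above.
int my_function(int x) {
    return x + 1;
}

// Call through a pointer: the target address lives in a variable (like r11).
int call_indirect(int (*target)(int), int arg) {
    return target(arg);
}
```

A call such as call_indirect(my_function, 41) passes the address of my_function and then calls whatever address it was given, just as call r11 does.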