Function calling conventions

To review, when we call a function, we have to choose some registers to use for arguments, at least one to use for return value, some to be caller-saved (available for temporary use by the function) and some to be callee-saved. Our choices for these were selected so as to align with the standard Unix C ABI calling convention, so with a bit more work, our functions will be compatible with the C standard library.

Register	Use
`rax`	Return value
`rbx`	Callee-preserved
`rcx`	4th argument
`rdx`	3rd argument
`rsi`	2nd argument
`rdi`	1st argument
`rbp`	Callee-preserved
`rsp`	Stack pointer
`r8`	5th argument
`r9`	6th argument
`r10`	Temporary (caller-preserved)
`r11`	Temporary (caller-preserved)
`r12`-`r15`	Callee-preserved

Note that all argument and return value registers count as caller-preserved. I.e., if your function is using rcx and you want to call a function, you must preserve rcx even if the function doesn’t use it as an argument.

The general process for calling a function is thus

...
    push r10        ; Push any caller-saved registers in use
    call func   
    pop r10         ; Restore after return

Similarly, within a function, we have a preamble which saves any callee-saved registers, and a prologue which restores them:

func:
    push r12        ; Push any callee-saved registers in use
    ...
    pop r12         ; Restore them before return
    ret

As we will see, making our functions C-compatible will require a few changes to the function preamble/prologue to setup the stack correctly.

Stack operations

To manually add and remove elements to/from the stack, we use the push and pop operations:

push op – Pushes the operand op onto the stack. I.e., decrements rsp by the size of op and then performs mov [rsp] op. op cannot be a 64-bit immediate, but it can be a smaller immediate (8, 16, or 32-bits). Note that you cannot push a byte-sized register or memory operand, but you can push a byte-sized immediate.
pop op – Pops the top value of the stack into op. I.e., performs mov op [rsp] and then increments rsp by the size of op. op cannot be an immediate (obviously), and must be a word or larger.

Note that in x86-64, the stack starts at the “end” of memory and grows backwards, so that pushing decrements rsp and popping increments it. Thus, to look “down” the stack we will access rsp +some distance.

Note that you are free to modify the stack without using these instructions. For example, if you wish to “pop” the top 8 bytes off the stack but don’t care about their values, you can simply do

add rsp, 8

to move rsp up by 8.

The fact that the stack starts at the end of a process’s address space and grows backwards means that if you look at rsp at any point, it’s probably huge (on the order of terabytes). This doesn’t mean that your process can actually use that much memory; as we’ll see later, the system has a way of not allocating parts of memory which are unused.

Sorting in Assembly

To review we’re going to develop a couple of 133 algorithms into their assembly equivalents. Here is the preamble we’re working with:

section .data

; Load 1M of random data from random.dat

data:           incbin                  "random.dat"
DATLEN:         equ                     $-data

section .text
global _start
_start:
...

(Rather than going to the trouble of using the open/read syscalls, we just compile the random data directly into the executable!)

Checking for sorted-ness

We’re going to implement a sorting algorithm, but we need a way to check and see if it is working correctly. So we’re going to write a function is_sorted which will set rax to 1 if the input array is sorted, and 0 if it is not. It will receive the address of the array as rdi and the length as rsi. We will consider the array to consist of signed qwords (8 bytes per array entry).

In C/C++, we could check such an array with something like

bool is_sorted(long* arr, size_t length) {
    size_t i = 0; 
    while(i < length-1) {
        if(arr[i] > arr[i+1])
            return false;

        ++i;
    }

    return true;
}

There are three situations where we have non-sequential execution:

If the while condition is false, we jump to the final return.
At the end of the loop, we jump (unconditionally) back to the beginning.
If the if condition is false, we jump to the i++ at the end of the loop.

In assembly, we have something like this:

is_sorted:

    ; rdi = address of array
    ; rsi = length (in bytes)
    ; Returns: rax = 1 if sorted

    sub rsi, 8          ; Last element
    mov rax, 0
.while:
        cmp rax, rsi
        je .done

        mov r8, qword [rax + rdi]
        mov r9, qword [rax + rdi + 8]
        cmp r8, r9
        jle .continue
        mov rax, 0        
        ret

    .continue:
        add rax, 8
        jmp .while

.done:
    mov rax, 1
    ret

We’re using an extended form of the memory operand to simplify the array lookups. In particular, memory operands can consist of a sum of

A constant address
Up to two registers, where one is called the base register and the other the index register and
The index register can be multiplied by 1, 2, 4, or 8.

Since there’s no restriction on which registers can be bases vs. indexes, in practice, one of the registers can be multiplied and the other cannot.

There’s one difference between the assembly and the C/C++ code: the length in assembly is given in the number of bytes, not the number of array elements. It’s fairly common in assembly routines for lengths to be given in bytes, regardless of the actual size of the elements.

Insertion sort

We’re going to implement insertion sort. In C++ insertion sort looks like this:

void insertion_sort(long* arr, unsigned long size) {
    unsigned long i = 1, j;
    while(i < size) {
        j = i;
        while(j > 0) { 
            if(arr[j] >= arr[j-1])
                break;
            else
                std::swap(arr[j], arr[j-1]);

            --j;
        }

        ++i;
    }
}

(I’ve written this using while loops so that it will be easier to translate into assembly; the natural way to write it in C/C++ would be to use for loops.)

In assembly, a while loop has the structure:

.while:
    cmp ...
    jcc .end_while

        ...

    jmp .while
.end_while

In keeping with our calling convention for functions, our function will receive the address of the array in rdi, and the size (unsigned) in rsi.

insertion_sort:
    ; rdi = addr of array
    ; rsi = length of array

    mov rax, 1          ; rax = outer loop index, unsigned

    .while_i:
        cmp rax, rsi
        je .done
        mov rbx, rax    ; rbx = inner loop index

        .while_j:
            cmp rbx, 0
            je .done_i

            mov r8, qword [rdi + 8*rbx]
            mov r9, qword [rdi + 8*rbx - 8]
            cmp r8,r9
            jge .done_i ; break

            ; swap
            mov qword [rdi + 8*rbx],     r9
            mov qword [rdi + 8*rbx - 8], r8

            dec rbx
            jmp .while_j

    .done_i:
        inc rax
        jmp .while_i

.done
    ret

We could simplify the comparison of the two values in the inner loop by using the string comparison operation cmpsb which allows a memory-memory comparison between byte [rsi] and byte [rdi].

Binary search

Implementing binary search allows us to make use of another conditional operation: conditional moves. Conditional jumps are somewhat expensive, as the CPU cannot pre-load the instruction after the jump, until it has processed the conditional jump itself. A conditional move is a mov with a condition code attached; the move only happens if the condition is true. The conditional move instruction is cmov** where ** is one of the condition codes.

The restrictions on conditional moves’ operands are

The first operand must be a register
The second can be a register or memory, but not an immediate
Both must be the same size
Byte-sized conditional moves are not allowed.

For reference, the implementation of a binary search in C++ is

unsigned long binary_search(long* arr, unsigned long len, long target) {
    unsigned long low = 0, high = len-1, mid;

    while(low < high) {
        mid = (high - low) / 2 + low;

        if(target < arr[mid])
            high = mid-1;
        else if(target == arr[mid])
            return mid;
        else
            low = mid+1;
    }

    if(arr[low] == target)
        return low;
    else
        return -1; // 0xffffffff... 
}

(Normally we would write the condition as low <= high and skip the final if, however, if low and high are unsigned, then if high == 0 and we do high - 1, we get high = 0xfffff..., and so the condition is never false.)

binary_search:

    ; rdi = address of array
    ; rsi = length of array
    ; dl = target to search for

    mov rax, 0          ; rax = low = 0
    mov rbx, rsi        ; rbx = high = len-1
    dec rbx;

.while:
    cmp rax, rbx
    jae .not_found      ; Stop if rax > rbx

    ; rcx = mid = (high - low)/2 + low
    mov rcx, rbx        ; rcx = high
    sub rcx, rax        ; rcx -= low
    shr rcx, 1          ; rcx /= 2
    add rcx, rax        ; rcx += low                        

    mov r14, rcx        ; r14 = mid+1
    inc r14
    mov r15, rcx        ; r15 = mid-1
    dec r15

    cmp rdx, qword [rdi + 8*rcx] ; compare target, arr[mid]
    cmovg rax, r14               ; target > arr [mid] ===> low = mid+1
    cmovl rbx, r15               ; target < arr [mid] ===> high = mid-1
    cmove rax, rcx               ; target == arr[mid] ===> rax = mid (index)
    je .return                   ; target == arr[mid] ===> return

    jmp .while
.not_found:
    cmp qword [rdi + 8*rax], rdx
    je .return

    mov rax, -1
.return:
    ret

The section of code

    mov r14, rcx        ; r14 = mid+1
    inc r14
    mov r15, rcx        ; r15 = mid-1
    dec r15

might seem counterintuitive. How could it possibly be faster to compute mid-1 and mid+1, when it’s possible that neither will be needed? Due to instruction-level parallelism, the processor may actually be able to compute both of these at the same time, overlapping both with the computation of of cmp dl, byte [rcx]. Sometimes, it can actually be faster to give the compiler more work to do, so long as the work can be parallelized.

The final program, incorporating all these routines, is as follows:

;;;;
;;;; sort_search.s
;;;; Sorting and searching in assembly.
;;;;
section .data

data:           incbin                  "random.dat"
DATLEN:         equ                     ($-data) / 8

sorted:         dq      1,9,2,8,3,7,4,6,5
SORTLEN:        equ     ($-sorted)/8

section .text

;;; _start
;;; Program entry point
global _start
_start:

    mov rdi, data
    mov rsi, DATLEN
    call insertion_sort

    mov rdi, data
    mov rsi, DATLEN
    mov rdx, 10000
    call linear_search
    mov r10, rax ; Save index

    mov rdi, data
    mov rsi, DATLEN
    mov rdx, 10000
    call binary_search

    mov r8, 0
    mov r9, 1
    cmp rax, r10
    cmove rdi, r8
    cmovne rdi, r9  

    ; Exit with sorted-ness in exit code    
    mov rax, 60
    syscall 

;;; is_sorted
;;; Returns true if an array is sorted.
;;;
;;; rdi = address
;;; rsi = length
;;; Returns: rax = 1 if sorted, 0 otherwise.
is_sorted:

    ; rdi = address of array
    ; rsi = length
    ; Returns: rax = 1 if sorted

    dec rsi
    mov rax, 0
.while:
        cmp rax, rsi
        je .done

        mov r8, qword [8*rax + rdi]
        mov r9, qword [8*rax + rdi + 8]
        cmp r8, r9
        jle .continue
        mov rax, 0        
        ret

    .continue:
        inc rax
        jmp .while

.done:
    mov rax, 1
    ret

;;; insertion_sort
;;; Sorts and array of qwords
;;; rdi = address
;;; rsi = length
;;; Returns nothing
insertion_sort:
    ; rdi = addr of array
    ; rsi = length of array

    mov rax, 1          ; rax = outer loop index, unsigned

    .while_i:
        cmp rax, rsi
        je .done
        mov rbx, rax    ; rbx = inner loop index

        .while_j:
            cmp rbx, 0
            je .done_i

            mov r8, qword [rdi + 8*rbx]
            mov r9, qword [rdi + 8*rbx - 8]
            cmp r8,r9
            jge .done_i ; break

            ; swap
            mov qword [rdi + 8*rbx],     r9
            mov qword [rdi + 8*rbx - 8], r8

            dec rbx
            jmp .while_j

    .done_i:
        inc rax
        jmp .while_i

.done
    ret

;;; binary_search
;;; Does a binary search over an array for a target value
;;; rdi = address
;;; rsi = length
;;; rdx = target
;;; Returns: index of target or -1 (0xfffff...) if not found
binary_search:

    ; rdi = address of array
    ; rsi = length of array
    ; dl = target to search for

    mov rax, 0          ; rax = low = 0
    mov rbx, rsi        ; rbx = high = len-1
    dec rbx;

.while:
    cmp rax, rbx
    jae .not_found      ; Stop if rax > rbx

    ; rcx = mid = (high - low)/2 + low
    mov rcx, rbx        ; rcx = high
    sub rcx, rax        ; rcx -= low
    shr rcx, 1          ; rcx /= 2
    add rcx, rax        ; rcx += low                        

    mov r14, rcx        ; r14 = mid+1
    inc r14
    mov r15, rcx        ; r15 = mid-1
    dec r15

    cmp rdx, qword [rdi + 8*rcx] ; compare target, arr[mid]
    cmovg rax, r14               ; target > arr [mid] ===> low = mid+1
    cmovl rbx, r15               ; target < arr [mid] ===> high = mid-1
    cmove rax, rcx               ; target == arr[mid] ===> rax = mid (index)
    je .return                   ; target == arr[mid] ===> return

    jmp .while
.not_found:
    cmp qword [rdi + 8*rax], rdx
    je .return

    mov rax, -1
.return:
    ret

;;; linear_search
;;; Does a linear search in an array.
;;; rdi = address
;;; rsi = length
;;; rdx = target
;;; Returns rax = index of target or -1 (0xffff...) if not found
linear_search:
    ; rdi = address of array
    ; rsi = length of array
    ; dl = target to search for
    ; Returns: rax = index of target, or -1 (0xffff...) if not found

    mov rax, 0  ; Index

    .while:
        cmp rax, rsi
        je .not_found

        cmp rdx, qword [rdi + 8*rax]
        jne .continue       ; Not equal: next iteration
        ret                 ; Equal, return current rax

    .continue:
        inc rax        
        jmp .while
    .not_found:
    mov rax, -1
    ret

This does an insertion sort, then does a linear search, followed by a binary search (for the same target), and exits with code 0 (success) if both searches returned the same index.

The above file is available on the server as /usr/local/class/cs241/sort_search.s.

Bubble sort

Implementing Bubblesort is left as an exercise, but for reference, the C/C++ implementation is

void bubble_sort(long* arr, unsigned long len)
{
    unsigned long i = len-1, j;
    while(i > 0) {
        j = 0;
        bool swapped = false;
        while(j < i) {
            if(arr[j] < arr[j+1]) {
                std::swap(arr[j], arr[j+1]);
                swapped = true;
            }
            ++j;
        }

        if(!swapped)
            break;

        --i;
    }
}

C-compatible Functions

As we’ve seen, functions in assembly operate at a much lower level than in C or C++. We’ll have to learn how to manually manage the stack space used by our functions.

`call` and `ret`

The basic instructions for implementing function calls are call and ret:

call Func pushes the IP register rip onto the stack and then jumps to the label Func.
ret pops the top of the stack into rip and then jumps to rip. Because rip always points to the next instruction, this will jump to the instruction immediately after the call that brought us into this function.

Thus, at the most basic level, the stack is used to keep track of where to return to. This allows us to have function f call function g, and then g calls h, and when h/g return, we have a trail of “breadcrumbs” (return addresses) on the stack, telling us how to get back to f.

This is not all the stack is used for, however. Because we have a limited number of registers, we also use the stack for:

Temporary storage for register values we want to save and then restore later.
Passing parameters, when a function has more parameters than will fit in the six registers designated for this.
Returning values, when the return value is larger than will fit in rdx:rax.
Local variables, when there are more local variables than available registers (or when we need to use the address of a local variable; registers do not have addresses).

Stack direction

The stack, contrary to the way we normally draw it, starts at the end of a program’s memory space and grows backwards, towards 0. Thus, the stack operations push and pop have the following effects:

push x: subtract the size of x in bytes from rsp, and then mov [rsp], x.
pop x: mov x, [rsp] and then add the size of x to rsp.

It’s important to bear in mind that the size of the operand to push/pop affects how the stack pointer moves. If you push a word, rsp will be decremented by 2.

You can push/pop registers, memory operands, or immediates, but the minimum size is a word. You cannot push/pop individual bytes.

Stack alignment

The System-V x86-64 function call specification requires that the top of the stack (i.e., the address in rsp) always be aligned to a multiple of 16 bytes, plus an offset of 8 bytes before any call instruction. This means that the value of rsp must always be expressible as

$$\mathtt{rsp} = 16 a + 8$$

for some integer a whenever we call a function. Put another way, if you divide rsp by 16, you should always get a remainder of 8.

(The reason for the extra 8 is that call itself pushes 8 bytes onto the stack, in the form of the return address.) We didn’t have to worry about this as long as we weren’t interacting with the rest of the system, but as soon as we try to write main instead of _start, we’ll need to follow these rules.

The system will setup the stack for us in this way when it calls our main function, but then it immediately pushes rip, making it off by 8 again. Hence, the first thing that every function should do is realign the stack pointer, either by

Subtracting 8 from rsp: sub rsp, 8 (remember that the stack grows down in terms of addresses!)
Pushing rbp onto the stack (see below): push rbp

Before returning, a function should undo these actions, so that when ret pops the rip, the stack will again be correctly aligned.

Similarly, if you are pushing arguments onto the stack (see next section), you will have to ensure that the final rsp immediately before the call is properly aligned, otherwise the call will segfault, because the compiler has generated aligned-moves in the assembly of the function.

Leaf functions

A “leaf function” is one that

Takes all of its parameters in registers
Returns its result in registers
Does not call any other functions
Does not push anything onto the stack

I.e., a leaf function does not use the stack, either directly or indirectly. Hence, a leaf function is allowed to not align the stack pointer, and also to skip setting up rbp (see below). This is fine because a leaf function is never going to call another function or use the stack.

Passing arguments

Traditionally, arguments were passed by pushing them onto the stack, in the reverse order of that in which they are passed (i.e., last argument pushed first). This incurs significant overhead on calling functions, as functions must access memory in order to read in their arguments. The x86-64 application binary interface (which defines the function calling convention and many other standard assembly-level “interfaces”) modifies this so that

The first six (non-floating point) arguments are passed via registers, specifically the registers rdi, rsi, rdx, rcx, r8 and r9. If an argument is less than 64-bits then it should use the appropriate portion of its respective register (e.g., edi for 32-bits).
The first eight floating-point parameters are passed via the floating point registers xmm0 through xmm7.
Any remaining arguments are pushed onto the stack in reverse order.
Arguments larger than 64-bits are generally passed either on the stack, or in memory, as an address. We’ll cover passing/returning structures later, but see the System-V x86-64 ABI in the meantime for all the gory details.

E.g., suppose we have a function

int f(long x, float y, char* z);

The registers used by this function in call/return are

rdi – contains the value of x (signed qword)
xmm0 – contains the value of y
rsi – contains the value of z (address, qword)
eax – contains the return value (dword)

Note that because any stack-based arguments are pushed by the calling function, they will appear on the stack below the rip value pushed by the call instruction. This might be somewhat surprising if you were thinking of the called function’s stack frame (see below) as “starting” at the pushed rip. The stack frame includes, first, any registers that the function needs to preserve (caller-preserved), followed by any stack-based local variables.

Also note that, because a function pushes any extra arguments before rip, it is not possible for the called function to pop them off; hence, if your function pushes any arguments onto the stack, it should pop them off, or just increment rsp appropriately, after the function returns, so as to leave the stack in the same state.

As mentioned above, before executing call the stack pointer must be aligned to a multiple of 16 + 8, so if you are calling a function which has any stack-based arguments, you should add up the total size of these arguments and then, if necessary, adjust the rsp (by subtraction) so that it is aligned.

The reason for requiring stack alignment is that it enables the use of the more-efficient aligned move instructions. These move values into/out of the vector registers xmm0,1,..., but they require that the address of a memory operand be aligned to a multiple of 8 bytes. Attempts to do an aligned move from an unaligned address triggers a segfault. (E.g., if you call a C library function without aligning the stack pointer first, your program will segfault at the point of the call.)

Returning results

The return result should be placed in rax, or xmm0 if floating point. Technically, rax and rdx can be used to return either two values (not supported in C/C++) or a single 128-bit value, as rdx:rax.

Larger return values are typically “returned” by passing in a pointer to memory as an implicit first argument (i.e., in rdi, pushing all remaining arguments down in the list), modifying the memory within the function, and then returning the same pointer in rax. Often, the calling function will allocate space for the structure in its stack frame, but this is not easily accessible to the called function, hence the pointer-argument.

Returning from a function with local variables/callee-saved regs.

Remember that the ret instruction simply pops the stack into rip and then jumps there. Hence, if a function used the stack for any of its local variables, it must ensure that the top of the stack contains the calling function’s rip before executing ret. Typically this is done not by pop-ing, but just by incrementing the stack pointer rsp. Similarly, if you had to adjust rsp to align it, you will have to de-align before you start popping.

If rip is not on the top of the stack when ret is executed, whatever 8 bytes are on top of the stack will be treated as a 64-bit address and jumped to, most likely jumping into random memory and crashing your program.

Register usage

Because all functions must use the same set of registers, there is competition for who gets to use which register: is the value in a given register guaranteed to be preserved by a function call, or is the function allowed to use the register without bothering to save its value? Both have certain advantages:

If a function must preserve the value of a register (callee-preserved) it makes it much easier to call functions; you don’t have to worry about saving your registers before every call, and then restoring them afterwards (think about what we have to do with rcx when doing a syscall).
If a function is allowed to clobber the value of a register (caller-preserved) it makes it easier and faster to write functions: functions don’t have to do any extra work to save/restore registers, so they execute faster and use less of the stack.

When calling a function, some registers are used for passing arguments (see next section), some are reserved for the use of the called function (caller-preserved), and some are guaranteed/required to be saved by the called function (callee-preserved). You can rely on these registers containing the same values before and after a function call.

Of course, this means that when you are writing a function, you have to take care to preserve the values in these registers if you are using them, and restore them before you ret. The best way to do this is to push them onto the stack at the beginning of your function, and then pop them off immediately before you ret. (Of course, if any of your function’s arguments are on the stack – see the next section – then you will have to pop those before pushing the callee-saved registers.)

Register	Use
`rax`	Return value
`rbx`	Callee-preserved
`rcx`	4th argument
`rdx`	3rd argument
`rsi`	2nd argument
`rdi`	1st argument
`rbp`	Callee-preserved
`rsp`	Stack pointer
`r8`	5th argument
`r9`	6th argument
`r10`	Temporary (caller-preserved)
`r11`	Temporary (caller-preserved)
`r12`-`r15`	Callee-preserved

The argument and “temporary” registers are not callee saved, and must be saved by the calling function if their values should be preserved. This means that within a function we can uses these registers freely, without the need to save/restore their values. The registers labeled “callee-saved” must be saved by a function if it needs to modify them, and then restored to their original values before return.

The usual way to save the callee-saved registers (only the ones you use in your function need to be saved, of course), is to push them onto the stack. Because this is done inside the called function, at the beginning of the function in what’s called the preamble, these registers’ values will appear on the stack above the pushed rip.

Floating-point arguments

Although we haven’t covered floating-point yet, that uses registers as well, at least for the first 8 f-p arguments:

The first 8 floating-point arguments are passed in registers xmm0-xmm7.
Remaining f-p arguments are passed on the stack.
For variadic functions (those which take an unlimited number of arguments), set the al register to the number of floating-point arguments passed in xmm registers.

Stack frame

The portion of the stack that is “created” when a function is called is called its stack frame. At a minimum, this must include the return address (i.e., the rip that the ret instruction will jump to), but may include additional information:

Arguments, if there are more than 6
Values of callee-saved registers
Local variables (if there are too many to fit in registers)

Many small functions need nothing more than the return address; these functions have a fixed-size stack frame, just big enough to hold the address (i.e., 8 bytes).

Traditionally, the rbp (base pointer) register was used to point to (sort of) the beginning of a function stack frame; its value points to the stack element just above the pushed return address. If we want to use rbp this way, then we must preserve it over function calls, so the value on the stack that rbp points to is in fact the calling function’s rbp. Thus, the stack is laid out like

Addr	Contents
	…
	—-
	caller-saved registers…
`rbp+n`	…
`rbp+16`	stack-based arguments…
`rbp+8`	`rip` (return address)
`rbp`	caller’s `rbp` (`<-- rbp`)
`rbp-8`	callee saved regs and local vars
`rbp-m`	…
`rsp`	top of stack (`<-- rsp`)
…	Red area
`rsp-128`

This allows stack-based arguments and local variables to be accessed via offset from rbp. E.g., the first stack-passed argument is located at rbp + 16. (Because arguments are pushed by the caller, they come before rip on the stack, which is implicitly pushed by call.) Of course, the size of any stack-passed arguments affects where, above rbp, they will be located.

Before a function returns, it simply sets rsp to rbp (thus “popping” everything on the stack other than rbp and rip), pops into rbp (restoring the caller’s rbp) and the returns.

Used thus, rbp doesn’t exactly point to the beginning of the stack frame, but it points to the element just above rip, which is close enough. (The passed arguments on the stack, having been pushed by the calling function, are not part of the called function’s stack frame; they will be cleaned up by the calling function.)

Pushing/popping the rbp also takes care of aligning the stack pointer, as the rbp is 8 bytes, exactly the offset we would need to manually subtract to realign rsp. Again, if you don’t push rbp, you must manually align rsp by subtracting 8 from it. If you are pushing callee-saved registers or reserving space for local variables, you may need to subtract more or less in order to correctly align the stack pointer. It’s up to you to ensure that rsp is correctly aligned.

If you know where something is on the stack in relation to rsp (or, more likely, rbp), you can use it directly:

mov rax, [rbp + 16] ; First pushed argument

Similarly, if you’ve reserved some space above rbp for local variables, you can access them as

mov rax, [rbp - 8] ; First local variable (assuming no called-saved regs)

The 128 bytes above the top of the stack (i.e., below rsp) are called the “red area”. They can be used for temporary storage, although if you call any functions it may be overwritten. This allows “semi-leaf” functions (functions which do not call any other functions, but still want to use some stack space).

`enter` and `leave`

The enter and leave instructions perform this stack setup/teardown for you.

enter n pushes rbp onto the stack and then mov rbp, rsp and then decrements rsp by n, effectively reserving n bytes on the stack for local variables or callee-saved registers.
leave does mov rsp, rbp and then pops the old rbp. (Note that rsp must be adjusted prior to leave to ignore any local variable space!)

enter and leave are useful when we have functions which have local variables on the stack. However, as we’ve seen, we usually have enough registers so that this is not necessary, so the convenience of having instructions to setup/teardown the stack is not really necessary. I won’t use these instructions in any of the functions I write.

Calling standard library functions

If you link with the C standard library, then the entry point of your program becomes main, not _start; the exit code should be returned in rax (and main should end with ret). (All C functions are stored in the executable file with an underscore added to their names.)

To call C functions from assembly, it’s as simple as

extern printf
...
mov rdi, msg      ; Address of nul terminated string
call printf

Thus, a “Hello, world” program using the standard library becomes just

section .data

msg:      db      "Hello, world!", 10, 0

section .text
extern printf

global main
main:

  push rbp        
  mov rbp, rsp

  mov rdi, msg
  call printf
  ; Return value if any in rax

  pop rbp
  mov rax, 0      ; "return 0;"
  ret

(If you forget to push rbp or add rsp, 8 at the beginning, you’ll get a segfault when you try to call printf.)

To assembly a function that uses the standard library, you can either use the asm script provided, or do it manually, via

yasm -g dwarf2 -f elf64 -o program.o program.s -l program.lst
gcc -o program program.o

Only the link step is different; the assemble command-line is the same as before. As you can imagine, this will make things much easier going forward!

Function template

The following is a template for a function which preserves the callee-saved registers and uses rbp to point to the beginning of its stack frame:

;; func_name(arguments...)
;; What the function does
;; 
;; Arguments:
;;   arg1 -- rdi
;;   arg2 -- si
;;   ... etc
;;   arg7 -- qword, stack [rbp + 16]
;; 
;; Return value: eax
;;
global func_name
func_name:

    ;; Preamble
    push rbp     ; Save calling function's rbp
    mov rbp, rsp ; rbp points to our stack frame

    push r12     ; Push any callee-saved registers you use

    ; If you need space for stack-based local variables, you can reserve it with
    ;   sub rsp, amount

    ; Realign rsp to 16*rsp + 8

    ;; Function body
    ...          

    ;; Epilogue

    ; Remove any alignment added

    ; If you reserved any space for stack-based local variables, you should 
    ; restore it with
    ;   add rsp, amount    

    pop r12      ; Restore callee-saved registers
    pop rbp      ; Restore caller's rbp
    ret

Recursive functions

Note that a function written using the above template can be recursive without any additional work. As we would expect, the stack is used to keep track of all the different function invocations. Here is a recursive function which computes sum(x) = x + (x-1) + ... + 2 + 1 + 0:

global sum
sum:

    ;; Preamble
    push rbp     ; Save calling function's rbp
    mov rbp, rsp ; rbp points to our stack frame

    push rbx     ; Push any callee-saved registers you use
    sub rsp, 8   ; Re-align rsp

    cmp rdi, 0   ; If 1st arg == 0, goto base case
    je base_case

; Recursive case

    mov rbx, rdi ; Save argument
    dec rdi       
    call sum     ; Call sum(x-1)
    add rax, rbx ; Add, return in rax

    add rsp, 8
    pop rbx      ; Restore callee-saved registers
    pop rbp      ; Restore caller's rbp
    ret

base_case:
    mov rax, 0   ; Return 0

    add rsp, 8
    pop rbx      ; Restore callee-saved registers
    pop rbp      ; Restore caller's rbp
    ret

Note that we have to clean up the stack before any ret is issued, so if your function has multiple exit points, they must all have the same cleanup code. You might want to have a single exit point at the end of your function, and then just jmp to it from every other exit point, for simplicity.

Rewriting insertion sort as a proper function

Here we will rewrite our insertion_sort as a well-behaved function.

The registers we use in the implementation are

rdi, rsi as arguments. This is fine, we don’t have to save their values or anything.
rax,rbx as loop counters. rax is fine, because that is supposed to be the return value, and our function doesn’t return anything, but rbx is callee-preserved, so it’s our responsibility to save it and restore it before we ret.
r8 and r9 as temporary values used in swapping. These are normally used as arguments, so there’s no guarantee that the callee will preserve them, so we’re fine there.

;;; insertion_sort
;;; Sorts and array of qwords
;;; rdi = address
;;; rsi = length
;;; Returns nothing
insertion_sort:
    ; rdi = addr of array
    ; rsi = length of array

    push rbp            ; Stack base pointer
    mov rbp, rsp        ; rbp points to our stack frame
    push rbx            ; Save rbx
    sub rsp, 8          ; Re-align rsp


    mov rax, 1          ; rax = outer loop index, unsigned

    .while_i:
        cmp rax, rsi
        je .done
        mov rbx, rax    ; rbx = inner loop index

        .while_j:
            cmp rbx, 0
            je .done_i

            mov r8, qword [rdi + 8*rbx]
            mov r9, qword [rdi + 8*rbx - 8]
            cmp r8,r9
            jge .done_i ; break

            ; swap
            mov qword [rdi + 8*rbx],     r9
            mov qword [rdi + 8*rbx - 8], r8

            dec rbx
            jmp .while_j

    .done_i:
        inc rax
        jmp .while_i

.done:
    add rsp, 8        ; De-align rsp
    pop rbx
    pop rbp
    ret

Interoperating with C-functions

The rules given above describe the “calling convention” used by C functions. (C++ functions use the same calling conventions, but the names of C++ functions are “mangled” to include their argument and return types.) This means that by using the above rules, we can

Call functions written in C, including C-standard-library functions.
Call assembly functions from C/C++ code.

In order to call a C function, we have to do three things:

Write the entry point of our program as main instead of _start, declared as global. Setup the stack correctly on function entry, tear it down on exit, and return the exit code of the program through rax (i.e., no need to use the SYS_EXIT syscall).
Declare any C functions we want to call in our assembly file as extern. E.g., extern printf.
Link with the C library. The asm script detects whether your assembly file uses _start as its entry point, or main, and will automatically link with the C standard library in the latter case, so you don’t need to worry about it.

Here’s an example: The C function puts(char* s) writes a nul-terminated string to the standard output stream, followed by a newline. I.e., it does what the SYS_WRITE syscall does, except that we don’t need to give it the length (just make sure we end the string with a 0-byte), and it will automatically add a newline. To print the “Hello, world!” message using puts we do

section .data

msg:    db      "Hello, world!", 0

; Note: no length, no 10 (newline) byte

section .text

extern puts
global main

main:

    ; Setup stack
    push rbp
    mov rbp, rsp

    ; Set registers to arguments. 
    mov rdi, msg
    call puts

    ; Cleanup stack, return 0
    pop rbp
    mov rax, 0
    ret

One very useful C function is printf, which serves as a replacement for cout. It takes at least one argument, a format string (char*, nul-terminated), which contains placeholders, starting with the % character. The remaining parameters are “spliced in” during printing by replacing their respective placeholders. For example,

printf("%d, %d, %d\n", 1, 2, 3);

would print

1, 2, 3

followed by a newline. %d is the placeholder for a signed 32-bit integer value. The useful placeholders are

Placeholder	Description	Argument type
`%c`	Single character	`int` (not `char`!)
`%s`	Nul-terminated string	`char*`
`%d`	Signed integer	`int` (dword)
`%ld`	Signed integer	`long` (qword)
`%hd`	Signed integer	`short` (word)
`%hhd`	Signed integer	`char` (byte)
`%u`	Unsigned integer	`unsigned int`
`%lu`, `%hu`, …	Unsigned integers	As for `%d`
`%f`	Floating-point value	`double` (not `float`!)

As printf takes any number of arguments, it is a variadic function, which adds an extra rule to how its arguments are passed:

If there are any floating-point arguments passed in the xmm registers, then set al to the number of these arguments.

A counterpart to printf is scanf which takes a similar formatting string except that it reads input from stdin, according to the formats given. The remaining arguments must be pointers to variables which the inputs will be written into. E.g., if we do

int x, y, z;
scanf("%d %d %d", &x, &y, &z);

and type in 1 2 3 then variable x will contain 1, y will contain 2, and z will contain 3.

With scanf, it is vitally important to use the right format specifier! E.g., using %d (dword) when you are intending to read in a qword will cause problems, typically because scanf will only write to the low dword, leaving the high 4 bytes unchanged.