Review of last time: structures and linked lists

We defined a structure type in assembly for linked lists:

struc node
    value:  resq    1   
    next:   resq    1
endstruc

and then we proceeded to write a number of classic linked-list functions in assembly. To review, we’ll write a list concat function, which concatenates two lists together to create a new one. We’ll use the new_node function from the previous lecture to create new nodes. In C, this would be something like

node* concat(node* a, node* b) {
  node* head = 0; // Head of new list
  node* tail = 0; // Tail of the new list

  // Copy a to new list
  while(a != 0) {
    if(tail) {
      tail->next = new_node(a->value, 0);
      tail = tail->next;
    }
    else {
      head = tail = new_node(a->value, 0); // Create first node
    }

    a = a->next;
  }

  // Copy b after it
  while(b != 0) {
    if(tail) {
      tail->next = new_node(b->value, 0);
      tail = tail->next;
    }
    else {
      head = tail = new_node(b->value, 0); // First node
    }
  }

  return head;
}

Translating this to assembly is somewhat more difficult than the other functions, because of the many calls to new_node. We should avoid using the caller-preserved variables (which includes rax, rdi, and rsi), because otherwise we would have to push/pop them on the stack before/after every call. If we use the callee-preserved registers, we only have to push/pop them once, at the beginning/end of the function.

concat:
  push rbp
  mov rbp, rsp

  ; We need four registers:
  ; r12 = a
  ; r13 = b
  ; r14 = head
  ; r15 = tail
  push r12
  push r13
  push r14
  push r15

  mov r12, rdi
  mov r13, rsi
  xor r14, r14
  xor r15, r15

  ; Copy a
.while_a:
  cmp r12, 0
  je .while_b

  ; In both branches of the if, we create the new node, so we do that here
  mov rdi, qword [r12 + value]  ; value
  mov rsi, 0                    ; next
  call new_node

  cmp r15, 0
  je .else_a
  mov qword [r15 + next], rax   ; tail->next = new_node(...)
  mov r15, rax                  ; tail = tail->next
  jmp .continue_a
.else_a:
  mov r14, rax                  ; head = tail = new_node(...)
  mov r15, rax

.continue_a:
  mov r12, qword [r12 + next]
  jmp .while_a

.while_b:
  cmp r13, 0
  je .return

  ; Call new_node(...)
  mov rdi, qword [r13 + value]  ; value
  mov rsi, 0                    ; next
  call new_node  

  cmp r15, 0
  je .else_b
  mov qword [r15 + next], rax   ; tail->next = new_node(...)
  mov r15, rax                  ; tail = tail->next
  jmp .continue_b
.else_b:
  mov r14, rax                  ; head = tail = new_node(...)
  mov r15, rax

.continue_b:
  mov r13, qword [r13 + next]
  jmp .while_b

.return:
  mov rax, r14                  ; Return value = head
  pop r15
  pop r14
  pop r13
  pop r12
  pop rbp
  ret

C-style strings

A C-style string is an array of bytes where the last byte is equal to 0. This allows us to determine the length of a string without explicitly storing it: simply traverse the string until we find this 0 byte, counting bytes as we go. In C/C++ this looks like this:

size_t strlen(char* s) {
    size_t l = 0;
    while(*s != 0) {
        ++l;
        ++s;
    }
    return l;
}

Note that 0 and the character constant \0 are exactly the same.

Finding the length of a string is of course foundational to all other string operations: copying, appending, string searching, etc. all depend on knowing how long a string is. Thus, we want to make this operation as fast as possible. The other string operations we’ll look at are

A first translation to assembly of the above might look like this:

strlen:
    ; rdi = char* s
    ; rax = returned length
    xor rax, rax

.while:
    cmp byte [rdi], 0
    je .return
    inc rdi
    inc rax
    jmp .while

.return:
    ret

This is one example of an assembly function which is not much longer than its original C version, however, accessing memory one byte at a time is not the most efficient way to implement this. In this lecture we’ll look at a number of different ways to do this, from using string-specific instructions with the rep prefix, to loading more than one byte at a time, to advanced methods which use the XMM registers to parallelize the process.

strcpy is just a matter of copying bytes from one string to another. We assume that the destination string has enough space. In C this is

char* strcpy(char* dest, char* src) {
    while(*src != 0) {
        *dest = *src;
        ++dest;
        ++src;
    }
    *dest = 0; // Copy terminating nul
    return dest;
}

(We return the address of the final terminating 0 in dest as its useful in implementing strcat below.)

or in assembly

strcpy:
    ; rdi = char* dest
    ; rsi = char* src

.while:
    cmp byte [rsi], 0
    je .done
    movsb     ; See string instructions below
    inc rsi
    inc rdi
    jmp .while

.done:
    mov byte [rdi], 0
    mov rax, rdi
    ret

In order to concatenate two strings, we have to copy the first into the destination, followed by the second, followed by the terminating 0.

char* strcat(char *dest, char* a, char* b) {
    dest = strcpy(dest, a);
    return strcpy(dest, b);
}
strcat:
  ; rdi = char* dest
  ; rsi = char* a
  ; rdx = char* b
  push rbp
  mov rbp, rsp

  call strcpy

  mov rdi, rax
  mov rsi, rdx
  call strcpy

  pop rbp
  ret

strstr is the most interesting. The natural way to implement a substring search is with a nested loop:

char* strstr(char* src, char* ptn) {
  size_t tl = strlen(src);
  size_t pl = strlen(ptn);

  for(size_t = 0; i < tl - pl; ++i) {
    bool matches = true;
    for(size_t j = 0; j < pl; ++j)
      if(src[i + j] != ptn[j]) {
        matches = false;
        break;
      }

    if(matches)
      return src + i;
  }

  return 0; // Not found, null pointer
}

This implementation runs in time \(O(mn)\) time, where m is the length of the pattern, and n is the length of the search string. There are faster algorithms, but they rely on the precomputation of various skip tables for the pattern ptn.

In order to translate this into assembly, we will need to make a few modifications. First, we want to eliminate the calls to strlen. It should be possible to detect the end of both the search string and the pattern in the loops:

char* strstr(char* src, char* ptn) {
  for(size_t i = 0; ; ++i) {
    bool matches = true;
    for(size_t j = 0; ptn[j] != 0; ++j) {

      if(src[i + j] == 0)
        break;

      if(src[i + j] != ptn[j]) {
        matches = false;
        break;
      }
    }

    if(matches)
      return src + i;
  }

  return 0; // Not found
}

The second adjustment is to notice that inside the function, we never use src except as src + i. So we can replace the outer integer loop with a pointer-based: loop:

char* strstr(char* src, char* ptn) {
  for(; ; ++src) {
    bool matches = true;
    for(size_t j = 0; ptn[j] != 0; ++j) {

      if(src[j] == 0)
        break;

      if(src[j] != ptn[j]) {
        matches = false;
        break;
      }
    }

    if(matches)
      return src;
  }

  return 0; // Not found
}

We can replace the inner loop similarly, provided we save the original ptn, so that we can restart at the beginning:

char* strstr(char* src, char* ptn) {
  for(; ; ++src) {
    bool matches = true;
    for(char* p = ptn, char* s = src; *p != 0 && *s != 0; ++p, ++s) 
      if(*s != *p) {
        matches = false;
        break;
      }

    if(matches)
      return src;
  }

  return 0; // Not found
}

Because the end-of-string condition inside the inner loop was at the beginning, we can fold it into the loop condition. Finally, note that we don’t actually need to store the value of the variable matches. For each comparison in the inner loop, if the two bytes are not equal, then we jump to the update step of the outer loop. On the other hand, if the inner loop completes normally, this means that all of the bytes were equal, and we can return. The ability of assembly to jump to an arbitrary label in this case actually allows us to simplify the code, removing a flag variable.

This function is simple enough that we can translate it into assembly.

strstr:
  ; rdi = char* src
  ; rsi = char* ptn
  ; rax = returned char*

  ; rsi = s
  ; rdi = p
  ; rax = src
  ; r11 = ptn

  mov rax, rdi
  mov r11, rsi

.src_loop:
  mov rdi, r11       ; p = ptn
  mov rsi, rax       ; s = src

.ptn_loop:
  cmp byte [rdi], 0
  je .end_ptn_loop
  cmp byte [r11], 0
  je .end_ptn_loop

  cmpsb               ; [rdi] != [rsi]
  jne .cont_ptn_loop
  ret                 ; return src

.cont_ptn_loop:
  inc rdi
  inc rsi
  jmp .ptn_loop

.cont_src_loop:
  inc rax
  jmp .src_loop

To do the actual comparison of byte [rdi] with byte [rsi], we use the string instruction cmpsb, which allows us to compare two memory operands, something we would not normally be able to do. cmps is one of a family of string instructions, all of which use rdi/rsi implicitly as addresses to read from.

String instructions

There are a number of string-specific instructions, all ending with s, which share several common features:

The rep prefixes

The rep prefixes are modifiers that can be applied to a handful of instructions telling the CPU to repeat them internally (i.e., no comparison/branch required!) until some condition is fulfilled. The prefixes available are

The rep prefix can be used with string instructions movs, lods, stos. The repe/repne prefixes can be used with string instructions cmps and scas.

Note that no matter what prefix is used, the repetition will terminate when rcx == 0. Thus, if you don’t want rcx to have any effect on the number of times the instruction is repeated, set it to the largest possible unsigned qword value:

mov rcx, -1

Transfer instructions: lods, stos and movs

There are three instructions which transfer data from memory to al, from al to memory, and from memory to memory. These are lods, stos and movs:

Instruction Description
lodsb Load byte [rdi] into al
stosb Write byte from al into byte [rdi]
movsb Copy byte from [rsi] to [rsi]

The rep prefix causes these operations to run rcx times. None of the other prefixes can be used with them, thus, the length of the input string(s) must be known in advance. Still, stosb can be used to fill an array of memory with a constant byte, while movsb can be used to copy one string into another. rep lodsb is not particularly useful, as it repeatedly loads bytes into al but then doesn’t do anything with them.

Comparison instructions: cmps and scas

cmpsb compares the byte at [rdi] with the byte at [rsi] and updates the flags accordingly. Because the repe/repne prefixes use ZF, this can be used to compare a pair of strings byte-wise, stopping at the first pair of bytes which are equal/not equal.

scas compares byte [rdi] with al and updates the flags accordingly. Thus, while cmps corresponds to comparing two strings, byte-wise, scas corresponds to searching through a string for a particular character, stored in al. Again, repe/repne can be used to search for the first occurrence of a byte which is equal/not equal to al.

Question: can we use the repe prefix to simplify the implementation of strstr above? Personally, I don’t believe so, as our termination condition is more complex than just !=; we also have to check for the terminating \0 in either list, but I’d be happy to be proved wrong!

String-operation strlen

We can use scas to implement a rep-enabled version of strlen:

strlen:
    ; rdi = char*
    ; rax = return length

    mov rcx, -1     ; Max 64-bit unsigned value
    mov r11, rdi    ; Save original rdi
    xor al, al      ; al = '\0'
    repne scasb
    ; Now rdi points to the 0 byte

    sub rdi, rax
    mov rdi, rax
    ret

Parallelizing strlen

Our current versions of strlen load only one byte at a time. We can, by using the full width of the 64-bit registers, load 8 bytes in the same amount of time, but then we have the problem of detecting a single 0 byte anywhere inside those 8 bytes. The easiest way to do this is to repeatedly shift and examine the low/high byte parts of the register:


  mov rbx, qword [rdi]
  cmp bl, 0
  je .low
  cmp bh, 0
  je .high

  ; Check next word
  shr rbx, 16
  add rdi, 2
  cmp bl, 0
  ...

Because there are four words, we will need to go through this shift/compare process four times (it will be faster if we write them all out by hand, rather than using a loop). The main trouble is that we’re using a general-purpose register (rbx) as if it was a vector register of 8 bytes, which it isn’t. If we switch to using the xmm registers, we have more instructions at our disposal which can treat all the bytes equally.

This is a portion of the optimized strlen from Abner Fog’s fast assembly C routines library:

    ; rdi = char* s

    pxor     xmm0, xmm0            ; set to zero
    mov      rax,  rdi
    sub      rax,  10H             ; rax = rdi - 16

    ; Main loop, search 16 bytes at a time
L1: add      rax,  10H             ; increment pointer by 16
    movq     xmm1, [rax]           ; read 16 bytes 
    pcmpeqb  xmm1, xmm0            ; compare 16 bytes with zero
    pmovmskb edx,  xmm1            ; get one bit for each byte result
    bsf      edx,  edx             ; find first 1-bit
    jz       L1                    ; loop if not found  


    ; Zero-byte found. Compute string length        
    sub      rax,  rdi             ; subtract start address
    add      rax,  rdx             ; add byte index
    ret

The essential idea of the algorithm is

The movq instruction loads a quadword from memory into an xmm register (or writes a quadword from an xmm register into memory). movq is a bit slow because it allows for unaligned reads; Fog’s original implementation used movdqa which requires that the address be a multiple of 16; misaligned strings are handled specially by first examining the unaligned portion before entering the main loop.

The pcmpeqb instruction (short for Compare Packed Equal Bytes) compares two xmm registers byte-wise for equality, and sets each byte in the destination to all 1s if true, or all 0s if false.

The pmovmskb instruction (Move Mask Bytes) takes the high bit of each byte component of an xmm register, and copies them into the bits of a general-purpose register. This effectively gives us the same 0-or-1 results from the comparison, but packed into the bits of edx instead of spread out over xmm1.

The bsf instruction (Bit Scan Forward) instruction searches for the first (lowest) bit which is set.

If no bits are set in edx, then none of the bytes in the 16 we loaded were 0, so we have not reached the end of the string yet. Otherwise, edx is the offset of the end-of-string byte within the 16 that we loaded.

Let’s look at an example to see how this will work: we want to find the length of the string "The quick brown fox jumped over the lazy dog.". This string has 45 characters:

T h e q u i c k b r o w n f o x j u m p e d o v e r t h e l a z y d o g . \0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

We pre-load xmm0 with all 0s:

xmm0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

The first iteration through the loop will load The quick brown into xmm1.

xmm1 T h e q u i c k b r o w n
Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Since none of these bytes are \0, the comparison will set everything in xmm1 to 0s:

xmm1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Which we then copy into edx, as single bits: edx = 0000000000000000b.

Because no bit is set, bsf will set the zero flag ZF = 1, causing the jump to L1.

The next cycle, rax = 16, so we will load fox jumped over into xmm1. As there are no zero bytes here, either, so the process repeats.

Finally, we will load the lazy dog.\0 plus an additional 2 bytes beyond the end of the string into xmm1:

xmm1 t h e l a z y d o g . \0
Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

After the comparison, we will have xmm1 set as

xmm1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Byte 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

This is then copied into edx as 0010000000000000b. bsf will set edx to 13, the index of the set bit, while also setting ZF = 0. This terminates the loop.

The final length of the string is 32 (rax) plus 13 (bit index) = 45, as expected.