Review of last time: structures and linked lists
We defined a structure type in assembly for linked lists:
struc node
value: resq 1
next: resq 1
endstruc
and then we proceeded to write a number of classic linked-list functions
in assembly. To review, we’ll write a list concat
function, which concatenates
two lists together to create a new one. We’ll use the new_node
function from
the previous lecture to create new nodes. In C, this would be something like
node* concat(node* a, node* b) {
node* head = 0; // Head of new list
node* tail = 0; // Tail of the new list
// Copy a to new list
while(a != 0) {
if(tail) {
tail->next = new_node(a->value, 0);
tail = tail->next;
}
else {
head = tail = new_node(a->value, 0); // Create first node
}
a = a->next;
}
// Copy b after it
while(b != 0) {
if(tail) {
tail->next = new_node(b->value, 0);
tail = tail->next;
}
else {
head = tail = new_node(b->value, 0); // First node
}
}
return head;
}
Translating this to assembly is somewhat more difficult than the other
functions, because of the many calls to new_node
. We should avoid using the
caller-preserved variables (which includes rax
, rdi
, and rsi
), because
otherwise we would have to push/pop them on the stack before/after every call. If
we use the callee-preserved registers, we only have to push/pop them once, at
the beginning/end of the function.
concat:
push rbp
mov rbp, rsp
; We need four registers:
; r12 = a
; r13 = b
; r14 = head
; r15 = tail
push r12
push r13
push r14
push r15
mov r12, rdi
mov r13, rsi
xor r14, r14
xor r15, r15
; Copy a
.while_a:
cmp r12, 0
je .while_b
; In both branches of the if, we create the new node, so we do that here
mov rdi, qword [r12 + value] ; value
mov rsi, 0 ; next
call new_node
cmp r15, 0
je .else_a
mov qword [r15 + next], rax ; tail->next = new_node(...)
mov r15, rax ; tail = tail->next
jmp .continue_a
.else_a:
mov r14, rax ; head = tail = new_node(...)
mov r15, rax
.continue_a:
mov r12, qword [r12 + next]
jmp .while_a
.while_b:
cmp r13, 0
je .return
; Call new_node(...)
mov rdi, qword [r13 + value] ; value
mov rsi, 0 ; next
call new_node
cmp r15, 0
je .else_b
mov qword [r15 + next], rax ; tail->next = new_node(...)
mov r15, rax ; tail = tail->next
jmp .continue_b
.else_b:
mov r14, rax ; head = tail = new_node(...)
mov r15, rax
.continue_b:
mov r13, qword [r13 + next]
jmp .while_b
.return:
mov rax, r14 ; Return value = head
pop r15
pop r14
pop r13
pop r12
pop rbp
ret
C-style strings
A C-style string is an array of bytes where the last byte is equal to 0. This allows us to determine the length of a string without explicitly storing it: simply traverse the string until we find this 0 byte, counting bytes as we go. In C/C++ this looks like this:
size_t strlen(char* s) {
size_t l = 0;
while(*s != 0) {
++l;
++s;
}
return l;
}
Note that 0 and the character constant \0
are exactly the same.
Finding the length of a string is of course foundational to all other string operations: copying, appending, string searching, etc. all depend on knowing how long a string is. Thus, we want to make this operation as fast as possible. The other string operations we’ll look at are
strcpy
– Copy one string to another.strcat
– Concatenate two strings into a third.strstr
– Search a string for a substring
A first translation to assembly of the above might look like this:
strlen:
; rdi = char* s
; rax = returned length
xor rax, rax
.while:
cmp byte [rdi], 0
je .return
inc rdi
inc rax
jmp .while
.return:
ret
This is one example of an assembly function which is not much longer than its
original C version, however, accessing memory one byte at a time is not the
most efficient way to implement this. In this lecture we’ll look at a number of
different ways to do this, from using string-specific instructions with the
rep
prefix, to loading more than one byte at a time, to advanced methods which
use the XMM registers to parallelize the process.
strcpy
is just a matter of copying bytes from one string to another. We assume
that the destination string has enough space. In C this is
char* strcpy(char* dest, char* src) {
while(*src != 0) {
*dest = *src;
++dest;
++src;
}
*dest = 0; // Copy terminating nul
return dest;
}
(We return the address of the final terminating 0 in dest
as its useful in
implementing strcat
below.)
or in assembly
strcpy:
; rdi = char* dest
; rsi = char* src
.while:
cmp byte [rsi], 0
je .done
movsb ; See string instructions below
inc rsi
inc rdi
jmp .while
.done:
mov byte [rdi], 0
mov rax, rdi
ret
In order to concatenate two strings, we have to copy the first into the destination, followed by the second, followed by the terminating 0.
char* strcat(char *dest, char* a, char* b) {
dest = strcpy(dest, a);
return strcpy(dest, b);
}
strcat:
; rdi = char* dest
; rsi = char* a
; rdx = char* b
push rbp
mov rbp, rsp
call strcpy
mov rdi, rax
mov rsi, rdx
call strcpy
pop rbp
ret
strstr
is the most interesting. The natural way to implement a substring
search is with a nested loop:
char* strstr(char* src, char* ptn) {
size_t tl = strlen(src);
size_t pl = strlen(ptn);
for(size_t = 0; i < tl - pl; ++i) {
bool matches = true;
for(size_t j = 0; j < pl; ++j)
if(src[i + j] != ptn[j]) {
matches = false;
break;
}
if(matches)
return src + i;
}
return 0; // Not found, null pointer
}
This implementation runs in time \(O(mn)\) time, where m is the length of the pattern, and n is the length of the search string. There are faster algorithms, but they rely on the precomputation of various skip tables for the pattern ptn.
In order to translate this into assembly, we will need to make a few modifications.
First, we want to eliminate the calls to strlen
. It should be possible to
detect the end of both the search string and the pattern in the loops:
char* strstr(char* src, char* ptn) {
for(size_t i = 0; ; ++i) {
bool matches = true;
for(size_t j = 0; ptn[j] != 0; ++j) {
if(src[i + j] == 0)
break;
if(src[i + j] != ptn[j]) {
matches = false;
break;
}
}
if(matches)
return src + i;
}
return 0; // Not found
}
The second adjustment is to notice that inside the function, we never use src
except as src + i
. So we can replace the outer integer loop with a pointer-based:
loop:
char* strstr(char* src, char* ptn) {
for(; ; ++src) {
bool matches = true;
for(size_t j = 0; ptn[j] != 0; ++j) {
if(src[j] == 0)
break;
if(src[j] != ptn[j]) {
matches = false;
break;
}
}
if(matches)
return src;
}
return 0; // Not found
}
We can replace the inner loop similarly, provided we save the original ptn
,
so that we can restart at the beginning:
char* strstr(char* src, char* ptn) {
for(; ; ++src) {
bool matches = true;
for(char* p = ptn, char* s = src; *p != 0 && *s != 0; ++p, ++s)
if(*s != *p) {
matches = false;
break;
}
if(matches)
return src;
}
return 0; // Not found
}
Because the end-of-string condition inside the inner loop was at the beginning,
we can fold it into the loop condition. Finally, note that we don’t actually need
to store the value of the variable matches
. For each comparison in the inner
loop, if the two bytes are not equal, then we jump to the update step of the
outer loop. On the other hand, if the inner loop completes normally, this
means that all of the bytes were equal, and we can return. The ability of
assembly to jump to an arbitrary label in this case actually allows us to
simplify the code, removing a flag variable.
This function is simple enough that we can translate it into assembly.
strstr:
; rdi = char* src
; rsi = char* ptn
; rax = returned char*
; rsi = s
; rdi = p
; rax = src
; r11 = ptn
mov rax, rdi
mov r11, rsi
.src_loop:
mov rdi, r11 ; p = ptn
mov rsi, rax ; s = src
.ptn_loop:
cmp byte [rdi], 0
je .end_ptn_loop
cmp byte [r11], 0
je .end_ptn_loop
cmpsb ; [rdi] != [rsi]
jne .cont_ptn_loop
ret ; return src
.cont_ptn_loop:
inc rdi
inc rsi
jmp .ptn_loop
.cont_src_loop:
inc rax
jmp .src_loop
To do the actual comparison of byte [rdi]
with byte [rsi]
, we use the
string instruction cmpsb
, which allows us to compare two memory operands,
something we would not normally be able to do. cmps
is one of a family
of string instructions, all of which use rdi/rsi
implicitly as addresses to
read from.
String instructions
There are a number of string-specific instructions, all ending with s
,
which share several common features:
They all implicitly use the
rdi
andrsi
registers, as addresses. They are either used as two source operands (e.g., for the string comparisoncmps
) or as source (rsi
) and destination (rdi
) for transfer-like operations.They all implicitly increment
rdi
andrsi
, if used, each time through. When the repetition finally terminates,rdi
andrsi
are left in their final positions.All accept the
rep
prefix, which causes the processor itself to repeat the instruction, without any looping/branching on our part.They technically allow strings of bytes, words, dwords, or qwords, but we will only use the byte variants.
The rep
prefixes
The rep
prefixes are modifiers that can be applied to a handful of instructions
telling the CPU to repeat them internally (i.e., no comparison/branch required!)
until some condition is fulfilled. The prefixes available are
rep
– repeatrcx
times, likeloop
. The repetition continues untilrcx == 0
. All of the otherrep
prefixes implicitly use this condition as well; that is,repe
will stop repeating if[rdi] != [rsi]
orrcx == 0
.repe
– repeat until[rdi] != [rsi]
or in some cases, until[rdi] != rax
. This basically loops until the zero flag is set.repz
is an alias for this prefix.repne
– repeat until[rdi] == [rsi]
or in some cases, until[rsi] == rax
. This basically loops until the zero flag is clear.repnz
is an alias for this prefix.
The rep
prefix can be used with string instructions movs, lods, stos
. The
repe/repne
prefixes can be used with string instructions cmps
and scas
.
Note that no matter what prefix is used, the repetition will terminate when
rcx == 0
. Thus, if you don’t want rcx
to have any effect on the number of
times the instruction is repeated, set it to the largest possible unsigned
qword value:
mov rcx, -1
Transfer instructions: lods
, stos
and movs
There are three instructions which transfer data from memory to al
, from
al
to memory, and from memory to memory. These are
lods
, stos
and movs
:
Instruction | Description |
---|---|
lodsb |
Load byte [rdi] into al |
stosb |
Write byte from al into byte [rdi] |
movsb |
Copy byte from [rsi] to [rsi] |
The rep
prefix causes these operations to run rcx
times. None of the other
prefixes can be used with them, thus, the length of the input string(s) must
be known in advance. Still, stosb
can be used to fill an array of memory
with a constant byte, while movsb
can be used to copy one string into
another. rep lodsb
is not particularly useful, as it repeatedly loads bytes
into al
but then doesn’t do anything with them.
Comparison instructions: cmps
and scas
cmpsb
compares the byte at [rdi]
with the byte at [rsi]
and updates the
flags accordingly. Because the repe/repne
prefixes use ZF, this can be used
to compare a pair of strings byte-wise, stopping at the first pair of bytes
which are equal/not equal.
scas
compares byte [rdi]
with al
and updates the flags accordingly. Thus,
while cmps
corresponds to comparing two strings, byte-wise, scas
corresponds
to searching through a string for a particular character, stored in al
.
Again, repe/repne
can be used to search for the first occurrence of a byte
which is equal/not equal to al
.
Question: can we use the repe
prefix to simplify the implementation of
strstr
above? Personally, I don’t believe so, as our termination condition
is more complex than just !=
; we also have to check for the terminating \0
in either list, but I’d be happy to be proved wrong!
String-operation strlen
We can use scas
to implement a rep
-enabled version of strlen
:
strlen:
; rdi = char*
; rax = return length
mov rcx, -1 ; Max 64-bit unsigned value
mov r11, rdi ; Save original rdi
xor al, al ; al = '\0'
repne scasb
; Now rdi points to the 0 byte
sub rdi, rax
mov rdi, rax
ret
Parallelizing strlen
Our current versions of strlen
load only one byte at a time. We can, by using
the full width of the 64-bit registers, load 8 bytes in the same amount of time,
but then we have the problem of detecting a single 0 byte anywhere inside those
8 bytes. The easiest way to do this is to repeatedly shift and examine the
low/high byte parts of the register:
mov rbx, qword [rdi]
cmp bl, 0
je .low
cmp bh, 0
je .high
; Check next word
shr rbx, 16
add rdi, 2
cmp bl, 0
...
Because there are four words, we will need to go through this shift/compare
process four times (it will be faster if we write them all out by hand, rather
than using a loop). The main trouble is that we’re using a general-purpose
register (rbx
) as if it was a vector register of 8 bytes, which it isn’t.
If we switch to using the xmm
registers, we have more instructions at our
disposal which can treat all the bytes equally.
This is a portion of the optimized strlen
from Abner Fog’s fast assembly
C routines library:
; rdi = char* s
pxor xmm0, xmm0 ; set to zero
mov rax, rdi
sub rax, 10H ; rax = rdi - 16
; Main loop, search 16 bytes at a time
L1: add rax, 10H ; increment pointer by 16
movq xmm1, [rax] ; read 16 bytes
pcmpeqb xmm1, xmm0 ; compare 16 bytes with zero
pmovmskb edx, xmm1 ; get one bit for each byte result
bsf edx, edx ; find first 1-bit
jz L1 ; loop if not found
; Zero-byte found. Compute string length
sub rax, rdi ; subtract start address
add rax, rdx ; add byte index
ret
The essential idea of the algorithm is
Load 16 bytes into
xmm1
from addressrax
Compare each byte separately with 0 (
xmm0
is filled with 0s). The result of a vector comparison, remember, is not to change the flags register, but to set each byte to 1s if the comparison was true, or 0s if it was not.Copy 1 bit from each byte in
xmm1
(the true/false compare result) intoedx
.Use the bit-search function to find the first bit which is set to 1 in
edx
. This indicates that the corresponding byte inxmm1
was 0, terminating the string.If no bit was set, repeat.
Otherwise, add the index of the set bit, plus the offset of
rax
from the beginning of the string, to get the length.
The movq
instruction loads a quadword from memory into an xmm
register (or
writes a quadword from an xmm
register into memory). movq
is a bit slow
because it allows for unaligned reads; Fog’s original implementation used
movdqa
which requires that the address be a multiple of 16; misaligned
strings are handled specially by first examining the unaligned portion before
entering the main loop.
The pcmpeqb
instruction (short for Compare Packed Equal Bytes) compares two
xmm
registers byte-wise for equality, and sets each byte in the destination
to all 1s if true, or all 0s if false.
The pmovmskb
instruction (Move Mask Bytes) takes the high bit of each
byte component of an xmm
register, and copies them into the bits of a
general-purpose register. This effectively gives us the same 0-or-1 results
from the comparison, but packed into the bits of edx
instead of spread out
over xmm1
.
The bsf
instruction (Bit Scan Forward) instruction searches for the first
(lowest) bit which is set.
If a bit is set, then
edx
is set to the index of this bit and the zero flag is cleared.If no bits are set, then the zero flag is set.
If no bits are set in edx
, then none of the bytes in the 16 we loaded were
0, so we have not reached the end of the string yet. Otherwise, edx
is the
offset of the end-of-string byte within the 16 that we loaded.
Let’s look at an example to see how this will work: we want to find the
length of the string "The quick brown fox jumped over the lazy dog."
. This
string has 45 characters:
T | h | e | ␣ | q | u | i | c | k | ␣ | b | r | o | w | n | ␣ | f | o | x | ␣ | j | u | m | p | e | d | ␣ | o | v | e | r | ␣ | t | h | e | ␣ | l | a | z | y | ␣ | d | o | g | . | \0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
We pre-load xmm0
with all 0s:
xmm0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
The first iteration through the loop will load The quick brown
into xmm1
.
xmm1 | T | h | e | ␣ | q | u | i | c | k | ␣ | b | r | o | w | n | ␣ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Since none of these bytes are \0
, the comparison will set everything in xmm1
to 0s:
xmm1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
Which we then copy into edx
, as single bits: edx = 0000000000000000b
.
Because no bit is set, bsf
will set the zero flag ZF = 1
, causing the jump
to L1
.
The next cycle, rax = 16
, so we will load fox jumped over
into xmm1
. As
there are no zero bytes here, either, so the process repeats.
Finally, we will load the lazy dog.\0
plus an additional 2 bytes beyond the
end of the string into xmm1
:
xmm1 | t | h | e | ␣ | l | a | z | y | ␣ | d | o | g | . | \0 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
After the comparison, we will have xmm1
set as
xmm1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Byte | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
This is then copied into edx
as 0010000000000000b
. bsf
will set edx
to
13, the index of the set bit, while also setting ZF = 0
. This terminates the
loop.
The final length of the string is 32 (rax
) plus 13 (bit index) = 45, as
expected.