Review: MIPS pipelining
The MIPS (not x86!) pipeline is broken into five stages:
Instruction fetch (IF): The instruction is loaded from memory/cache, and the instruction pointer is incremented. This is relatively simple as on MIPS, every instruction is 32 bits wide. (On x86, different instructions have different widths, which makes fetching/decoding more complex.)
Instruction decode (ID): Setup the following stages to do whatever the instruction does. Also computes branch targets for branch instructions.
Execute (EX): perform the instruction’s operation (addition, bitwise operation, etc.).
Memory access (MEM): read from memory or write to memory. Only load/store instructions are allowed to access memory, all others operate only on registers.
Writeback (WB): Write results back into the output register (not into memory).
Because these stages are largely independent, using different resources on the CPU, they can be overlapped: while one instruction is being decoded, the following instruction can already be being fetched. Thus, if there are n instructions in a program, instead of taking 5n cycles to complete (as if no overlapping were allowed), it takes only n+4 cycles (+4 because it takes four cycles at the beginning to fill the pipeline).
For example, if we have two instructions I1 followed by I2 we end up with the following timeline:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | … |
---|---|---|---|---|---|---|---|
IF | I1 | I2 | |||||
ID | I1 | I2 | |||||
EX | I1 | I2 | |||||
MEM | I1 | I2 | |||||
WB | I1 | I2 |
Sometimes instructions cannot be overlapped; this is the result of various kinds of hazards:
Data hazard: the result of one instruction is used in the immediately following instruction. The result is not normally available until the WB stage.
Structural hazard: two instructions need to use the same CPU resource (not the same stage) at the same time. E.g., the CPU’s arithmetic unit might be used in the EX stage (for arithmetic operations), in the ID stage (for computing branch targets), or in the MEM stage (for computing effective memory addresses). If two instructions are overlapped which both require the arithmetic unit at the same time, the later one will have to wait, unless multiple redundant units are available.
Control hazard: arises from (conditional) jumps, where the “next instruction” is not the immediately following one (which has already started being fetched). Solved by predicting the branch to go one way or another (and being penalized when the prediction is wrong), or requiring a branch delay slot (the next instruction after the branch is always executed, even when it shouldn’t be!).
As an example of figuring out hazards, consider the following pseudo-MIPS assembly program:
L1: LOAD r1, address ; r1 = *address
A2: ADD r2, r1, r1 ; r2 = r1 + r1
S2: SUB r2, r3, r2 ; r2 = r3 - r2
All instructions need to go through IF and ID stages.
The EX stage for
LOAD
is a do-nothing (still takes 1 cycle, but nothing happens). The results of theLOAD
in registerr1
are not available until WB.Similarly, the results in
r2
/r3
of the addition/subtraction are not available until after their respective WB stages.
Scheduling the pipeline with the necessary stalls for these three instruction leads to:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
IF | L1 | A2 | S2 | ||||||||
ID | L1 | A2 | S2 | .. | .. | ||||||
EX | L1 | .. | .. | A2 | .. | .. | S3 | ||||
MEM | L1 | A2 | S3 | ||||||||
WB | L1 | A2 | S3 |
Optimal execution for three instructions would be 3+4 = 7 cycles, but due to the data hazards, this sequence requires 11 cycles (still better than the 15 it would require with no pipelining at all).
In general, if I2 depends on the results of previous I1, then I2 must stall until I1’s writeback stage is complete. (We mentioned that bypassing, sending the results of the EX stage back into its inputs, is a possible solution to stalls resulting from data hazards.)
Bitwise operations
Bitwise operations become even more important at the assembly level. We’ve
already seen xor
, exclusive-or, here we’ll see the other operations or
,
and
, not
, and andn
(and-not), as well as the bit-shift and rotation
operations. In addition to these operations, which have direct analogs in the
C/C++ bitwise operators, assembly supports additional bitwise operations that
have no C/C++ equivalent. Finally, the test
operation is the bitwise
comparison operator, used for building conditional branches which depend on
the state of certain bits.
Bitwise operations: AND, OR, NOT, AND-NOT
not dest ; dest = ~dest
and dest, src ; dest = dest & src
or dest, src ; dest = dest | src
andn dest, src ; dest = dest & ~src
xor dest, src ; dest = dest ^ src
The usual restrictions are in place: both operands must be of the same size,
they cannot both be in memory, and only the source operand can be an immediate.
not
can be used with either a register or a memory location.
It may be useful to review the truth tables for these operations.
~a
is 1 if a is 0 and 0 if a is 1:a ~a
0 1 1 0 a & b
is 1 only if both a and b are 1, otherwise it is 0.a b a & b
0 0 0 1 0 0 0 1 0 1 1 1 a | b
is 1 if either a or b is 1, and is 0 only if both of them are.a b a OR b
0 0 0 1 0 1 0 1 1 1 1 1 a & ~b
is 1 if a is 1 and b is 0, and 0 otherwise.a b a & ~b
0 0 0 1 0 1 0 1 0 1 1 0 a ^ b
is 1 if either of a or b is 1, but not when both or neither are.a b a ^ b
0 0 0 1 0 1 0 1 1 1 1 0
The above tables demonstrate how the operations are applied to pairs of single bits. When applied to binary values, the operation is applied across all pairs of bits. For example,
01101101
& 11101110
------------
01101100
Setting, unsetting, and flipping bits
Common operations on individual bits include setting a particular bit (forcing it to 1), unsetting (forcing to 0), and flipping (1 to 0 and 0 to 1). The first stage of this process is to construct a bitmask, a value which is 1 in the bits that we wish to manipulate, and 0 everywhere else. If the bit(s) we are interested in are constant, then this is easy: just write a binary constant:
bitmask: EQU 10000000b ; Manipulate bit 7
If the bit positions are not constant, then we can use shifts (see below) to construct the mask dynamically.
To set a bit, while leaving all other bits unchanged, we use or
. x or 0 == x
,
so the bits of the mask which are unset will be unchanged, but x or 1 == 1
, so
the set bit in the mask will be forced to 1.
or rax, bitmask
To unset (clear) a bit, we need a bitwise operation OP
which satisfies
x OP 0 == x
and x OP 1 == 0
. Looking at the above tables, we find that
AND-NOT is what we want:
andn rax, bitmask
To flip a bit, we need an operation OP
where x OP 0 == x
but
x OP 1 == ~x
. Again, from the tables we see that XOR is what we want:
xor rax, bitmask
These can be used with bitmasks with more than one set bit, to set/clear/flip multiple bits at once (of course, the same operation is done to all bits; if you want to clear some bits and set others, you’ll need multiple masks and multiple operations).
Sign extension
Above, the assembler took care of expanding the value of bitmask
to
64 bits (the size of rax
), so that the sizes would match. However, the
bitwise operations can actually be used with immediates smaller than the
destination. We can force this behavior by specifying an explicit size for
the bitmask immediate:
mov rax, 0
or rax, byte bitmask ; What bit(s) does this set?
When the immediate value is (explicitly) smaller than the destination, it is
sign extended to the size of the destination. Sign extension is a process
used to convert a signed binary value with x bits to a signed binary value
with y bits, with \(y > x\). To see why some conversion is necessary,
let’s look at the twos-complement representation of -1 as a single byte.
(Remember that we construct the twos-complement representation by flipping
all the bits and adding 1.) So start with 00000001b
, flip all the bits
giving 11111110
and then add 1, resulting in 11111111b
. Now we want to
expand this to 16 bits. If we simply add 0s, then we have
0000000011111111b = 127 (!)
Remember that if the high bit is unset, the value is positive! Instead, we
copy the highest bit of 11111111b
into the new bits we add:
1111111111111111b = -1
Whenever we expand a signed value, we need to perform sign extension. Conversely, if we are expanding an unsigned value, then we must not perform sign extension, as it will give the wrong result. To expand an unsigned value, we perform zero extension, filling the high bits with zeros.
These conversions can be performed between registers using the operations
movsx
and movzx
:
movsx dest, src ; Sign-extend src into dest
movzx dest, src ; Zero-extent src into dest
It’s up to you to keep track of whether you are dealing with signed or unsigned values and use the correct conversion operation!
Shifts and rotates
In C/C++ we have the <<
and >>
operators, which perform left-bit-shift and
right-bit-shift. They are typically used as “shortcuts” for multiplying/dividing by
2. To see why this works, consider the value 3 in binary:
00000011 = 3
If we shift the bits to the left, then what was the 1-valued bit becomes 2, and what was the 2-valued bit becomes 4, giving us
00000110 = 6
A 0 is shifted in at the low end, to fill in the unoccupied bit position.
Shifting to the left one bit multiplies by 2. Shifting n bits to the
left multiplies by 2
Similarly, shifting 6 to the right one bit gives us 3, so right-shifting is
equivalent to dividing by 2
For unsigned values, shifting works exactly as described above. When shifting
signed values, we need to be careful of negative values. For example,
-6 = 11111010b
. If we shift this to the right while filling the high bit with
a 0, we get
01111101 = 125 (!)
When right-shifting a signed value, we do not want to fill the high bits with zeros. Rather, we want to copy the existing high bit into them, so that the sign is preserved:
11111101 = -3
A sign-preserving right shift is called an arithmetic shift, while a zero-filling shift is called a logical shift. Assembly has both. Note that the distinction only matters for right shifting; for left shifting, we always fill the low bit(s) with 0.
The assembly instructions for shifting are
shl dest, a ; Logical shift left
sal dest, a ; Arithmetical shift left
shr dest, a ; Logical shift right
sar dest, a ; Arithmetical shift right
Although assembly has both shl
and sal
instructions, they perform the
same operation: shift left, fill low bits with 0s.
The operand a, the number of bits to shift, is severely restricted: it must
be either an 8-bit immediate, or the cl
register (low 8 bits of rcx
). The
maximum shift value is 63.
When shifting, the high/low bit that is shifted away is placed in the carry flag CF. Multi-bit shifts behave as if multiple 1-bit shifts were performed in sequence. (I.e., the last bit shifted out will be placed in CF.)
For 1-bit shifts, the OF flag is set if the sign has changed.
Using bit-shifts to construct masks
We can use bitshifts to construct bitmasks on-the-fly, rather than hard-coding
them. For example, suppose we want to construct the bitmask 00000000011001000b
.
This mask has bits 3, 6, and 7 set. To create this, we start by constructing
a mask with 1 in the 0 bit position, and then shift it up by the
desired amounts:
mov rax, 0 ; bitmask
mov rbx, 1
shl rbx, 3 ; Bit 3 of rbx is set
or rax, rbx ; Bit 3 of rax is set
shl rbx, 3 ; Bit 6 of rbx is set
or rax, rbx ; Bits 3 and 6 of rax are set
shl rbx, 1 ; Bit 7 of rbx is set
or rax, rbs ; Bits 3, 6, and 7 are set
Rotations
In addition to shifting bits, assembly has the ability to rotate the bits left or right; the high/low bit that would otherwise be discarded (placed in CF) is simply moved around to the other end, to fill in the unoccupied bit position. Rotations do not have a direct mathematical analogue, and thus are less useful than shifts.
ror dest, a ; Rotate dest a bits to the right
rol dest, a ; Rotate dest a bits to the left
As with bit-shifts, the amount a must be either an immediate or the cl
register, and must be in the range of 0 to 63. The CF flag is set to a copy
of the last bit to be “moved” from one end to the other.
Testing bits
The test
instruction performs a bitwise AND and then sets up the SF and ZF
flags accordingly. This can be used, together with a bitmask, to test whether
a single bit (or set of bits) is on:
test rax, 00010000b
jnz target ; Jump if bit 4 is set in rax
It can also be used to test whether a value is 0:
test rax, rax
; If ZF == 1 then rax == 0
and also to test the sign of a value:
test rax, rax
; If SF == 1 then rax < 0
A shorter way to test a specific bit is to use the bt
instruction:
bt a, b
bt
copies the value of the b-th bit in a into the carry flag CF. Thus,
if bit b is set in a, CF == 1, otherwise it equals 0. a can be either
register or memory, and b can be either register or immediate. Thus, another
way to test bit 4 of rax
would be
bt rax, 4
jc target
There are variants of the bit-test instruction which not only copy the bit to CF, but also set, unset (“reset”), and flip (complement) the bit at the same time:
btc a, b ; Bit test and complement
btr a, b ; Bit test and reset
bts a, b ; Bit test and set
Searching for bits
Sometimes we need to scan through a value and find the first bit on either the low end, or the high end, which is set. For example, in
00111000
the first lowest bit set is at position 3, while the first highest bit is at
position 5. The instructions bsf
(“bit scan forward”) and bsr
(“bit scan
reverse”) give us this information:
bsf dest, src
bsr dest, src
These scan through the source (register/memory) and write the bit position into the destination (register).
Finally, there are number of odd bit instructions which search for bits while
flipping others and the like. These are blsi
, bzhi
, blsr
, blsmask
,
bextr
and a few others. Look them up if you are interested in what they do.
Application: pseudo-random number generation
It’s not really possible to generate true random numbers purely algorithmically on a computer; an algorithm is a series of steps, so if you repeat those steps exactly, you’ll get the same results. Rather, we seek “pseudo-randomness”, a sequence of values which appear to be random, if you ignore the fact that we could repeat them by re-running the algorithm. There are a number of algorithms for generating PRNGs. Most of these maintain some internal state, which persists from one number to the next. The numbers generated are derived from the state, often just consisting of the high n bits of it, but sometimes using more involved transformations.
The oldest and simplest is a linear congruential generator, which has the basic form
state = (state * A + B) % m
where A and B are constants that must be chosen carefully. Modern generators often exploit bitwise operations, as this allows more mixing of the bits in the state. The SplitMix generator, used in Java 8, uses a 64-bit state together with right-shifts and XORs, looks like this in C/C++:
static uint64_t x = ... ; // Seed/state
uint64_t next() {
uint64_t z = (x += 0x9e3779b97f4a7c15);
z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
return z ^ (z >> 31);
}
(Source: http://xoshiro.di.unimi.it/splitmix64.c)
Note that the value returned is not the internal state x, but a variant of it z. The state x advances quite simply, being incremented by a constant.
To implement this in assembly, we’ll store the state in memory, and then write a function to advance it and return the next random value.
section .data
state: dq 137546 ; Seed value
section .text
next:
push rbp
mov rbp, rsp
; Constants
mov r8, 0xbf58476d1ce4e5b9
mov r9, 0x94d049bb133111eb
mov r10, 0x9e3779b97f4a7c15
; Update state
add qword [state], r10
mov rax, qword [state]
; z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
mov rbx, rax
shr rbx, 30
xor rax, rbx
mul r8
; z = (z ^ (z >> 27)) * 0x94d049bb133111eb
mov rbx, rax
shr rbx, 27
xor rax, rbx
mul r9
; return z ^ (z >> 31)
mov rbx, rax
shr rbx, 31
xor rax, rbx
pop rbp
ret
We can test this by writing a main
which prints out the random bytes
generated (note: not the random numbers, but the raw byte values) and then
feed them to the PractRand test suite. Here’s
the rest of the program:
section .data
state: dq 137546 ; Seed value
buffer: dq 0
section .text
global _start
_start:
push rbp
mov rbp, rsp
.loop:
call next
mov [buffer], rax
mov rax, 1 ; Write syscall
mov rdi, 1 ; Stdout
mov rsi, buffer ; Address
mov rdx, 8 ; Length
syscall
jmp .loop
pop rbp
mov rax, 60
mov rdi, 0
syscall
next:
... ; Remainder of next function
and then we can test it with
./splitmix | RNG_test stdin
This basically lets splitmix
run forever, constantly spitting out random
data, which is consumed by RNG_test
. RNG_test
runs a battery of statistical
tests on the data it receives and reports any problems, for increasingly
large blocks of data.