Review: MIPS pipelining

The MIPS (not x86!) pipeline is broken into five stages:

Instruction fetch (IF): The instruction is loaded from memory/cache, and the instruction pointer is incremented. This is relatively simple as on MIPS, every instruction is 32 bits wide. (On x86, different instructions have different widths, which makes fetching/decoding more complex.)
Instruction decode (ID): Setup the following stages to do whatever the instruction does. Also computes branch targets for branch instructions.
Execute (EX): perform the instruction’s operation (addition, bitwise operation, etc.).
Memory access (MEM): read from memory or write to memory. Only load/store instructions are allowed to access memory, all others operate only on registers.
Writeback (WB): Write results back into the output register (not into memory).

Because these stages are largely independent, using different resources on the CPU, they can be overlapped: while one instruction is being decoded, the following instruction can already be being fetched. Thus, if there are n instructions in a program, instead of taking 5n cycles to complete (as if no overlapping were allowed), it takes only n+4 cycles (+4 because it takes four cycles at the beginning to fill the pipeline).

For example, if we have two instructions I1 followed by I2 we end up with the following timeline:

Time:	0	1	2	3	4	5	…
IF	I1	I2
ID		I1	I2
EX			I1	I2
MEM				I1	I2
WB					I1	I2

Sometimes instructions cannot be overlapped; this is the result of various kinds of hazards:

Data hazard: the result of one instruction is used in the immediately following instruction. The result is not normally available until the WB stage.
Structural hazard: two instructions need to use the same CPU resource (not the same stage) at the same time. E.g., the CPU’s arithmetic unit might be used in the EX stage (for arithmetic operations), in the ID stage (for computing branch targets), or in the MEM stage (for computing effective memory addresses). If two instructions are overlapped which both require the arithmetic unit at the same time, the later one will have to wait, unless multiple redundant units are available.
Control hazard: arises from (conditional) jumps, where the “next instruction” is not the immediately following one (which has already started being fetched). Solved by predicting the branch to go one way or another (and being penalized when the prediction is wrong), or requiring a branch delay slot (the next instruction after the branch is always executed, even when it shouldn’t be!).

As an example of figuring out hazards, consider the following pseudo-MIPS assembly program:

L1:     LOAD r1, address   ; r1 = *address
A2:     ADD  r2, r1, r1    ; r2 = r1 + r1
S2:     SUB  r2, r3, r2    ; r2 = r3 - r2

All instructions need to go through IF and ID stages.
The EX stage for LOAD is a do-nothing (still takes 1 cycle, but nothing happens). The results of the LOAD in register r1 are not available until WB.
Similarly, the results in r2/r3 of the addition/subtraction are not available until after their respective WB stages.

Scheduling the pipeline with the necessary stalls for these three instruction leads to:

Time:	0	1	2	3	4	5	6	7	8	9	10
IF	L1	A2	S2
ID		L1	A2	S2	..	..
EX			L1	..	..	A2	..	..	S3
MEM				L1			A2			S3
WB					L1			A2			S3

Optimal execution for three instructions would be 3+4 = 7 cycles, but due to the data hazards, this sequence requires 11 cycles (still better than the 15 it would require with no pipelining at all).

In general, if I2 depends on the results of previous I1, then I2 must stall until I1’s writeback stage is complete. (We mentioned that bypassing, sending the results of the EX stage back into its inputs, is a possible solution to stalls resulting from data hazards.)

Bitwise operations

Bitwise operations become even more important at the assembly level. We’ve already seen xor, exclusive-or, here we’ll see the other operations or, and, not, and andn (and-not), as well as the bit-shift and rotation operations. In addition to these operations, which have direct analogs in the C/C++ bitwise operators, assembly supports additional bitwise operations that have no C/C++ equivalent. Finally, the test operation is the bitwise comparison operator, used for building conditional branches which depend on the state of certain bits.

Bitwise operations: AND, OR, NOT, AND-NOT

not  dest         ; dest = ~dest
and  dest, src    ; dest = dest & src
or   dest, src    ; dest = dest | src
andn dest, src    ; dest = dest & ~src
xor  dest, src    ; dest = dest ^ src

The usual restrictions are in place: both operands must be of the same size, they cannot both be in memory, and only the source operand can be an immediate. not can be used with either a register or a memory location.

It may be useful to review the truth tables for these operations.

~a is 1 if a is 0 and 0 if a is 1:

a ~a

0 1

1 0
a & b is 1 only if both a and b are 1, otherwise it is 0.

a b a & b

0 0 0

1 0 0

0 1 0

1 1 1
a | b is 1 if either a or b is 1, and is 0 only if both of them are.

a b a OR b

0 0 0

1 0 1

0 1 1

1 1 1
a & ~b is 1 if a is 1 and b is 0, and 0 otherwise.

a b a & ~b

0 0 0

1 0 1

0 1 0

1 1 0
a ^ b is 1 if either of a or b is 1, but not when both or neither are.

a b a ^ b

0 0 0

1 0 1

0 1 1

1 1 0

a	`~a`
0	1
1	0

a	b	`a & b`
0	0	0
1	0	0
0	1	0
1	1	1

a	b	`a OR b`
0	0	0
1	0	1
0	1	1
1	1	1

a	b	`a & ~b`
0	0	0
1	0	1
0	1	0
1	1	0

a	b	`a ^ b`
0	0	0
1	0	1
0	1	1
1	1	0

The above tables demonstrate how the operations are applied to pairs of single bits. When applied to binary values, the operation is applied across all pairs of bits. For example,

   01101101
 & 11101110
------------ 
   01101100

Setting, unsetting, and flipping bits

Common operations on individual bits include setting a particular bit (forcing it to 1), unsetting (forcing to 0), and flipping (1 to 0 and 0 to 1). The first stage of this process is to construct a bitmask, a value which is 1 in the bits that we wish to manipulate, and 0 everywhere else. If the bit(s) we are interested in are constant, then this is easy: just write a binary constant:

bitmask:   EQU   10000000b    ; Manipulate bit 7

If the bit positions are not constant, then we can use shifts (see below) to construct the mask dynamically.

To set a bit, while leaving all other bits unchanged, we use or. x or 0 == x, so the bits of the mask which are unset will be unchanged, but x or 1 == 1, so the set bit in the mask will be forced to 1.

or  rax, bitmask

To unset (clear) a bit, we need a bitwise operation OP which satisfies x OP 0 == x and x OP 1 == 0. Looking at the above tables, we find that AND-NOT is what we want:

andn rax, bitmask

To flip a bit, we need an operation OP where x OP 0 == x but x OP 1 == ~x. Again, from the tables we see that XOR is what we want:

xor rax, bitmask

These can be used with bitmasks with more than one set bit, to set/clear/flip multiple bits at once (of course, the same operation is done to all bits; if you want to clear some bits and set others, you’ll need multiple masks and multiple operations).

Sign extension

Above, the assembler took care of expanding the value of bitmask to 64 bits (the size of rax), so that the sizes would match. However, the bitwise operations can actually be used with immediates smaller than the destination. We can force this behavior by specifying an explicit size for the bitmask immediate:

mov rax, 0
or  rax, byte bitmask ; What bit(s) does this set?

When the immediate value is (explicitly) smaller than the destination, it is sign extended to the size of the destination. Sign extension is a process used to convert a signed binary value with x bits to a signed binary value with y bits, with \(y > x\). To see why some conversion is necessary, let’s look at the twos-complement representation of -1 as a single byte. (Remember that we construct the twos-complement representation by flipping all the bits and adding 1.) So start with 00000001b, flip all the bits giving 11111110 and then add 1, resulting in 11111111b. Now we want to expand this to 16 bits. If we simply add 0s, then we have

0000000011111111b = 127 (!)

Remember that if the high bit is unset, the value is positive! Instead, we copy the highest bit of 11111111b into the new bits we add:

1111111111111111b = -1

Whenever we expand a signed value, we need to perform sign extension. Conversely, if we are expanding an unsigned value, then we must not perform sign extension, as it will give the wrong result. To expand an unsigned value, we perform zero extension, filling the high bits with zeros.

These conversions can be performed between registers using the operations movsx and movzx:

movsx dest, src      ; Sign-extend src into dest
movzx dest, src      ; Zero-extent src into dest

It’s up to you to keep track of whether you are dealing with signed or unsigned values and use the correct conversion operation!

Shifts and rotates

In C/C++ we have the << and >> operators, which perform left-bit-shift and right-bit-shift. They are typically used as “shortcuts” for multiplying/dividing by 2. To see why this works, consider the value 3 in binary:

00000011    = 3

If we shift the bits to the left, then what was the 1-valued bit becomes 2, and what was the 2-valued bit becomes 4, giving us

00000110    = 6

A 0 is shifted in at the low end, to fill in the unoccupied bit position. Shifting to the left one bit multiplies by 2. Shifting n bits to the left multiplies by 2ⁿ.

Similarly, shifting 6 to the right one bit gives us 3, so right-shifting is equivalent to dividing by 2ⁿ (rounding down). A 0 is shifted in at the high bit position.

For unsigned values, shifting works exactly as described above. When shifting signed values, we need to be careful of negative values. For example, -6 = 11111010b. If we shift this to the right while filling the high bit with a 0, we get

01111101 = 125 (!)

When right-shifting a signed value, we do not want to fill the high bits with zeros. Rather, we want to copy the existing high bit into them, so that the sign is preserved:

11111101 = -3

A sign-preserving right shift is called an arithmetic shift, while a zero-filling shift is called a logical shift. Assembly has both. Note that the distinction only matters for right shifting; for left shifting, we always fill the low bit(s) with 0.

The assembly instructions for shifting are

shl dest, a    ; Logical shift left
sal dest, a    ; Arithmetical shift left
shr dest, a    ; Logical shift right
sar dest, a    ; Arithmetical shift right

Although assembly has both shl and sal instructions, they perform the same operation: shift left, fill low bits with 0s.

The operand a, the number of bits to shift, is severely restricted: it must be either an 8-bit immediate, or the cl register (low 8 bits of rcx). The maximum shift value is 63.

When shifting, the high/low bit that is shifted away is placed in the carry flag CF. Multi-bit shifts behave as if multiple 1-bit shifts were performed in sequence. (I.e., the last bit shifted out will be placed in CF.)
For 1-bit shifts, the OF flag is set if the sign has changed.

Using bit-shifts to construct masks

We can use bitshifts to construct bitmasks on-the-fly, rather than hard-coding them. For example, suppose we want to construct the bitmask 00000000011001000b. This mask has bits 3, 6, and 7 set. To create this, we start by constructing a mask with 1 in the 0 bit position, and then shift it up by the desired amounts:

mov rax, 0  ; bitmask
mov rbx, 1

shl rbx, 3    ; Bit 3 of rbx is set 
or  rax, rbx  ; Bit 3 of rax is set
shl rbx, 3    ; Bit 6 of rbx is set
or  rax, rbx  ; Bits 3 and 6 of rax are set
shl rbx, 1    ; Bit 7 of rbx is set
or  rax, rbs  ; Bits 3, 6, and 7 are set

Rotations

In addition to shifting bits, assembly has the ability to rotate the bits left or right; the high/low bit that would otherwise be discarded (placed in CF) is simply moved around to the other end, to fill in the unoccupied bit position. Rotations do not have a direct mathematical analogue, and thus are less useful than shifts.

ror  dest, a        ; Rotate dest a bits to the right    
rol  dest, a        ; Rotate dest a bits to the left

As with bit-shifts, the amount a must be either an immediate or the cl register, and must be in the range of 0 to 63. The CF flag is set to a copy of the last bit to be “moved” from one end to the other.

Testing bits

The test instruction performs a bitwise AND and then sets up the SF and ZF flags accordingly. This can be used, together with a bitmask, to test whether a single bit (or set of bits) is on:

test rax, 00010000b
jnz target              ; Jump if bit 4 is set in rax

It can also be used to test whether a value is 0:

test rax, rax 
; If ZF == 1 then rax == 0

and also to test the sign of a value:

test rax, rax
; If SF == 1 then rax < 0

A shorter way to test a specific bit is to use the bt instruction:

bt a, b

bt copies the value of the b-th bit in a into the carry flag CF. Thus, if bit b is set in a, CF == 1, otherwise it equals 0. a can be either register or memory, and b can be either register or immediate. Thus, another way to test bit 4 of rax would be

bt rax, 4
jc target

There are variants of the bit-test instruction which not only copy the bit to CF, but also set, unset (“reset”), and flip (complement) the bit at the same time:

btc a, b    ; Bit test and complement
btr a, b    ; Bit test and reset
bts a, b    ; Bit test and set

Searching for bits

Sometimes we need to scan through a value and find the first bit on either the low end, or the high end, which is set. For example, in

00111000

the first lowest bit set is at position 3, while the first highest bit is at position 5. The instructions bsf (“bit scan forward”) and bsr (“bit scan reverse”) give us this information:

bsf dest, src
bsr dest, src

These scan through the source (register/memory) and write the bit position into the destination (register).

Finally, there are number of odd bit instructions which search for bits while flipping others and the like. These are blsi, bzhi, blsr, blsmask, bextr and a few others. Look them up if you are interested in what they do.

Application: pseudo-random number generation

It’s not really possible to generate true random numbers purely algorithmically on a computer; an algorithm is a series of steps, so if you repeat those steps exactly, you’ll get the same results. Rather, we seek “pseudo-randomness”, a sequence of values which appear to be random, if you ignore the fact that we could repeat them by re-running the algorithm. There are a number of algorithms for generating PRNGs. Most of these maintain some internal state, which persists from one number to the next. The numbers generated are derived from the state, often just consisting of the high n bits of it, but sometimes using more involved transformations.

The oldest and simplest is a linear congruential generator, which has the basic form

state = (state * A + B) % m

where A and B are constants that must be chosen carefully. Modern generators often exploit bitwise operations, as this allows more mixing of the bits in the state. The SplitMix generator, used in Java 8, uses a 64-bit state together with right-shifts and XORs, looks like this in C/C++:

static uint64_t x = ... ; // Seed/state

uint64_t next() {
    uint64_t z = (x += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;

    return z ^ (z >> 31); 
}

(Source: http://xoshiro.di.unimi.it/splitmix64.c)

Note that the value returned is not the internal state x, but a variant of it z. The state x advances quite simply, being incremented by a constant.

To implement this in assembly, we’ll store the state in memory, and then write a function to advance it and return the next random value.

section .data

state:      dq      137546  ; Seed value

section .text

next:
    push rbp
    mov rbp, rsp

    ; Constants
    mov r8,  0xbf58476d1ce4e5b9
    mov r9,  0x94d049bb133111eb
    mov r10, 0x9e3779b97f4a7c15 

    ; Update state
    add qword [state], r10
    mov rax, qword [state]

    ; z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9
    mov rbx, rax
    shr rbx, 30
    xor rax, rbx
    mul r8

    ; z = (z ^ (z >> 27)) * 0x94d049bb133111eb
    mov rbx, rax
    shr rbx, 27
    xor rax, rbx
    mul r9

    ; return z ^ (z >> 31)
    mov rbx, rax
    shr rbx, 31
    xor rax, rbx

    pop rbp
    ret

We can test this by writing a main which prints out the random bytes generated (note: not the random numbers, but the raw byte values) and then feed them to the PractRand test suite. Here’s the rest of the program:

section .data

state:      dq      137546  ; Seed value

buffer:     dq      0            

section .text

global _start
_start:

    push rbp
    mov rbp, rsp

.loop:

    call next
    mov [buffer], rax

    mov rax, 1          ; Write syscall
    mov rdi, 1          ; Stdout
    mov rsi, buffer     ; Address 
    mov rdx, 8          ; Length
    syscall

    jmp .loop

    pop rbp

    mov rax, 60
    mov rdi, 0
    syscall

next: 
    ... ; Remainder of next function

and then we can test it with

./splitmix | RNG_test stdin

This basically lets splitmix run forever, constantly spitting out random data, which is consumed by RNG_test. RNG_test runs a battery of statistical tests on the data it receives and reports any problems, for increasingly large blocks of data.