SIMD: Packed Floating-Point
If the xmm registers are 128 bits wide, but floating-point values are either 32 or 64 bits wide, what’s in the remaining bits? The answer is: more floating-point values! Each xmm register can be thought of as an array of 4 single-precision or 2 double-precision floating-point values packed into a single register. All of the above operations operate only on the “low” value in a register (in array notation, xmm[0]); they usually just copy the remaining elements unchanged. However, there are a number of operations which operate on all elements of the xmm “array” simultaneously. For example, the addps instruction adds all four single-precision floats simultaneously:
addps xmm0, xmm1
; Equivalent to:
; xmm0[0] += xmm1[0]
; xmm0[1] += xmm1[1]
; xmm0[2] += xmm1[2]
; xmm0[3] += xmm1[3]
; but all at the same time.
This takes the same amount of CPU time as a normal addss instruction, which does only a single addition. (The first s in ss is for “scalar”, while the p in ps is for “packed”; the final s means “single-precision”.)
This kind of parallelism is called single-instruction, multiple-data or SIMD for short. We issue a single instruction, but it is applied, in parallel, to multiple operands.
One thing to bear in mind: because these operations take the same amount of time as a normal, “unpacked” floating-point operation, we can use them even if we only have a single floating-point value to operate on; we just ignore all the other results. There are “packed” operations with no unpacked, single-element equivalents, but you should feel free to use them even in non-parallel code.
Another thing to notice is that many packed operations expect integer operands, or operate directly on the bits of their operands. Generally, these instructions don’t make sense on floating-point operands, except in one case: manipulating the sign bit. In both single- and double-precision floating point values, the sign bit is always the highest bit. Hence, we can use the packed bitwise AND/OR/XOR operations to manipulate the sign of floating-point numbers, without resorting to tricks like multiplying by -1:
; This sets up xmm1 so that only the high bit is set
pcmpeqd xmm1, xmm1 ; xmm1 == xmm1? True, sets all bits to 1
pslld xmm1, 31
xorps xmm0, xmm1 ; xmm0 = -xmm0
orps xmm0, xmm1 ; xmm0 = -abs(xmm0)
pandn xmm1, xmm0 ; xmm1 = abs(xmm0) (note: the result replaces the mask)
The pcmpeqd instruction compares double-word elements in the source and target for equality; each element is set to all 1s if the comparison was true, or all 0s if false. Comparing a register with itself thus sets xmm1 to all 1s. The pslld instruction performs a left shift of each element; shifting each element 31 bits to the left shifts in 31 0s, leaving a 1 only in the highest bit position (the sign bit’s position). xorps performs an XOR operation on the bits of each pair of elements; since only the sign bit is set in xmm1, this flips the sign bit of xmm0. orps performs an OR operation on the bits of each pair of elements, setting the sign bit of xmm0 to 1. pandn performs an AND-NOT operation: it first negates its destination operand, then ANDs the result with its source. With the mask as the destination, the negated mask is all 1s except for the sign bit, which is 0; ANDing this with xmm0 clears the sign bit while leaving the other bits unchanged, producing the absolute value.
This kind of trick can also be used to “generate” floating-point constants directly in a register, rather than loading them from memory.
Packed arithmetic operations
The simplest packed (SIMD) operations to understand are the arithmetic operations; these take an ordinary arithmetic operation (+, -, *, /) and apply it to every pair of elements at once.
“Vertical” vs. “Horizontal” operations
We can think of packed operations as “vertical”: we have many values and we want to apply the same operation to them all:
xmm0[0]   xmm0[1]   xmm0[2]   xmm0[3]
   +         +         +         +
xmm1[0]   xmm1[1]   xmm1[2]   xmm1[3]
   =         =         =         =
This is different from an operation of, say, summing up the elements of a sequence, which we could think of as “horizontal”:
a + b + c + d
In a vertical operation, no addition depends on any other addition, so we can do them all simultaneously, in the time of a single addition. In a horizontal operation, no matter how we break it up, some addition is going to have to wait for the result of another. The best we can do is
(a + b) + (c + d)
but even if we can do a + b
at the same time as c + d
, the outer addition
must wait until both these are done. Horizontal operations have
data dependencies in them: one part of the operation must wait for some
other part.
Complexity theorists have worked out the amount of speed-up we could expect if we were allowed an infinite amount of parallelism. E.g., if we had a CPU which could perform an infinite number of additions simultaneously, then we could perform a vertical addition, of any size, in constant time, \(O(1)\). But a horizontal addition, because of its data dependencies, will still involve some waiting. In fact, even with infinite parallelism, horizontal, data-dependent operations can, at best, be done in logarithmic, \(O(\log n)\) time. No matter how much parallelism we have, we can never get them down to \(O(1)\).
SSE3 does provide a pair of “horizontal” add instructions:

haddps dest, src ; dest[0] = dest[0]+dest[1], dest[1] = dest[2]+dest[3],
                 ; dest[2] = src[0]+src[1],  dest[3] = src[2]+src[3]
haddpd dest, src ; dest[0] = dest[0]+dest[1], dest[1] = src[0]+src[1]

Note that to sum all four elements of a register, we would need to execute two haddps instructions. hadd and hsub are the only two horizontal operations supported; all other packed operations are vertical.
Moving packed data
The movss/movsd instructions we used for scalar floating-point also have a use for packed data: when both operands are registers, the low element of the source is copied to the destination, while the remaining elements of the destination are left unchanged. Thus, they can be used to overwrite just the low element. (When the source is memory, the upper elements of the destination are zeroed instead.)
The simplest packed mov instruction is movups/movupd: this moves 128 bits of data from memory into an xmm register, from an xmm register into memory, or from one xmm register to another. The data is interpreted as either four single-precision floats or two double-precision floats. The u stands for unaligned: when moving to/from memory, these instructions do not require the address to be aligned to a multiple of 16 bytes.
movups dest, src
movupd dest, src
If you know that your memory data is aligned to multiples of 16 bytes, an aligned move can be faster:
movaps dest, src
movapd dest, src
If you try to use an aligned mov
on an unaligned address, your program will
crash.
If you need to load a single value into all elements of an xmm register, you can use the (AVX) vbroadcast instructions:
vbroadcastss dest, mem ; load dword from [mem] into all elements of dest
vbroadcastsd dest, mem ; load qword from [mem] into both elements of dest
Moving within a register
There are a number of instructions that can be used to move the packed values around within a register. For example, the movsldup instruction copies the even elements of the source register into both the even and odd elements of the destination, duplicating them. That is,
movsldup xmm0, xmm1
is equivalent to
xmm0[0] = xmm0[1] = xmm1[0]
xmm0[2] = xmm0[3] = xmm1[2]
The movshdup instruction does the same thing, but with the odd elements of the source.
Packed arithmetic
All of the normal floating-point operations have packed variants:
addps dest, src
addpd dest, src
subps dest, src
subpd dest, src
mulps dest, src
mulpd dest, src
divps dest, src
divpd dest, src
Packed conversion
All of the cvt instructions have packed equivalents. When converting to/from integers, the packed versions convert not into general-purpose registers but into xmm registers.
Packed comparisons
Packed comparisons work completely differently from normal comparisons. Because the same operation(s) must be applied to all elements of an xmm register, it wouldn’t be useful to write the comparison results to the flags register (which element’s result would we write?), and even if it were, how would we use those results, since (depending on the comparison) we might need to perform different operations on the different elements of the register?
Packed comparisons instead write their results into another xmm register, as all-1s (true) or all-0s (false) values. These results can then be used as masks, to avoid manipulating particular elements, or ANDed in, to zero certain elements, etc. This extends the conditional-move paradigm, where we have a sequence of instructions which is always executed, but portions of it are conditionally activated/deactivated depending on data values.
cmpeqps dest, src ; ==
cmpltps dest, src ; <
cmpleps dest, src ; <=
cmpneps dest, src ; !=
cmpnltps dest, src ; >=
cmpnleps dest, src ; >
cmpordps dest, src ; True if neither operand is NaN
If the result of the comparison is true, then the corresponding element of dest is set to all 1s; otherwise it is set to all 0s. (All of these are actually just aliases for the cmpps instruction with various values of its third, immediate operand.)
Packed-equivalents to float operations
Operation | Description |
---|---|
maxps dest, src | dest = max(dest, src), element-wise |
minps dest, src | dest = min(dest, src), element-wise |
sqrtps dest, src | dest = sqrt(src) |
rcpps dest, src | dest ≈ 1/src (fast approximation) |
rsqrtps dest, src | dest ≈ 1/sqrt(src) (fast approximation) |
roundps dest, src, mode | Round src to nearest integer into dest, using rounding mode mode |
dpps dest, src, imm | Computes the dot product (lanes selected by imm) |
Packed integer operations
There are a large number of packed integer operations; all of these instructions start with p. E.g., paddb adds sixteen (!) byte integers with the normal overflow behavior; paddw adds eight words, paddd adds four double-words, and paddq adds two quad-words. In general, these are not useful on floating-point operands, except for the bitwise operations, which can be used to manipulate the sign bit. If you’re very clever, you can manipulate the exponent and mantissa directly, doing things like adding 1 to the exponent field to multiply the number by 2.
Example: audio mixing
Almost the last step in the process of producing audio is mixing: we take two (or however many streams are playing) arrays containing audio samples, add them together sample by sample, and then finally scale the result down so that it’s not too loud. That is, we do

result = (stream1 + stream2) * gain

where gain is a value between 0 and 1. Since this operation has to be done over many samples, and there is no inter-sample dependency, it’s a perfect application for SIMD.
section .data
stream_len: equ 256
stream1: TIMES stream_len dd 0.2
stream2: TIMES stream_len dd -0.3
output: TIMES stream_len dd 0.0
scale: dd 0.75
section .text
global main
main:
push rbp
mov rbp, rsp
mov rcx, stream_len / 4 ; 4 samples (16 bytes) per iteration
mov rdi, 0
vbroadcastss xmm4, [scale]
.begin_loop:
; Load stream1 elements into xmm1
movups xmm1, [stream1 + 8*rdi]
; Load stream2 elements into xmm2
movups xmm2, [stream2 + 8*rdi]
; Add and scale
addps xmm1, xmm2
mulps xmm1, xmm4
; Write to output
movups [output + 8*rdi], xmm1
; rdi += 2
inc rdi
inc rdi
loop .begin_loop
; Audio stream is in the output array
pop rbp
mov rax, 0
ret
Instruction-level parallelism
The vector operations described above are an example of instruction-level parallelism; a single instruction causes the CPU to “do” multiple operations. Vector instructions are an example of explicit ILP: we tell the CPU what to do. Modern CPU architectures also exploit implicit ILP, figuring out for themselves when multiple instructions can be “overlapped” and doing so.
This instruction overlapping is done by using an instruction pipeline. Unfortunately, the structure of the pipeline is not part of the public specification for x86 CPUs. It exists, but it’s not required to take on any specific form or behavior, so you can’t really rely on it. In order to study pipelining, we’re going to switch to a different architecture: MIPS. MIPS is an example of a reduced instruction-set architecture; instead of having many complex-but-useful instructions, it has a few simple-but-fast instructions. This simplicity of the instruction set means that the pipeline behavior is part of the CPU’s specification; programmers and compiler writers can rely on it behaving in a particular way.
The classic MIPS pipeline
In the x86 instruction set, any instruction can access memory: the only restriction is that a single instruction cannot both read and write memory (both operands cannot be memory at the same time). This complicates the design of x86 CPUs, as the process of executing any instruction may need to engage the component of the CPU which deals with memory. Even worse, the “stage” in the instruction at which this can happen varies: sometimes it needs to happen at the beginning of the instruction (when reading from memory) and sometimes at the end (when writing to memory).
MIPS simplifies this by imposing a severe restriction: there are only two instructions which access memory, all other instructions operate on registers only. (MIPS had 32 registers to compensate for this.) The two instructions that access memory are
- LOD reg, mem: Load from address mem into register reg. mem can be a direct memory address, or a register-indirect address, consisting of a register offset + a constant base address.
- STO mem, reg: Store the value in register reg into memory address mem. mem can be a register containing an address.
Note that memory addresses are either immediates (constants) or registers, but
not calculations as are allowed on x86. Thus, these two instructions are
much simpler than even the mov
instruction we are used to.
All other instructions operate purely on registers:
- Arithmetic: ADD r1, r2, r3 and SUB r1, r2, r3 compute r1 = r2 + r3 or r1 = r2 - r3. Multiplication and division, along with all floating-point operations, are special in that they are multi-cycle operations.
- Bitwise operations: AND, OR, shifts, etc.
- Conditional and unconditional branches: J (unconditional jump), BLTZ (conditional branch if less than 0), etc. Jumps to register addresses are supported.
Instruction execution stages
Each and every instruction’s execution is broken down into five stages; not every instruction needs all five (i.e., sometimes some stages are “turned off”), but no stage is ever repeated or executed out of order. The stages are
Instruction fetch (IF). The instruction, as bytes, is read in from memory or the I-cache. Unlike x86, MIPS instructions are always 4 bytes big. (x86 instructions can be anywhere from 1 to 15 bytes, depending on the instruction and its prefixes.) The instruction is saved on the CPU in a special register. This stage also increments the program counter, the MIPS-equivalent to the instruction pointer register.
Instruction decode (ID). The instruction is decoded, configuring the remaining stages for what needs to be done. This does things like identify the registers used and latch them in to later stages, turn later stages off if they are not needed, etc.
Execute (EX). Performs the actual computation, generally arithmetic or bitwise operations. The execute stage has at most two inputs (connected to the desired registers in ID) and a single output. For the memory-accessing instructions LOD and STO, the EX stage computes the address by adding the constant base address to the offset register.
Memory access (MEM). Accesses memory, either loading a value from memory into a register, or storing a value from a register into memory.
Writeback (WB). The results of the EX stage are written back into the output register.
The MIPS CPU was designed so that each stage would take exactly 1 clock cycle to execute. Thus, every normal instruction would execute in 5 cycles.
A simple execution model would be: every instruction runs through all five cycles, before the next instruction can begin. This would lead to execution like this (of two instructions, I1 followed by I2):
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | … |
---|---|---|---|---|---|---|---|---|
IF | I1 | | | | | I2 | | |
ID | | I1 | | | | | I2 | |
EX | | | I1 | | | | | … |
MEM | | | | I1 | | | | |
WB | | | | | I1 | | | |
If we have n instructions to execute, then the number of cycles taken is simply 5n.
Each of these stages is largely independent of the others, hence, there’s no reason why we can’t overlap multiple instructions, so long as they are in different stages. This leads to the idea of pipelining: instead of an instruction waiting until the previous instruction has finished all stages, it only waits until it has finished IF. This allows instructions to overlap, giving us the following execution:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | … |
---|---|---|---|---|---|---|---|
IF | I1 | I2 | | | | | |
ID | | I1 | I2 | | | | |
EX | | | I1 | I2 | | | |
MEM | | | | I1 | I2 | | |
WB | | | | | I1 | I2 | |
The two instructions now execute in 6 cycles instead of 10, and the total number of cycles for any program is now \(n + 5 - 1\). It takes us 4 cycles to fill the pipeline at the beginning, but after that each instruction effectively takes one cycle to execute. (If the pipeline stages take less than 1 cycle each to execute, this leads to an effective cycles-per-instruction less than 1.)
Modern CPUs use pipelines to overlap instructions and improve performance, lowering the effective CPI without needing a faster clock. However, there are a number of situations in which one instruction must wait for another instruction. These pipeline stalls occur in three main contexts:
Structural hazards, where a given execution stage would be needed by two instructions at the same time. This can occur if, e.g., the ID stage implicitly uses the CPU’s arithmetic unit, while it is also used by the EX stage. The normal solution is to add extra units to the CPU.
Data hazards, where a later instruction depends on the results of a previous instruction.
Control hazards, where the path of execution through the program itself depends on a previous instruction.
Structural/data hazards can be triggered by almost any instruction, while control dependencies mostly occur due to branching.
An example of a data hazard would be adding two values together, and then subtracting 3 from the result. If done as two separate instructions, A1 and S2, the subtraction (S2) must wait for the addition’s (A1’s) WB stage to complete before it can execute its own EX stage:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | A1 | S2 | | | | | | |
ID | | A1 | S2 | | | | | |
EX | | | A1 | .. | .. | S2 | | |
MEM | | | | A1 | | | S2 | |
WB | | | | | A1 | | | S2 |
Due to the pipeline stall, the two instructions now require 8 cycles to complete.
A similar situation arises if we perform a LOD instruction L1, and then use the loaded value in an arithmetic instruction ADD A2. The ADD must wait for the LOD to complete its WB stage:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | L1 | A2 | | | | | | |
ID | | L1 | A2 | | | | | |
EX | | | L1 | .. | .. | A2 | | |
MEM | | | | L1 | | | A2 | |
WB | | | | | L1 | | | A2 |
There are a number of solutions to this problem:
Pipeline interlock: this is the solution used above, simply make the pipeline wait until the results are available. This is necessary for instructions which access memory, as there is no “shortcut” to getting the data.
Bypassing: allows the EX stage to send its results directly to the input of another EX stage, rather than waiting until WB. This configuration must be detected by ID and the CPU configured accordingly. This would help in the situation above: because the results of the addition are already on the CPU, we don’t really need to stall, we just need to get the output back to the input.
A control hazard, resulting from (say) a conditional branch, is even worse. Classic MIPS resolves the target of conditional branches as early as possible, during ID. However, this is already one cycle too late, as the next instruction is already being IFetched. If the branch is taken, then we have to purge this instruction from the pipeline, and restart the pipeline at the branch location, a penalty of several cycles. E.g., consider a program
J1: je T3
I2: ...
T3: ... ; Branch target
This leads to the following pipeline timeline:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | J1 | I2 | T3 | | | | | |
ID | | J1 | (purge) | T3 | | | | |
EX | | | J1 | | T3 | | | |
MEM | | | | J1 | | T3 | | |
WB | | | | | J1 | | T3 | |
There are several strategies available for dealing with control hazards:
Branch prediction: the CPU may “guess” that branches are always taken, or never taken.
- In the former case, it will always delay one cycle (so that the branch target can be loaded after ID). If the branch is not taken then the pipeline must be purged and restarted.
- In the latter case, there is no delay if the branch is not taken, but if it is, the pipeline must be purged.
- Branch instructions may be tagged with “hints” telling the CPU whether the branch is likely or not.
- The CPU may maintain a branch history for each branch instruction, recording which way the branch went for a small number of recent executions. This history can be used to predict future branches.
Branch delay slot: branch instructions do not take effect immediately. Instead, the instruction after a branch is always executed, even if the branch was taken! It is the responsibility of the programmer/compiler author to ensure that the instruction immediately after a conditional branch is “safe” to execute, regardless of which way the branch went. (In practice, this is very difficult for many programs, so often a do-nothing NO-OP instruction would be used, effectively delaying the pipeline for a few cycles anyway.)
With a delayed branch, the branch doesn’t actually happen until 1 cycle after the branch instruction. This leads to complicated program semantics.
Multi-cycle instructions
Instructions like multiplication, division, and all floating-point operations are difficult in that, by their very nature, they require multiple cycles in the EX stage. These instructions were handled specially, always stalling the instructions after them.