SIMD: Packed Floating-Point
If the xmm registers are 128 bits wide, but floating-point values are either 32 or 64 bits wide, what’s in the remaining bits? The answer is: more floating-point values! Each xmm register can be thought of as an array of 4 single-precision or 2 double-precision floating-point values packed into a single register. All of the above operations operate only on the “low” value in a register (in array notation, xmm[0]); they usually just copy the remaining elements unchanged. However, there are a number of operations which operate on all elements of the xmm “array” simultaneously. For example, the addps instruction adds all four single-precision floats simultaneously:
addps xmm0, xmm1
; Equivalent to:
; xmm0[0] += xmm1[0]
; xmm0[1] += xmm1[1]
; xmm0[2] += xmm1[2]
; xmm0[3] += xmm1[3]
; but all at the same time.
This takes the same amount of CPU time as a normal addss instruction, which does only a single addition. (The first s in ss is for “scalar”, while the p in ps is for “packed”; the final s means “single-precision”.)
This kind of parallelism is called single-instruction, multiple-data or SIMD for short. We issue a single instruction, but it is applied, in parallel, to multiple operands.
One thing to bear in mind: because these operations take the same amount of time as a normal, “unpacked” floating-point operation, we can use them even if we only have a single floating-point value to operate on; we just ignore all the other results. There are “packed” operations with no unpacked, single-element equivalents, but you should feel free to use them even in non-parallel code.
Another thing to notice is that many packed operations expect integer operands, or operate directly on the bits of their operands. Generally, these instructions don’t make sense on floating-point operands, except in one case: manipulating the sign bit. In both single- and double-precision floating point values, the sign bit is always the highest bit. Hence, we can use the packed bitwise AND/OR/XOR operations to manipulate the sign of floating-point numbers, without resorting to tricks like multiplying by -1:
; This sets up xmm1 so that only the high bit is set
pcmpeqd xmm1, xmm1 ; xmm1 == xmm1? True, sets all bits to 1
pslld xmm1, 31
xorps xmm0, xmm1 ; xmm0 = -xmm0
orps xmm0, xmm1 ; xmm0 = -abs(xmm0)
pandn xmm1, xmm0 ; xmm1 = abs(xmm0) (note: the result replaces the mask)
The pcmpeqd instruction compares double-word elements in the source and target for equality; each element is set to all 1s if the comparison was true, or all 0s if false. Comparing a register with itself thus sets xmm1 to all 1s. The pslld instruction performs a left shift of each element; shifting each element 31 bits to the left shifts in 31 0s, leaving a 1 only in the highest bit position (the sign bit’s position). xorps performs an XOR operation on the bits of each pair of elements; since only the sign bit is set in xmm1, this flips the sign bit of xmm0. orps performs an OR operation on the bits of each pair of elements, setting the sign bit of xmm0 to 1. pandn performs an AND-NOT operation: it first negates its destination operand, then ANDs the result with its source. With the mask as the destination, the negated mask is all 1s except for the sign bit, which is 0; ANDing this with xmm0 clears the sign bit while leaving the other bits unchanged, producing the absolute value.
This kind of trick can also be used to “generate” floating-point constants directly in a register, rather than loading them from memory.
Packed arithmetic operations
The simplest packed (SIMD) operations to understand are the arithmetic operations; these take an ordinary arithmetic operation (+, -, *, /) and apply it to every pair of elements at once.
“Vertical” vs. “Horizontal” operations
We can think of packed operations as “vertical”: we have many values and we want to apply the same operation to them all:
xmm0[0]   xmm0[1]   xmm0[2]   xmm0[3]
   +         +         +         +
xmm1[0]   xmm1[1]   xmm1[2]   xmm1[3]
   =         =         =         =
This is different from an operation of, say, summing up the elements of a sequence, which we could think of as “horizontal”:
a + b + c + d
In a vertical operation, no addition depends on any other addition, so we can do them all simultaneously, in the time of a single addition. In a horizontal operation, no matter how we break it up, some addition is going to have to wait for the result of another. The best we can do is
(a + b) + (c + d)
but even if we can do a + b
at the same time as c + d
, the outer addition
must wait until both these are done. Horizontal operations have
data dependencies in them: one part of the operation must wait for some
other part.
Complexity theorists have worked out the amount of speed-up we could expect if we were allowed an infinite amount of parallelism. E.g., if we had a CPU which could perform an infinite number of additions simultaneously, then we could perform a vertical addition, of any size, in constant time, \(O(1)\). But a horizontal addition, because of its data dependencies, will still involve some waiting. In fact, even with infinite parallelism, horizontal, data-dependent operations can, at best, be done in logarithmic, \(O(\log n)\) time. No matter how much parallelism we have, we can never get them down to \(O(1)\).
SSE3 does provide a pair of “horizontal” add instructions:

haddps dest, src ; dest[0] = dest[0]+dest[1], dest[1] = dest[2]+dest[3],
                 ; dest[2] = src[0]+src[1],  dest[3] = src[2]+src[3]
haddpd dest, src ; dest[0] = dest[0]+dest[1], dest[1] = src[0]+src[1]

Note that to sum all four elements of a register, we would need to execute two haddps instructions. hadd and hsub are the only two horizontal operations supported; all other packed operations are vertical.
Moving packed data
The movss/movsd instructions we used for scalar floating-point also have a use for packed data: when both operands are registers, the low element of the source is copied to the destination, while the remaining elements of the destination are left unchanged. Thus, they can be used to overwrite just the low element. (When the source is memory, the upper elements of the destination are zeroed instead.)
The simplest packed mov instruction is movups/movupd: this moves 128 bits of data from memory into an xmm register, from an xmm register into memory, or from one xmm register to another. The data is interpreted as either four single-precision floats or two double-precision floats. The u stands for unaligned: when moving to/from memory, these instructions do not require the address to be aligned to a multiple of 16 bytes.
movups dest, src
movupd dest, src
If you know that your memory data is aligned to multiples of 16 bytes, an aligned move can be faster:
movaps dest, src
movapd dest, src
If you try to use an aligned mov
on an unaligned address, your program will
crash.
If you need to load a single value into all elements of an xmm register, you can use the (AVX) vbroadcast instructions:
vbroadcastss dest, mem ; load dword from [mem] into all elements of dest
vbroadcastsd dest, mem ; load qword from [mem] into both elements of dest
Moving within a register
There are a number of instructions that can be used to move the packed values around within a register. For example, the movsldup instruction copies the even elements of the source register into both the even and odd elements of the destination, duplicating them. That is,
movsldup xmm0, xmm1
is equivalent to
xmm0[0] = xmm0[1] = xmm1[0]
xmm0[2] = xmm0[3] = xmm1[2]
The movshdup instruction does the same thing, but with the odd elements of the source.
Packed arithmetic
All of the normal floating-point operations have packed variants:
addps dest, src
addpd dest, src
subps dest, src
subpd dest, src
mulps dest, src
mulpd dest, src
divps dest, src
divpd dest, src
Packed conversion
All of the cvt instructions have packed equivalents. When converting to/from integers, the packed versions convert not into general-purpose registers but into xmm registers.
Packed comparisons
Packed comparisons work completely differently from normal comparisons. Because the same operation(s) must be applied to all elements of an xmm register, it wouldn’t be useful to write the comparison results to the flags register (which element’s result would we write?), and even if it were, how would we use those results, since (depending on the comparison) we might need to perform different operations on the different elements of the register?
Packed comparisons instead write their results into another xmm register, as all-1s (true) or all-0s (false) values. These results can then be used as masks, to avoid manipulating particular elements, or ANDed in, to zero certain elements, etc. This extends the conditional-move paradigm, where we have a sequence of instructions which is always executed, but portions of it are conditionally activated/deactivated depending on data values.
cmpeqps dest, src ; ==
cmpltps dest, src ; <
cmpleps dest, src ; <=
cmpneps dest, src ; !=
cmpnltps dest, src ; >=
cmpnleps dest, src ; >
cmpordps dest, src ; True if neither operand is NaN
If the result of the comparison is true, then the corresponding element of dest is set to all 1s; otherwise it is set to all 0s. (All of these are actually just aliases for the cmpps instruction with various values of its third, immediate operand.)
Packed-equivalents to float operations
Operation | Description |
---|---|
maxps dest, src | dest = max(dest, src), element-wise |
minps dest, src | dest = min(dest, src), element-wise |
sqrtps dest, src | dest = sqrt(src) |
rcpps dest, src | dest ≈ 1/src (fast approximation) |
rsqrtps dest, src | dest ≈ 1/sqrt(src) (fast approximation) |
roundps dest, src, mode | Round src to nearest integer into dest, using rounding mode mode |
dpps dest, src, imm | Computes the dot product (lanes selected by imm) |
Packed integer operations
There are a large number of packed integer operations; all of these instructions start with p. E.g., paddb adds sixteen (!) byte integers with the normal overflow behavior; paddw adds eight words, paddd adds four double-words, and paddq adds two quad-words. In general, these are not useful on floating-point operands, except for the bitwise operations, which can be used to manipulate the sign bit. If you’re very clever, you can manipulate the exponent and mantissa directly, doing things like adding 1 to the exponent field to multiply the number by 2.
Example: audio mixing
Almost the last step in the process of producing audio is mixing: we take two (or however many streams are playing) arrays containing audio samples, add them together sample by sample, and then finally scale the result down so that it’s not too loud. That is, we do

result = (stream1 + stream2) * gain

where gain is a value between 0 and 1. Since this operation has to be done over many samples, and there is no inter-sample dependency, it’s a perfect application for SIMD.
section .data
stream_len: equ 256
stream1: TIMES stream_len dd 0.2
stream2: TIMES stream_len dd -0.3
output: TIMES stream_len dd 0.0
scale: dd 0.75
section .text
global main
main:
push rbp
mov rbp, rsp
mov rcx, stream_len / 4 ; 4 samples (16 bytes) per iteration
mov rdi, 0
vbroadcastss xmm4, [scale]
.begin_loop:
; Load stream1 elements into xmm1
movups xmm1, [stream1 + 8*rdi]
; Load stream2 elements into xmm2
movups xmm2, [stream2 + 8*rdi]
; Add and scale
addps xmm1, xmm2
mulps xmm1, xmm4
; Write to output
movups [output + 8*rdi], xmm1
; rdi += 2
inc rdi
inc rdi
loop .begin_loop
; Audio stream is in the output array
pop rbp
mov rax, 0
ret
Instruction-level parallelism
The vector operations described above are an example of instruction-level parallelism; a single instruction causes the CPU to “do” multiple operations. Vector instructions are an example of explicit ILP: we tell the CPU what to do. Modern CPU architectures also exploit implicit ILP, figuring out for themselves when multiple instructions can be “overlapped” and doing so.
This instruction overlapping is done by using an instruction pipeline. Unfortunately, the structure of the pipeline is not part of the public specification for x86 CPUs. It exists, but it’s not required to take on any specific form or behavior, so you can’t really rely on it. In order to study pipelining, we’re going to switch to a different architecture: MIPS. MIPS is an example of a reduced instruction-set architecture; instead of having many complex-but-useful instructions, it has a few simple-but-fast instructions. This simplicity of the instruction set means that the pipeline behavior is part of the CPU’s specification; programmers and compiler writers can rely on it behaving in a particular way.
The classic MIPS pipeline
In the x86 instruction set, any instruction can access memory: the only restriction is that a single instruction cannot both read and write memory (both operands cannot be memory at the same time). This complicates the design of x86 CPUs, as the process of executing any instruction may need to engage the component of the CPU which deals with memory. Even worse, the “stage” in the instruction at which this can happen varies: sometimes it needs to happen at the beginning of the instruction (when reading from memory) and sometimes at the end (when writing to memory).
MIPS simplifies this by imposing a severe restriction: there are only two instructions which access memory, all other instructions operate on registers only. (MIPS had 32 registers to compensate for this.) The two instructions that access memory are
- LOD reg, mem: Load from address mem into register reg. mem can be a direct memory address, or a register-indirect address, consisting of a register offset + a constant base address.
- STO mem, reg: Store the value in register reg into memory address mem. mem can be a register containing an address.
Note that memory addresses are either immediates (constants) or registers, but
not calculations as are allowed on x86. Thus, these two instructions are
much simpler than even the mov
instruction we are used to.
All other instructions operate purely on registers:
- Arithmetic: ADD r1, r2, r3 and SUB r1, r2, r3 compute r1 = r2 + r3 or r1 = r2 - r3. Multiplication and division, along with all floating-point operations, are special in that they are multi-cycle operations.
- Bitwise operations: AND, OR, shifts, etc.
- Conditional and unconditional branches: J (unconditional jump), BLTZ (conditional branch if less than 0), etc. Jumps to register addresses are supported.
Instruction execution stages
Each and every instruction’s execution is broken down into five stages; not every instruction needs all five (i.e., sometimes some stages are “turned off”), but no stage is ever repeated or executed out of order. The stages are
Instruction fetch (IF). The instruction, as bytes, is read in from memory or the I-cache. Unlike x86, MIPS instructions are always 4 bytes big. (x86 instructions can be anywhere from 1 to 15 bytes, depending on the instruction and its prefixes.) The instruction is saved on the CPU in a special register. This stage also increments the program counter, the MIPS-equivalent to the instruction pointer register.
Instruction decode (ID). The instruction is decoded, configuring the remaining stages for what needs to be done. This does things like identify the registers used and latch them in to later stages, turn later stages off if they are not needed, etc.
Execute (EX). Performs the actual computation, generally arithmetic or bitwise operations. The execute stage has at most two inputs (connected to the desired registers in ID) and a single output. For the memory-accessing instructions LOD and STO, the EX stage computes the address by adding the constant base address to the offset register.
Memory access (MEM). Accesses memory, either loading a value from memory into a register, or storing a value from a register into memory.
Writeback (WB). The results of the EX stage are written back into the output register.
The MIPS CPU was designed so that each stage would take exactly 1 clock cycle to execute. Thus, every normal instruction would execute in 5 cycles.
A simple execution model would be: every instruction runs through all five cycles, before the next instruction can begin. This would lead to execution like this (of two instructions, I1 followed by I2):
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | … |
---|---|---|---|---|---|---|---|---|
IF | I1 | | | | | I2 | | |
ID | | I1 | | | | | I2 | |
EX | | | I1 | | | | | … |
MEM | | | | I1 | | | | |
WB | | | | | I1 | | | |
If we have n instructions to execute, then the number of cycles taken is simply 5n.
Each of these stages is largely independent of the others, hence, there’s no reason why we can’t overlap multiple instructions, so long as they are in different stages. This leads to the idea of pipelining: instead of an instruction waiting until the previous instruction has finished all stages, it only waits until it has finished IF. This allows instructions to overlap, giving us the following execution:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | … |
---|---|---|---|---|---|---|---|
IF | I1 | I2 | | | | | |
ID | | I1 | I2 | | | | |
EX | | | I1 | I2 | | | |
MEM | | | | I1 | I2 | | |
WB | | | | | I1 | I2 | |
The two instructions now execute in 6 cycles instead of 10, and the total number of cycles for any program is now \(n + 5 - 1\). It takes us 4 cycles to fill the pipeline at the beginning, but after that each instruction effectively takes one cycle to execute. (If the pipeline stages take less than 1 cycle each to execute, this leads to an effective cycles-per-instruction less than 1.)
Modern CPUs use pipelines to overlap instructions and improve performance, lowering the effective CPI without needing a faster clock. However, there are a number of situations in which one instruction must wait for another instruction. These pipeline stalls occur in three main contexts:
Structural hazards, where a given execution stage would be needed by two instructions at the same time. This can occur if, e.g., the ID stage implicitly uses the CPU’s arithmetic unit, while it is also used by the EX stage. The normal solution is to add extra units to the CPU.
Data hazards, where a later instruction depends on the results of a previous instruction.
Control hazards, where the path of execution through the program itself depends on a previous instruction.
Structural/data hazards can be triggered by almost any instruction, while control dependencies mostly occur due to branching.
An example of a data hazard would be adding two values together, and then subtracting 3 from the result. If done as two separate instructions, A1 and S2, the subtraction (S2) must wait for the addition’s (A1’s) WB stage to complete before it can execute its own EX stage:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | A1 | S2 | | | | | | |
ID | | A1 | S2 | | | | | |
EX | | | A1 | .. | .. | S2 | | |
MEM | | | | A1 | | | S2 | |
WB | | | | | A1 | | | S2 |
Due to the pipeline stall, the two instructions now require 8 cycles to complete.
A similar situation arises if we perform a LOD instruction L1, and then use the loaded value in an arithmetic instruction ADD A2. The ADD must wait for the LOD to complete its WB stage:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | L1 | A2 | | | | | | |
ID | | L1 | A2 | | | | | |
EX | | | L1 | .. | .. | A2 | | |
MEM | | | | L1 | | | A2 | |
WB | | | | | L1 | | | A2 |
There are a number of solutions to this problem:
Pipeline interlock: this is the solution used above, simply make the pipeline wait until the results are available. This is necessary for instructions which access memory, as there is no “shortcut” to getting the data.
Bypassing: allows the EX stage to send its results directly to the input of another EX stage, rather than waiting until WB. This configuration must be detected by ID and the CPU configured accordingly. This would help in the situation above: because the results of the addition are already on the CPU, we don’t really need to stall, we just need to get the output back to the input.
A control hazard, resulting from (say) a conditional branch, is even worse. Classic MIPS resolves the target of conditional branches as early as possible, during ID. However, this is already one cycle too late, as the next instruction is already being IFetched. If the branch is taken, then we have to purge this instruction from the pipeline, and restart the pipeline at the branch location, a penalty of several cycles. E.g., consider a program
J1: je T3
I2: ...
T3: ... ; Branch target
This leads to the following pipeline timeline:
Time: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
IF | J1 | I2 | T3 | | | | | |
ID | | J1 | (purge) | T3 | | | | |
EX | | | J1 | | T3 | | | |
MEM | | | | J1 | | T3 | | |
WB | | | | | J1 | | T3 | |
There are several strategies available for dealing with control hazards:
Branch prediction: the CPU may “guess” that branches are always taken, or never taken.
- In the former case, it will always delay one cycle (so that the branch target can be loaded after ID). If the branch is not taken then the pipeline must be purged and restarted.
- In the latter case, there is no delay if the branch is not taken, but if it is, the pipeline must be purged.
- Branch instructions may be tagged with “hints” telling the CPU whether the branch is likely or not.
- The CPU may maintain a branch history for each branch instruction, recording which way the branch went for a small number of recent executions. This history can be used to predict future branches.
Branch delay slot: branch instructions do not take effect immediately. Instead, the instruction after a branch is always executed, even if the branch was taken! It is the responsibility of the programmer/compiler author to ensure that the instruction immediately after a conditional branch is “safe” to execute, regardless of which way the branch went. (In practice, this is very difficult for many programs, so often a do-nothing NO-OP instruction would be used, effectively delaying the pipeline for a few cycles anyway.)
With a delayed branch, the branch doesn’t actually happen until 1 cycle after the branch instruction. This leads to complicated program semantics.
Multi-cycle instructions
Instructions like multiplication, division, and all floating-point operations are difficult in that, by their very nature, they require multiple cycles in the EX stage. These instructions were handled specially, always stalling the instructions after them.