Floating-point
For historical reasons, x86-64 has two separate floating-point systems, with some overlap in capabilities between them, but some distinct abilities as well. The two systems are:
The old x87 coprocessor instruction set
The newer XMM vector processing instructions
Either way, floating point operations use a completely different set of registers, as well as a completely different set of operations. Floating-point values are represented in binary in a complete different way from either signed or unsigned values.
Floating point representation
There are three floating-point sizes/representations available to us,
corresponding to float
(32-bit), double
(64-bit) and long double
(80-bit,
stored as 128-bit with 48 unused padding bits).
The representations are similar, the only difference is the number of bits
dedicated to each part of the number. The actual floating point format
(which bit does what) is defined by an international standard called
IEEE-754.
IEEE-754 floating-point formats represent fractional values as a combination of three fields:
A sign bit s, set if the value is negative
An exponent s, represented as a (biased) signed binary value. The value is stored as the actual exponent plus a fixed bias value b. For 32-bit floating point values, the bias is 127. This means that an exponent of 0 is stored internally as
01111111b
, -1 would be01111110b
, 1 would be10000000b
, and so forth.A fractional part called the mantissa, m, which is normally in the range [1,2) (i.e., ≥ 1 but < 2). In a normalized floating-point value, the fractional part is shifted to the left, so that the first set bit is shifted out, as there is almost always an implicit 1. bit to the left of the value. (Shifting the mantissa requires incrementing/decrementing the exponent to preserve the same value.)
0 is handled as a special case.
The value of a floating-point number using these fields is
As an example, consider the value 0.75. In binary this is
0.110000000 (mantissa)
with an exponent of \(2^0\). However, IEEE-754 requires us to shift the mantissa to the left until the bit to the left of the period is set:
1.10000000 (mantissa, = 1.5)
And then we drop the 1.
, because it’s implicit.
To compensate for this, we set the exponent to \(2^{-1}\) (as 1.5 / 2 = 0.75).
In the binary representation, the 1.
part of the mantissa is implicit, so this
would be represented in IEEE-754 format as
Mantissa: .10000000 (= 0.5)
Exponent: 01111110 (= -1)
Sign: 0 (positive)
To recover its decimal value:
Find the decimal value of the mantissa (0.5) and add 1.0 to it. (= 1.5).
Multiply the mantissa’s decimal value by \(2^\text{exponent}\).
If the sign bit is 1, multiply the decimal value by -1.
Representing 0
You might notice that in the above representation, it is impossible to represent 0! I.e., if we do
Mantissa: .00000000 (= 0.0)
...
This is interpreted, with the implicit 1.
as 1.00000000
i.e., 1.
Zero is handled as a special case, by encoding it in the exponent: if the
exponent is 00000000
(= -127 in the biased representation) and mantissa is
also all 0s (= 1.0) then the decimal value is ±0, depending on the sign bit.
(IEEE-754 has both positive and negative zeros.)
Other values of the exponent are used for other special representations:
Exponent =
00000000
and mantissa not all 0s: “subnormal” values. These are values which are technically smaller than the format supports. They are not normally generated by any floating point math, but the FP system may use them internally to get more precise results.Exponent =
11111111
(= 128) and mantissa all 0s: ±Infinity.Exponent =
11111111
and mantissa not all 0s: Not-A-Number.
Because of these special cases, the effective range for exponents is -126 to +127.
Binary representation
The 32-bit float
format is arranged with 8 bits of exponent, 23 bits of
mantissa, and 1 sign bit. The layout is
s | exponent | mantissa | |||||||||||||||||||||||||||||
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
The exponent is
biased by +127, so an exponent of 0 would be stored as 127 (= 01111111b
).
This gives the exponent a range of -127 to +128.
Note that the sign bit is in the high bit, which means that we can perform some
floating-point manipulations relatively easily. E.g., to convert a
floating-point value to its absolute value (sign = 0) just bitwise-AND it with
0111…11
. Similarly, to determine whether a floating point value is negative,
just perform an operation which will load the sign flag (e.g., tst rax, rax
).
Mantissa
The mantissa is stored as a binary fraction; there is an implicit 1 and decimal point at its leftmost end. Thus, the value
10000000000000000000000b
actually represents in decimal
The 1.
is not stored with the value, it is implicit.
Just as when storing integer values in binary, we multiply the bits by increasing powers of 2 and then sum them up, here we multiply the bits by negative and decreasing powers of 2 and sum them up.
Exponent
The exponent effectively allows us to “shift” the bits in the mantissa to the left or the right, while keeping the (implicit) decimal point in the same place. For example, to represent the value 1.0, we would use a mantissa of
0.1000000000000000000000b = 0.5 (1.5 really)
but an exponent of
10000000b = 1 (unbiased)
1.5 * 21 = 3.0 or 11.0000...b
.
Normalization
The combination of exponent and mantissa means that many values have more than one representation. For example, another representation for 1.0 would be
m = 0.010000...b = 0.25
e = 10000001b = 2 (unbiased)
because 0.25 * 22 = 1.0, also. In order to give every representable
value a unique representation, floating-point values are usually normalized.
A normalized floating-point value is one which has been shifted so that its
first non-zero high bit takes the place of the (implicit) 1.
at the high
end. For example, 0.25 = 0.01b
, but normalized this would be shifted to
1.0b
(and then the 1.
would not be stored). While shifting the mantissa,
we also have to decrement the exponent to compensate. So 0.25 would actually
be stored as
00111110100000000000000000000000b
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
Note that the binary value of the mantissa is 0, but this actually represents 1.0. The 1 in 01000 has been shifted up to the implicit position. Any exact power/fraction of 2 will have a mantissa that is all 0 bits, and the exponent will be used to shift the (implicit) 1 into the right position.
0 is represented using the smallest possible exponent, -127 (all 0s in the biased representation). Thus, +0 is represented as
00000000000000000000000000000000b
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm
This has the advantage that all bits are 0, so it will set the zero flag as we would expect.
-0 is the same but with the sign bit set. (IEEE-754 has different representations for negative and positive 0).
“Subnormal” values can be used to store values that are technically smaller than should be possible. For example, the smallest normalized value is
m = 1.00000...b = 1
e = 00000000b = -127
= 2-127.
but using unnormalized values we can represent
m = 0.0000000000000000000001b
e = 00000001b
which is much smaller
“Subnormal” values are used internally to make calculations more accurate, but all the binary floating-point values we’ll ever deal with will be normalized.
Note that because 0 is a special case, there is no “non-normalized” representation of 0. There is a normalized representation of +0, and of -0, and that’s it.
Infinity, NaN, and other special values
IEEE-754 defines representations of positive and negative infinities, as well as “Not-a-Number” values.
Infinity is represented using the largest possible exponent (
11111111
) and a mantissa that is all 0s. The sign bit is used to distinguish positive and negative.NaN is represented using the largest possible exponent, and a mantissa with the highest bit set. The other bits can be used to express information about how the NaN came about (a “signaling” NaN).
There are a number of other unused bit patterns, which do not represent any valid floating-point number. All of these are treated as NaN
64-bit representations
The 64- and 80-bit representations use the same basic principles, only the sizes of the exponent and mantissa fields are different:
Size | C/C++ type | Exponent size | Mantissa size |
---|---|---|---|
32-bit | float |
8 bits | 23 bits |
64-bit | double |
11 | 52 bits |
The 80-bit long double
representation is slightly different: it uses 15 bits
of exponent, 63 bits bits of mantissa and the implicit 1 in the highest bit
of m is actually stored, as a bit between the exponent and mantissa. This
bit is referred to as the integer bit and it allows 0 to have a more natural
representation, as a mantissa of all 0s (which, in IEEE-754 formats, represents
a mantissa of 1.0).
The IEEE-754 standard defines 128- and 256-bit floating-point representations as well, but we won’t worry about them.
x87 Floating Point Instructions
The floating-point instructions that start with f
are part of the older
x87 floating-point instruction set. These use a separate set of floating-point
registers, ST(0)
through ST(7)
, which are treated as a stack. Most of
these instructions take no operands, and operate implicitly on the top
elements of this stack. In YASM, the FP registers are written as st0, st1
,
etc.
The reason for this odd organization is that originally, all floating-point was
handled by a physically-separate coprocessor CPU. The coprocessor was a
separate chip, connected to the bus, so it was able to “listen in”.
Floating-point instructions were dispatched by the CPU to the coprocessor. (If
it was present; the coprocessor was optional; trying to use FP code without it
would trigger an exception.) So the ST(x)
registers
didn’t “live” on the main CPU but on the coprocessor, and hence, in order to be
fast, floating-point calculations had to try to stay on the coprocessor as much
as possible.
These days, the FP register stack lives on the same CPU as everything else,
but for compatibility with old code, it is still treated as separate. A
disadvantage to using x87 instructions is that floating-point arguments to
functions are passed in the xmm
registers, so it takes some work to get them
into the ST
registers used by the x87 subsystem. However, there are some
operations which are only supported by the x87 subsystem, so it may be worth
the effort.
All of the x87 floating point instructions start with f
, and there is some
overlap between them and the xmm
-using instructions. If the operation you
want is supported by both, then the choice as to which to use it up to you;
both are equally optimized these days. Because the xmm
registers are used
for arguments/return values, it may take some extra work to move values to/from
the FP register stack.
Internally, the x87 subsystem stores every value as 80-bit precision. Values
are up/down converted when moving to/from memory as float
s or double
s. This
means that we get the extra precision “for free”. (This is true in general:
high- and low-precision floating-point operations take the same amount of time,
so our only concern is space usage.)
Initialization
emms
is used to initialize the floating point “coprocessor” by resetting its
state. The calling convention requires that the processor be in XMM mode upon
entry to any function, so to be on the safe side we’ll always call emms
in the beginning of any function using the x87 system.
Floating-point register stack
The 8 st
registers, numbered st0
through st7
, are implicitly treated as a
stack,
with st0
on top of the stack. Most x87 instructions use st0
implicitly
as the destination of their operation (e.g., fsub
writes its result into
st0
.) All ST registers are caller-saved. Although these registers are called
a “stack”, you can also rotate them, shifting the value in st0
into
st1
, st1
into st2
, and so forth, finally shifting st7
into
st0
. x87 floating point code basically boils down to managing these
registers efficiently.
st0
is always implicitly the top of the stack, with lower elements going
in ascending order.
Individual entries in the FP stack can either be “in-use” or “free”. Because there are always 8 entries available, “popping” from the stack doesn’t actually remove anything, it just marks the relevant entry as “free” and then rotates.
A pop effectively performs the following two steps:
Rotate all the registers, so
st0
now refers to the value formerly inst1
,1
refers to2
, and so forth.(Internally, the FP subsystem doesn’t actually move anything around; it just maintains a pointer to which register is currently on “top”; rotating just increments this.)
Mark the old
st0
as free.Attempting to pop when
st0
is already free, or in general, using a free registers as a source for any FP operation, is a stack underflow.
A push performs the following steps:
Rotate the stack the opposite direction.
Set
st0
to the value pushed, and mark it as “in-use”.Attempting to push into a register which is already “in-use” is a stack overflow and triggers a FP exception.
Pushing values onto the stack is called loading in x87 terminology; there are instructions for loading from memory in various formats:
fld dword [addr] ; Push float from memory
fld qword [addr] ; Push double from memory
fld st1 ; Push st1 to st0
fild dword [addr] ; Push signed dword integer from memory
fild qword [addr] ; Push signed qword integer from memory
fld1 ; Push +1.0
fldz ; Push +0.0
fldpi ; Push π
Note that all of these push onto the stack, so they not only set st0
to the
value, they shift the old st0
and everything below it down. There is no
way to copy a value from one of the other registers into st0
(fst
, below,
can copy st0
into any of the other registers).
(There are a few other constants which can be pushed; see here for the full list.)
Note that there are no instructions to load immediate floating-point values.
To load a floating point constant, other than those with dedicated instructions,
you have to store it in memory (typically in .data
or .rodata
) and then
load it from there. Some simple constants can be synthesized out of the 1,0
that the fld1
and fldz
instructions support.
Many instructions have a -p form which also pops the stack after performing
the operation. E.g.,
fst st3
copies ST(0)
to ST(3)
, while fstp st3
does the same thing, but
pops afterward.
To make operating on values lower on the stack more convenient, the fxch
instruction exchanges the value in another st
register with st0
. E.g.,
fxch st3 ; Swap st0 with st3
Writing to memory
Writing results from the FP stack back to memory is called a store.
fst
/fstp
is used to move floating-point values from st0
into other
locations in the stack, or from st0
into memory.
fst dword [addr] ; Copy float st0 to [addr]
fst st1 ; Copy st0 to st1
fstp st1 ; Copy st0 to st1 and then pop
Note that for stores-to-memory, the size qualifier (dword
or qword
) is
required, so that the assembler knows whether to copy as float
or double
.
We can also store into integers, by rounding or truncation:
fist dword [addr] ; Write float ST(0) as integer to addr
fistp dword [addr] ; Write float ST(0) as integer and then pop
fisttp qword [addr] ; Write double as trunc. integer and then pop
(64-bit stores can only be done in the -p popping variant.)
Rounding uses the current rounding mode, while truncation just discards any fractional part, effectively rounding towards 0.
Arithmetic operations
Most arithmetic operations are present in several forms:
Single argument (either a
st
register or a memory operand), withst0
as the implicit destination. E.g.,fmul st2 ; st0 = st0 * st2
Two argument, where the both are ST registers, and one of them is
st0
. E.g.,fmul st2, st0 ; st2 = st2 * st0
One of the two registers must be
st0
. You cannot (e.g.) multiplyst2
byst3
.-p
variants, which perform the operation and then popst0
off the stack. Typically this is only useful if the destination is notst0
.
There are no three-argument forms, so you cannot do st0 = st1 + st2
directly.
The main arithmetic operations are
fadd ; Addition
fsub ; Subtraction
fmul ; Multiplication
fdiv ; Division
All of these are available as either op stx
(meaning st0 = st0 op stx
),
op st0, stx
(meaning the same thing), or
op stx, st0
(meaning stx = stx op st0
).
Division and subtraction are also available in “reverse” (fsubr
, fdivr
)
forms, which instead of computing st0 = st0 - st(x)
compute
st0 = st(x) - st(0)
. There are also integer variants which read the second
operand as an integer from memory.
Floating-point function arguments
Because the ABI requires that normal floating point arguments be passed in the
xmm
registers, it takes some work to get them into the FP stack. We have to
move them into memory (typically, local space allocated on the stack) and then
from there load into the FP stack. E.g., in a function taking two
float
arguments:
func:
; xmm0 = argument 1
; xmm1 = argument 2
push rbp
mov rbp, rsp
movsd qword [rsp-8], xmm0
movsd qword [rsp-16], xmm1
; Switch to x87 mode
emms
fld qword [rsp-8] ; Push xmm0
fld qword [rsp-16] ; Push xmm1
; Now arg 1 is in st1, arg 2 is in st0
...
add rsp, 16
pop rbp
ret
(Note that we are using the “red zone”, the 128 bytes above rsp
, to do the
transfer, to avoid having to explicitly allocate space on the stack. This
kind of “temporary memory” usage is exactly why the red zone exists.)
Returning a floating-point value, when the value exists in the FP stack,
similarly involves a round-trip through memory in order to get it into
xmm0
. E.g., to return st0
:
fstp qword [rsp-8] ; pop from st0 onto the top of the stack
movsd xmm0, [rsp-8] ; load top of stack into xmm0
again using the red zone as temporary storage.
When returning from a function which uses x87 floating point, you are required to free (i.e., mark as empty) all FP registers, typically by doing
fstp st0
fstp st1
...
Floating-point comparisons
The fcomi
instruction compares two FP stack elements (the first of which
must be st0
) and updates
the flags register as usual as for an unsigned comparison. (The x87 system had its own internal flags register,
which the original fcom
instruction would update; it would then take some
work to get the x87 flags into the normal flags register. ) The normal
conditional jump instruction can then
be used. E.g., to copy the larger of st0
and st1
into st0
:
fcomi st0, st1 ; Compare st0, st1
jge .done ; Jump to .done if st0 >= st1
fld st1 ; Otherwise push st0 = st1
.done:
The fcmov__
family performs conditional (floating-point) moves, similar to the
cmov__
family.
E.g., to copy the larger of st0
and st1
into st1
:
fcomi st0, st1 ; Compare st0 with st1
fcmovl st1 ; Set st0 = st1 if st0 < st1
Ironically, there is no unconditional FP move instruction which does the
same thing (sets st0
to the value from another st
register)! All we can
do is push.
(The XMM floating-point system has a single instruction which finds the maximum of two values.)
fcomip
does the comparison and then pops.
fucomi
does an “unordered” comparison; the result of an unordered comparison
is different when one of the operands is Not-A-Number:
In an ordered comparison, comparisons between NaN and other numbers result in a floating-point exception.
In an unordered comparison, comparisons between NaN and other numbers always give a true result. I.e.,
NaN
counts as less than, greater than, equal to, and not equal to every other number!
Normally we will prefer the unordered comparison, as it is slightly faster and means we don’t have to deal with floating-point exceptions.
To compare to the constant 0.0, use ftst
, which compares st0
with 0. This
lets you avoid having to load the constant 0 into one of the other registers
for the comparison.
Mathematical functions
The x87 subsystem has single-instruction implementations of many common mathematical functions. (Most of these functions are missing from the XMM system, which is one reason why you might want to use the x87 system.)
; Trig functions
fsin ; st0 = sin(st0)
fcos ; st0 = cos(st0)
fsincos ; st0 = sin(st0), push cos(st0)
fptan ; st0 = tan(st0), push 1.0
fpatan ; st1 = atan(st1 / st0) and then pop
fsqrt ; st0 = sqrt(st0)
fprem1 ; st0 = fmod(st0, st1) (fractional remainder)
fabs ; st0 = |st0| (absolute value)
fyl2x ; st1 = st1 × log₂(st0)
f2xm1 ; st0 = 2^st0 - 1
Log/exponents base 2 are used because those are (relatively) easy to compute, given the binary format used for floating-point numbers. Other bases can be used by using the familiar logarithmic identities.
Switching between x87 and XMM modes
When using the x87 system, we have to issue the emms
instruction to clear
the XMM state. When using the XMM system, no specific “initialization”
instruction is necessary, although you should explicitly set the values of any
XMM registers you are using, and not assume that their values will be 0.
Floating-point example: computing π
Here we will use the well-known series approximation for π:
We’ll continue the series until the difference between successive values is
less than 0.000001 (i.e., 5 digits of precision), and then print the result
using printf
from the C standard library.
We’ll use the x87 floating-point system for this example, although because
printf
expects its floating-point arguments to be in the xmm
registers,
we will have to move the value being calculated from the x87 stack into memory,
and from there into xmm0
.
In C, the function we’re going to write looks like
double pi() {
double p = 0.0; // Current pi approximation
double pp = 0.0; // Previous pi approximation
double s = 1; // Sign: +1 or -1
double d = 1; // Denominator: 1,3,5,...
do {
pp = p;
p += s / d;
s *= -1;
d += 2;
} while(abs(p - pp) > 0.000001);
return 4 * p;
}
(Note that p - pp == s / d
.)
Although it’s possible, with some cleverness, to write this procedure so that
it runs entirely on the x87 stack, for maximum speed, the easiest way to write
it is to store the variables on the stack (in the red zone), and then load/store
them as needed. We will of course, have to save the floating point constants
2 and 0.000001 in the .data
section:
section .data
ZERO: dq 0.0
ONE: dq 1.0
TWO: dq 2.0
EPSILON: dq 0.000001
section .text
pi:
push rbp
mov rbp, rsp
; In the red zone
; [rsp - 8] = p = 0.0
; [rsp - 16] = s = 1
; [rsp - 24] = d = 1
; [rsp - 32] = s / d
mov rax, qword [ZERO]
mov qword [rsp - 8], rax
mov rax, qword [ONE]
mov qword [rsp - 16], rax
mov qword [rsp - 24], rax
.begin_loop:
; Update p += s / d
fld qword [rsp - 16] ; = s
fdiv qword [rsp - 24] ; = s / d
fst qword [rsp - 32] ; Save s / d for later
fadd qword [rsp - 8] ; = s / d + p
fstp qword [rsp - 8] ; Write back
ffree st0
; Update s = -s
fld qword [rsp - 16] ; st0 = s
fchs ; st0 = -st0
fst qword [rsp - 16] ; Write back
ffree st0
; Update d += 2
fld qword [TWO]
fld qword [rsp - 24]
fadd st0, st1 ; st0 += 2
fstp qword [rsp - 24] ; Write back
ffree st0
fld qword [rsp - 32]
fabs
fld qword [EPSILON] ; st0 = 0.000001, st1 = |s / d|
fucomi st1 ; Compare st0, st1
ffree st0 ; Remove st0, st1 from the stack
ffree st1
jb .begin_loop ; loop if 0.0000001 > |s / d|
; Multiply result by 4
fld qword [TWO]
fld qword [rsp - 8]
fmul st0, st1
fmul st0, st1
fstp qword [rsp - 8]
ffree st0
; Copy last p into xmm0
movsd xmm0, qword [rsp - 8]
pop rbp
ret
A note about clearing (ffree
) stack elements: it’s important to ensure
that the x87 stack never has more than 8 elements on it. If you push more
than 8 things without popping, ST(8)
will eventually wrap around to ST(0)
and the stack will get corrupted. This will often manifest as values “magically”
becoming NaNs. E.g., if we didn’t perform the ffree
instructions at the
end of the loop, the size of the stack would grow unbounded until it eventually
overwrote itself.
XMM Floating-point instructions
The modern floating point unit on the CPU, with its dedicated registers, shares resources with the vector (SIMD) unit (and with the old x87 unit). In fact, most modern floating-point is actually vectorized floating-point code. It’s rare for real code to operate on one value at a time, as we will do.
Registers
There are 16 floating-point registers, named xmm0
through xmm15
. The size
of these registers is actually 128 bits each, but we will use only the
32- or 64-bit portions, corresponding to the floating-point size we are
working with. Generally speaking, the suffix s
is used to indicate
single-precision (float
), while d
is used to indicate double-precision
(double
).
xmm0
through xmm7
are used for passing floating-point arguments.
xmm8-15
are temporary (caller-saved) registers.
Moving floating-point values
movss dest, src ; Move floats
movsd dest, src ; Move doubles
Two dedicated mov
instructions are used to move floating-point values around.
As usual, both operands must be of the same size, and the size must match
the instruction. The destination and source can be memory or (xmm
) register,
but both cannot be memory at the same time. Different from the normal
mov
instruction, the source cannot be an immediate; to load a constant
floating point value into a register, it must exist in memory already. Note
that register operands must be floating-point registers.
Storing floating-point values in .data
Use dd
(double-word) to store a 32-bit floating-point value, and use
dq
(quad-word) to store a 64-bit value. E.g.,
section .data
pi: dq 3.14159
Floating-point conversions
Instructions exists to convert between the 32- and 64-bit formats, and to convert integers to floating point values, and vice versa.
Instruction | Description |
---|---|
cvtss2sd dest, src |
Convert 32-bit to 64-bit floatsrc can be memory, float register, or a general-purpose register |
cvtsd2ss dest, src |
Convert 64-bit to 32-bit float |
cvtss2si dest, src |
Convert 32-bit float to 32-bit signed integerdest can be a float or general purpose register |
cvtsd2si dest, src |
Convert 64-bit float to 32-bit integer |
cvtsi2ss dest, src |
Convert 32-bit signed integer to 32-bit floatdest must be a float register |
cvtsi2ss dest, src |
Convert 32-bit signed integer to 64-bit float |
Arithmetic
addss dest, src ; dest += src (float)
addsd dest, src ; dest += src (double)
subss dest, src ; dest -= src (float)
subsd dest, src ; dest -= src (double)
mulss dest, src ; dest *= src (float)
mulsd dest, src ; dest *= src (double)
divss dest, src ; dest /= src (float)
divsd dest, src ; dest /= src (double)
Three-operand versions of all of these are available with a v
prefix:
vaddss dest, src1, src2 ; dest = src1 + src2
vaddsd dest, src1, src2 ; dest = src1 + src2
vsubss dest, src1, src2 ; dest = src1 + src2
vsubsd dest, src1, src2 ; dest = src1 + src2
vmulss dest, src1, src2 ; dest = src1 + src2
vmulsd dest, src1, src2 ; dest = src1 + src2
vdivss dest, src1, src2 ; dest = src1 + src2
vdivsd dest, src1, src2 ; dest = src1 + src2
The dest
and src
operands must be xmm
registers; src2
can be register
or memory. Sizes of all operands must be the same.
A number of more advanced mathematical operations are provided as instructions:
Instruction | Description |
---|---|
sqrtss dest, src sqrtsd dest, src |
dest = sqrt(src) |
rcpss dest, src |
dest = 1/src |
rsqrtss dest, src |
dest = 1/sqrt(src) |
maxss dest, src maxsd dest, src |
dest = maximum of dest, src |
minss dest, src minsd dest, src |
dest = minimum of dest, src |
roundss dest, src, mode |
Round src into dest using mode mode = 0 ties go to evenmode = 1 round downmode = 2 round upmode = 3 round toward 0 |
(But note that the trigonometric functions which were available using x87 are not available here.)
Floating-point comparisons
A special floating-point compare instruction is used to compare two
floating-point operands, however, the results are written into the normal
flags register, as if for an unsigned comparison. Thus, the conditional jump
instructions je
, jne
, ja
, jae
, jb
, and jbe
can be used to jump
if equal, not equal, greater-than, greater-than-or-equal, less-than, and
less-than-or-equal, respectively.
ucomiss dest, src ; Compare dest and src (dest must be float reg.),
ucomisd dest, src ; setting rif as for an unsigned comparison
The floating-point system possesses its own flags/status register called
mxcsr
. The flags in here, rather than being set by floating point
instructions, are set by us before floating-point instructions and control
global behavior such as rounding modes, whether divide-by-zero should produce
an exception or just infinity, etc.
As with x87, we are performing an unordered comparison here, meaning that any comparison in which one of the operands is Not-A-Number will give a true result.
Floating-point arguments to functions
The System-V calling convention specifies that the first eight floating-point
arguments to a function are passed in registers xmm0
through xmm7
. The
result, if any, is returned in xmm0
. All of the floating-point registers
are caller-preserved, which means you must push them onto the stack before
calling any function.
An important thing to note: if you’re using printf
, the %f
format specifier
actually requires a double
(64-bit) argument, and not a float
. When we use
printf
in the example, we’ll have to use cvtss2sd
to convert our float
value to double for printing.
Functions like printf
that take a variable number of arguments require an
additional change: the number of arguments in xmm
registers must be set in
al
.
Floating-point example: computing π
Here we will use the well-known series approximation for π:
We’ll continue the series until the difference between successive values is
less than 0.000001 (i.e., 5 digits of precision), and then print the result
using printf
from the C standard library.
In C, the function we’re going to write looks like
float pi() {
float p = 0.0; // Current pi approximation
float s = 1; // Sign: +1 or -1
float d = 1; // Denominator: 1,3,5,...
float rd; // 1 / d
do {
rd = 1 / d;
p += s * rd;
s *= -1;
d += 2;
} while(abs(sd) > 0.000001);
return 4 * p;
}
Unlike the x87 version, there are no instructions to flip the sign of a xmm
register, or to take the absolute value of an xmm
register. Hence, we jump
through a few hoops:
Flipping the sign corresponds to multiplying by -1.
Instead of taking the absolute value, we compute
1/d
and save its value for the comparison, before multiplying by the alternating sign.
Because the xmm
registers are used for both integer and floating point math,
it’s possible to fake both abs
and a sign-flip by directly modifying the bits
in the register:
SIGN_BIT: equ (1 << 63)
pxor xmm0, qword [SIGN_BIT] ; Flip sign bit
pandn xmm0, qword [SIGN_BIT] ; Clear sign bit
Translating this into assembly gives us a straightforward loop:
section .data
zero: dd 0.0
one: dd 1.0
two: dd 2.0
four: dd 4.0
negone: dd -1.0
limit: dd 0.000001
format: db "%f", 10, 0
section .text
extern printf
global main
main:
push rbp
mov rbp, rsp
;; Compute pi
call compute_pi
; Return value in xmm0
;; Print result
mov rdi, format
mov al, 1
cvtss2sd xmm0, xmm0 ; Convert to double for printf
call printf
mov rax, 0
pop rbp
ret
compute_pi:
push rbp
mov rbp, rsp
movss xmm7, dword [one] ; 1.0
movss xmm0, dword [zero] ; p = 0
movss xmm1, xmm7 ; s = 1
movss xmm2, xmm7 ; d = 1
; xmm3 = t
.loop:
movss xmm3, xmm7 ; t = 1
divss xmm3, xmm2 ; t /= d
vmulss xmm4, xmm1, xmm3 ; xmm4 = s * t
addss xmm0, xmm4 ; p += s * t
mulss xmm1, dword [negone] ; s *= -1
addss xmm2, dword [two] ; d += 2
ucomiss xmm3, dword [limit] ; while(t > limit)
ja .loop
; Result is in xmm0
mulss xmm0, dword [four]
pop rbp
ret
When we call printf
we have to make one small adjustment: functions that
take a variable number of arguments (like printf
) require that we set
al
to the number of arguments passed in xmm
registers. It doesn’t have
to be exact, but al
should be ≥ the number of xmm
registers used for
arguments. Until now, we’ve never passed anything in xmm
, so this didn’t
matter. Now, however, we have to set al = 1
before calling printf
, so that
printf
will know how many to use.
More Floating-point
We focus on the two floating-point formats which correspond to float
(32-bit)
and double
(64-bit).
FP Sizes: Most FP instructions exist in ss
(single-precision, i.e., float
)
and sd
(double-precision, double
) forms. E.g., addition is addss
for
32-bit, addsd
for 64-bit. We have to explicitly specify the size we want to
operate on, because the registers are all the same size (as opposed to
rax
, eax
, ax
, etc.). 32-bit corresponds to a dword
, while 64-bit is a
qword
.
Immediates: None of the FP instructions take immediate values. To get
constant values into a FP register, you will have to store it in your .data
section and then movss/movsd
it into the register. Alternatively, you could
store an integer immediate into one of the general-purpose registers, and
then use the cvtsi2**
instructions to convert it to floating-point.
Registers: Floating-point values are stored and manipulated exclusively
in the floating-point registers xmm0
-xmm15
. Almost all floating-point
instructions require at least the destination operand to be a FP register.
All FP registers are 128 bits wide.
Data movement: movss
,movsd
are used to move data into, between, and
out of the xmm
registers. Both operands must be the size specified in the
instruction (ss
or sd
). Memory-to-memory moves are not allowed.
Conversion: Conversion between single- and double-precision, as well as
between both and integers, is accomplished by the cvt*
family of
instructions:
Instruction | Description |
---|---|
cvtss2sd dest, src |
Convert 32-bit to 64-bit floatsrc can be memory, float register, or a general-purpose register |
cvtsd2ss dest, src |
Convert 64-bit to 32-bit float |
cvtss2si dest, src |
Convert 32-bit float to 32-bit signed integerdest can be a FP or general purpose register |
cvtsd2si dest, src |
Convert 64-bit float to 32-bit integer |
cvtsi2ss dest, src |
Convert 32-bit signed integer to 32-bit floatdest must be a FP register |
cvtsi2ss dest, src |
Convert 32-bit signed integer to 64-bit float |
The source and destination registers can be the same; you can
cvtss2sd xmm0, xmm0
to promote the value in xmm0
from float
to double
.
(Note: if you’re trying to printf
a FP register, printf
only accepts
double-precision inputs, so you’ll have to cvtss2sd
if you’re using
single-precision.)
Arithmetic: All arithmetic operations follow the pattern OPss dest, src
and have the interpretation dest = dest OP src
. There’s no weirdness about
multiplication or division.
addss dest, src ; dest += src (float)
addsd dest, src ; dest += src (double)
subss dest, src ; dest -= src (float)
subsd dest, src ; dest -= src (double)
mulss dest, src ; dest *= src (float)
mulsd dest, src ; dest *= src (double)
divss dest, src ; dest /= src (float)
divsd dest, src ; dest /= src (double)
All of these are also available in three-operand forms:
vaddss dest, src1, src2 ; dest = src1 + src2
vaddsd dest, src1, src2 ; dest = src1 + src2
vsubss dest, src1, src2 ; dest = src1 - src2
vsubsd dest, src1, src2 ; dest = src1 - src2
vmulss dest, src1, src2 ; dest = src1 * src2
vmulsd dest, src1, src2 ; dest = src1 * src2
vdivss dest, src1, src2 ; dest = src1 / src2
vdivsd dest, src1, src2 ; dest = src1 / src2
Other operations: Square-roots, minimum/maximum, reciprocal, and rounding are available as instructions:
Instruction | Description |
---|---|
sqrtss dest, src sqrtsd dest, src |
dest = sqrt(src) |
rcpss dest, src |
dest = 1/src |
rsqrtss dest, src |
dest = 1/sqrt(src) |
maxss dest, src maxsd dest, src |
dest = maximum of dest, src |
minss dest, src minsd dest, src |
dest = minimum of dest, src |
roundss dest, src, mode |
Round src to integer into dest using mode mode = 0 ties go to evenmode = 1 round downmode = 2 round upmode = 3 round toward 0 |
Reciprocal, and reciprocal-of-square-root are only available for 32-bit operands. The rounding mode must be an immediate.
Comparisons: Floating-point comparisons are done using ucomiss
,
ucomisd
. These write the results of the comparison into the normal flags
registers as if for a unsigned integer comparison. I.e., equality is the
e
condition code, greater-than is the a
condition code, etc. Normal
conditional jumps are used. (There are no conditional moves for floating-point
values.)
Functions: When calling a function which takes FP arguments, place the
first eight arguments in xmm0-xmm7
. Functions that return a FP result
should place it in xmm0
.
Variadic functions like printf
expect the number of floating-point
arguments to be stored in rax
.
Note that I misunderstood the emms
/femms
instruction: it is only required
if you are using MMX register/instructions, not the newer xmm
registers.
So you don’t have to worry about that.
Floating-point example: computing π
Here we will use the well-known series approximation for π:
We’ll continue the series until the difference between successive values is
less than 0.000001 (i.e., 5 digits of precision), and then print the result
using printf
from the C standard library.
In C, the function we’re going to write looks like
float pi() {
float p = 0.0; // Current pi approximation
float pp = 0.0; // Previous pi approximation
float s = 1; // Sign: +1 or -1
float d = 1; // Denominator: 1,3,5,...
do {
pp = p;
p += s / d;
s *= -1;
d += 2;
} while(abs(p - pp) > 0.000001);
return 4 * p;
}
I’m going to make a small change to eliminate the call to abs
. The difference
pp - p
is just the most recent term in the sum, so I’m going to save it
in another variable:
float pi() {
float p = 0.0; // Current pi approximation
float s = 1.0; // Sign: +1 or -1
float d = 1.0; // Denominator: 1,3,5,...
float t; // Last term in the sum
do {
t = 1 / d;
p += s * t;
s *= -1;
d += 2;
} while(t > 0.000001);
return 4 * p;
}
Translating this into assembly gives us a straightforward loop:
section .data
zero: dd 0.0
one: dd 1.0
two: dd 2.0
four: dd 4.0
negone: dd -1.0
limit: dd 0.000001
format: db "%f", 10, 0
section .text
extern printf
global main
main:
push rbp
mov rbp, rsp
;; Compute pi
call compute_pi
; Return value in xmm0
;; Print result
mov rdi, format
mov al, 1
cvtss2sd xmm0, xmm0 ; Convert to double for printf
call printf
mov rax, 0
pop rbp
ret
compute_pi:
push rbp
mov rbp, rsp
movss xmm7, dword [one] ; 1.0
movss xmm0, dword [zero] ; p = 0
movss xmm1, xmm7 ; s = 1
movss xmm2, xmm7 ; d = 1
; xmm3 = t
.loop:
movss xmm3, xmm7 ; t = 1
divss xmm3, xmm2 ; t /= d
vmulss xmm4, xmm1, xmm3 ; xmm4 = s * t
addss xmm0, xmm4 ; p += s * t
mulss xmm1, dword [negone] ; s *= -1
addss xmm2, dword [two] ; d += 2
ucomiss xmm3, dword [limit] ; while(t > limit)
ja .loop
; Result is in xmm0
mulss xmm0, dword [four]
pop rbp
ret
When we call printf
we have to make one small adjustment: functions that
take a variable number of arguments (like printf
) require that we set
al
to the number of arguments passed in xmm
registers. It doesn’t have
to be exact, but al
should be ≥ the number of xmm
registers used for
arguments. Until now, we’ve never passed anything in xmm
, so this didn’t
matter. Now, however, we have to set al = 1
before calling printf
, so that
printf
will know how many to use.
In-class group exercise
Now that we’ve implemented the easy way to compute π, I’ll let you work in groups to compute it faster, using a different approximation series:
This series converges much faster than the traditional one: while the
traditional series takes nearly 500,00 iterations to get just five digits of
precision, this series gets there in about 12. The other change we’re going to
make is for the approximation limit to be a parameter to the function, rather
than a fixed constant, and for the function to printf
its results after
each iteration, so we can see the convergence happening. Here’s a
skeleton program for you:
section .data
zero: dq 0.0
one: dq 1.0
two: dq 2.0
four: dq 4.0
negone: dq -1.0
limit: dq 0.000001
format: db "%f", 10, 0
section .text
extern printf
global main
main:
push rbp
mov rbp, rsp
;; Compute pi
movsd xmm0, qword [limit]
call compute_pi
; Return value in xmm0
;; Print result
mov rdi, format
mov al, 1
call printf
mov rax, 0
pop rbp
ret
compute_pi:
push rbp
mov rbp, rsp
; xmm0 = Approximation limit
; Return result in xmm0
; YOUR CODE HERE
pop rbp
ret
Note that I’ve changed all the constants to qwords
(i.e., double
s), so
don’t forget to use the double-precision versions of the various operations.
It should be possible to change the limit
in the .text
section to get more
(or less) precise approximations.