Floating-point

For historical reasons, x86-64 has two separate floating-point systems, with some overlap in capabilities between them, but some distinct abilities as well. The two systems are:

The old x87 coprocessor instruction set
The newer XMM vector processing instructions

Either way, floating point operations use a completely different set of registers, as well as a completely different set of operations. Floating-point values are represented in binary in a complete different way from either signed or unsigned values.

Floating point representation

There are three floating-point sizes/representations available to us, corresponding to float (32-bit), double (64-bit) and long double (80-bit, stored as 128-bit with 48 unused padding bits). The representations are similar, the only difference is the number of bits dedicated to each part of the number. The actual floating point format (which bit does what) is defined by an international standard called IEEE-754.

IEEE-754 floating-point formats represent fractional values as a combination of three fields:

A sign bit s, set if the value is negative
An exponent s, represented as a (biased) signed binary value. The value is stored as the actual exponent plus a fixed bias value b. For 32-bit floating point values, the bias is 127. This means that an exponent of 0 is stored internally as 01111111b, -1 would be 01111110b, 1 would be 10000000b, and so forth.
A fractional part called the mantissa, m, which is normally in the range [1,2) (i.e., ≥ 1 but < 2). In a normalized floating-point value, the fractional part is shifted to the left, so that the first set bit is shifted out, as there is almost always an implicit 1. bit to the left of the value. (Shifting the mantissa requires incrementing/decrementing the exponent to preserve the same value.)

0 is handled as a special case.

The value of a floating-point number using these fields is

$$(-1 s)\; (1+m) \;2^{e-b}$$

As an example, consider the value 0.75. In binary this is

0.110000000 (mantissa)

with an exponent of $2^0$. However, IEEE-754 requires us to shift the mantissa to the left until the bit to the left of the period is set:

1.10000000 (mantissa, = 1.5)

And then we drop the 1., because it’s implicit. To compensate for this, we set the exponent to $2^{-1}$ (as 1.5 / 2 = 0.75). In the binary representation, the 1. part of the mantissa is implicit, so this would be represented in IEEE-754 format as

Mantissa: .10000000 (= 0.5)
Exponent:  01111110 (= -1)
Sign:      0        (positive)

To recover its decimal value:

Find the decimal value of the mantissa (0.5) and add 1.0 to it. (= 1.5).
Multiply the mantissa’s decimal value by $2^\text{exponent}$.
If the sign bit is 1, multiply the decimal value by -1.

Representing 0

You might notice that in the above representation, it is impossible to represent 0! I.e., if we do

Mantissa: .00000000 (= 0.0)
...

This is interpreted, with the implicit 1. as 1.00000000 i.e., 1.

Zero is handled as a special case, by encoding it in the exponent: if the exponent is 00000000 (= -127 in the biased representation) and mantissa is also all 0s (= 1.0) then the decimal value is ±0, depending on the sign bit. (IEEE-754 has both positive and negative zeros.)

Other values of the exponent are used for other special representations:

Exponent = 00000000 and mantissa not all 0s: “subnormal” values. These are values which are technically smaller than the format supports. They are not normally generated by any floating point math, but the FP system may use them internally to get more precise results.
Exponent = 11111111 (= 128) and mantissa all 0s: ±Infinity.
Exponent = 11111111 and mantissa not all 0s: Not-A-Number.

Because of these special cases, the effective range for exponents is -126 to +127.

Binary representation

The 32-bit float format is arranged with 8 bits of exponent, 23 bits of mantissa, and 1 sign bit. The layout is

s	exponent								mantissa
31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0

The exponent is biased by +127, so an exponent of 0 would be stored as 127 (= 01111111b). This gives the exponent a range of -127 to +128.

Note that the sign bit is in the high bit, which means that we can perform some floating-point manipulations relatively easily. E.g., to convert a floating-point value to its absolute value (sign = 0) just bitwise-AND it with 0111…11. Similarly, to determine whether a floating point value is negative, just perform an operation which will load the sign flag (e.g., tst rax, rax).

Mantissa

The mantissa is stored as a binary fraction; there is an implicit 1 and decimal point at its leftmost end. Thus, the value

10000000000000000000000b

actually represents in decimal

$$1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + \cdots = 1.5$$

The 1. is not stored with the value, it is implicit.

Just as when storing integer values in binary, we multiply the bits by increasing powers of 2 and then sum them up, here we multiply the bits by negative and decreasing powers of 2 and sum them up.

Exponent

The exponent effectively allows us to “shift” the bits in the mantissa to the left or the right, while keeping the (implicit) decimal point in the same place. For example, to represent the value 1.0, we would use a mantissa of

0.1000000000000000000000b = 0.5 (1.5 really)

but an exponent of

10000000b = 1 (unbiased)

1.5 * 2¹ = 3.0 or 11.0000...b.

Normalization

The combination of exponent and mantissa means that many values have more than one representation. For example, another representation for 1.0 would be

m = 0.010000...b = 0.25
e = 10000001b    = 2 (unbiased)

because 0.25 * 2² = 1.0, also. In order to give every representable value a unique representation, floating-point values are usually normalized. A normalized floating-point value is one which has been shifted so that its first non-zero high bit takes the place of the (implicit) 1. at the high end. For example, 0.25 = 0.01b, but normalized this would be shifted to 1.0b (and then the 1. would not be stored). While shifting the mantissa, we also have to decrement the exponent to compensate. So 0.25 would actually be stored as

00111110100000000000000000000000b
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm

Note that the binary value of the mantissa is 0, but this actually represents 1.0. The 1 in 01000 has been shifted up to the implicit position. Any exact power/fraction of 2 will have a mantissa that is all 0 bits, and the exponent will be used to shift the (implicit) 1 into the right position.

0 is represented using the smallest possible exponent, -127 (all 0s in the biased representation). Thus, +0 is represented as

00000000000000000000000000000000b
seeeeeeeemmmmmmmmmmmmmmmmmmmmmmm

This has the advantage that all bits are 0, so it will set the zero flag as we would expect.

-0 is the same but with the sign bit set. (IEEE-754 has different representations for negative and positive 0).

“Subnormal” values can be used to store values that are technically smaller than should be possible. For example, the smallest normalized value is

m = 1.00000...b = 1
e = 00000000b = -127

= 2^-127.

but using unnormalized values we can represent

m = 0.0000000000000000000001b
e = 00000001b

which is much smaller

“Subnormal” values are used internally to make calculations more accurate, but all the binary floating-point values we’ll ever deal with will be normalized.

Note that because 0 is a special case, there is no “non-normalized” representation of 0. There is a normalized representation of +0, and of -0, and that’s it.

Infinity, NaN, and other special values

IEEE-754 defines representations of positive and negative infinities, as well as “Not-a-Number” values.

Infinity is represented using the largest possible exponent (11111111) and a mantissa that is all 0s. The sign bit is used to distinguish positive and negative.
NaN is represented using the largest possible exponent, and a mantissa with the highest bit set. The other bits can be used to express information about how the NaN came about (a “signaling” NaN).

There are a number of other unused bit patterns, which do not represent any valid floating-point number. All of these are treated as NaN

64-bit representations

The 64- and 80-bit representations use the same basic principles, only the sizes of the exponent and mantissa fields are different:

Size	C/C++ type	Exponent size	Mantissa size
32-bit	`float`	8 bits	23 bits
64-bit	`double`	11	52 bits

The 80-bit long double representation is slightly different: it uses 15 bits of exponent, 63 bits bits of mantissa and the implicit 1 in the highest bit of m is actually stored, as a bit between the exponent and mantissa. This bit is referred to as the integer bit and it allows 0 to have a more natural representation, as a mantissa of all 0s (which, in IEEE-754 formats, represents a mantissa of 1.0).

The IEEE-754 standard defines 128- and 256-bit floating-point representations as well, but we won’t worry about them.

x87 Floating Point Instructions

The floating-point instructions that start with f are part of the older x87 floating-point instruction set. These use a separate set of floating-point registers, ST(0) through ST(7), which are treated as a stack. Most of these instructions take no operands, and operate implicitly on the top elements of this stack. In YASM, the FP registers are written as st0, st1, etc.

The reason for this odd organization is that originally, all floating-point was handled by a physically-separate coprocessor CPU. The coprocessor was a separate chip, connected to the bus, so it was able to “listen in”. Floating-point instructions were dispatched by the CPU to the coprocessor. (If it was present; the coprocessor was optional; trying to use FP code without it would trigger an exception.) So the ST(x) registers didn’t “live” on the main CPU but on the coprocessor, and hence, in order to be fast, floating-point calculations had to try to stay on the coprocessor as much as possible.

These days, the FP register stack lives on the same CPU as everything else, but for compatibility with old code, it is still treated as separate. A disadvantage to using x87 instructions is that floating-point arguments to functions are passed in the xmm registers, so it takes some work to get them into the ST registers used by the x87 subsystem. However, there are some operations which are only supported by the x87 subsystem, so it may be worth the effort.

All of the x87 floating point instructions start with f, and there is some overlap between them and the xmm-using instructions. If the operation you want is supported by both, then the choice as to which to use it up to you; both are equally optimized these days. Because the xmm registers are used for arguments/return values, it may take some extra work to move values to/from the FP register stack.

Internally, the x87 subsystem stores every value as 80-bit precision. Values are up/down converted when moving to/from memory as floats or doubles. This means that we get the extra precision “for free”. (This is true in general: high- and low-precision floating-point operations take the same amount of time, so our only concern is space usage.)

Initialization

emms is used to initialize the floating point “coprocessor” by resetting its state. The calling convention requires that the processor be in XMM mode upon entry to any function, so to be on the safe side we’ll always call emms in the beginning of any function using the x87 system.

Floating-point register stack

The 8 st registers, numbered st0 through st7, are implicitly treated as a stack, with st0 on top of the stack. Most x87 instructions use st0 implicitly as the destination of their operation (e.g., fsub writes its result into st0.) All ST registers are caller-saved. Although these registers are called a “stack”, you can also rotate them, shifting the value in st0 into st1, st1 into st2, and so forth, finally shifting st7 into st0. x87 floating point code basically boils down to managing these registers efficiently.

st0 is always implicitly the top of the stack, with lower elements going in ascending order.

Individual entries in the FP stack can either be “in-use” or “free”. Because there are always 8 entries available, “popping” from the stack doesn’t actually remove anything, it just marks the relevant entry as “free” and then rotates.

A pop effectively performs the following two steps:

Rotate all the registers, so st0 now refers to the value formerly in st1, 1 refers to 2, and so forth.

(Internally, the FP subsystem doesn’t actually move anything around; it just maintains a pointer to which register is currently on “top”; rotating just increments this.)
Mark the old st0 as free.

Attempting to pop when st0 is already free, or in general, using a free registers as a source for any FP operation, is a stack underflow.

A push performs the following steps:

Rotate the stack the opposite direction.
Set st0 to the value pushed, and mark it as “in-use”.

Attempting to push into a register which is already “in-use” is a stack overflow and triggers a FP exception.

Pushing values onto the stack is called loading in x87 terminology; there are instructions for loading from memory in various formats:

fld  dword [addr]   ; Push float from memory
fld  qword [addr]   ; Push double from memory
fld  st1            ; Push st1 to st0

fild dword [addr]   ; Push signed dword integer from memory
fild qword [addr]   ; Push signed qword integer from memory

fld1        ; Push +1.0
fldz        ; Push +0.0
fldpi       ; Push π

Note that all of these push onto the stack, so they not only set st0 to the value, they shift the old st0 and everything below it down. There is no way to copy a value from one of the other registers into st0 (fst, below, can copy st0 into any of the other registers).

(There are a few other constants which can be pushed; see here for the full list.)

Note that there are no instructions to load immediate floating-point values. To load a floating point constant, other than those with dedicated instructions, you have to store it in memory (typically in .data or .rodata) and then load it from there. Some simple constants can be synthesized out of the 1,0 that the fld1 and fldz instructions support.

Many instructions have a -p form which also pops the stack after performing the operation. E.g., fst st3 copies ST(0) to ST(3), while fstp st3 does the same thing, but pops afterward.

To make operating on values lower on the stack more convenient, the fxch instruction exchanges the value in another st register with st0. E.g.,

fxch st3    ; Swap st0 with st3

Writing to memory

Writing results from the FP stack back to memory is called a store.

fst/fstp is used to move floating-point values from st0 into other locations in the stack, or from st0 into memory.

fst  dword [addr]  ; Copy float st0 to [addr]
fst  st1           ; Copy st0 to st1
fstp st1           ; Copy st0 to st1 and then pop

Note that for stores-to-memory, the size qualifier (dword or qword) is required, so that the assembler knows whether to copy as float or double.

We can also store into integers, by rounding or truncation:

fist   dword [addr]     ; Write float ST(0) as integer to addr
fistp  dword [addr]     ; Write float ST(0) as integer and then pop
fisttp qword [addr]     ; Write double as trunc. integer and then pop

(64-bit stores can only be done in the -p popping variant.)

Rounding uses the current rounding mode, while truncation just discards any fractional part, effectively rounding towards 0.

Arithmetic operations

Most arithmetic operations are present in several forms:

Single argument (either a st register or a memory operand), with st0 as the implicit destination. E.g.,
```
 fmul st2       ; st0 = st0 * st2
```
Two argument, where the both are ST registers, and one of them is st0. E.g.,
```
 fmul st2, st0  ; st2 = st2 * st0
```
One of the two registers must be st0. You cannot (e.g.) multiply st2 by st3.
-p variants, which perform the operation and then pop st0 off the stack. Typically this is only useful if the destination is not st0.

There are no three-argument forms, so you cannot do st0 = st1 + st2 directly.

The main arithmetic operations are

fadd      ; Addition
fsub      ; Subtraction
fmul      ; Multiplication
fdiv      ; Division

All of these are available as either op stx (meaning st0 = st0 op stx), op st0, stx (meaning the same thing), or op stx, st0 (meaning stx = stx op st0).

Division and subtraction are also available in “reverse” (fsubr, fdivr) forms, which instead of computing st0 = st0 - st(x) compute st0 = st(x) - st(0). There are also integer variants which read the second operand as an integer from memory.

Floating-point function arguments

Because the ABI requires that normal floating point arguments be passed in the xmm registers, it takes some work to get them into the FP stack. We have to move them into memory (typically, local space allocated on the stack) and then from there load into the FP stack. E.g., in a function taking two float arguments:

func:
  ; xmm0 = argument 1
  ; xmm1 = argument 2

  push rbp
  mov rbp, rsp  

  movsd qword [rsp-8],  xmm0
  movsd qword [rsp-16], xmm1

  ; Switch to x87 mode
  emms

  fld qword [rsp-8]   ; Push xmm0 
  fld qword [rsp-16]  ; Push xmm1

  ; Now arg 1 is in st1, arg 2 is in st0

  ...

  add rsp, 16
  pop rbp
  ret

(Note that we are using the “red zone”, the 128 bytes above rsp, to do the transfer, to avoid having to explicitly allocate space on the stack. This kind of “temporary memory” usage is exactly why the red zone exists.)

Returning a floating-point value, when the value exists in the FP stack, similarly involves a round-trip through memory in order to get it into xmm0. E.g., to return st0:

fstp    qword [rsp-8]   ; pop from st0 onto the top of the stack
movsd   xmm0, [rsp-8]   ; load top of stack into xmm0

again using the red zone as temporary storage.

When returning from a function which uses x87 floating point, you are required to free (i.e., mark as empty) all FP registers, typically by doing

fstp st0
fstp st1
...

Floating-point comparisons

The fcomi instruction compares two FP stack elements (the first of which must be st0) and updates the flags register as usual as for an unsigned comparison. (The x87 system had its own internal flags register, which the original fcom instruction would update; it would then take some work to get the x87 flags into the normal flags register. ) The normal conditional jump instruction can then be used. E.g., to copy the larger of st0 and st1 into st0:

  fcomi st0, st1   ; Compare st0, st1
  jge  .done       ; Jump to .done if st0 >= st1
  fld  st1         ; Otherwise push st0 = st1

.done:

The fcmov__ family performs conditional (floating-point) moves, similar to the cmov__ family.

E.g., to copy the larger of st0 and st1 into st1:

fcomi   st0, st1     ; Compare st0 with st1
fcmovl  st1          ; Set st0 = st1 if st0 < st1

Ironically, there is no unconditional FP move instruction which does the same thing (sets st0 to the value from another st register)! All we can do is push.

(The XMM floating-point system has a single instruction which finds the maximum of two values.)

fcomip does the comparison and then pops.

fucomi does an “unordered” comparison; the result of an unordered comparison is different when one of the operands is Not-A-Number:

In an ordered comparison, comparisons between NaN and other numbers result in a floating-point exception.
In an unordered comparison, comparisons between NaN and other numbers always give a true result. I.e., NaN counts as less than, greater than, equal to, and not equal to every other number!

Normally we will prefer the unordered comparison, as it is slightly faster and means we don’t have to deal with floating-point exceptions.

To compare to the constant 0.0, use ftst, which compares st0 with 0. This lets you avoid having to load the constant 0 into one of the other registers for the comparison.

Mathematical functions

The x87 subsystem has single-instruction implementations of many common mathematical functions. (Most of these functions are missing from the XMM system, which is one reason why you might want to use the x87 system.)

; Trig functions
fsin         ; st0 = sin(st0)
fcos         ; st0 = cos(st0)
fsincos      ; st0 = sin(st0), push cos(st0)
fptan        ; st0 = tan(st0), push 1.0
fpatan       ; st1 = atan(st1 / st0) and then pop

fsqrt        ; st0 = sqrt(st0)

fprem1       ; st0 = fmod(st0, st1) (fractional remainder)

fabs         ; st0 = |st0| (absolute value)

fyl2x        ; st1 = st1 × log₂(st0)
f2xm1        ; st0 = 2^st0 - 1

Log/exponents base 2 are used because those are (relatively) easy to compute, given the binary format used for floating-point numbers. Other bases can be used by using the familiar logarithmic identities.

Switching between x87 and XMM modes

When using the x87 system, we have to issue the emms instruction to clear the XMM state. When using the XMM system, no specific “initialization” instruction is necessary, although you should explicitly set the values of any XMM registers you are using, and not assume that their values will be 0.

Floating-point example: computing π

Here we will use the well-known series approximation for π:

$$\pi = 4 \sum_{i=0} \frac{-1^i}{2i + 1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots$$

We’ll continue the series until the difference between successive values is less than 0.000001 (i.e., 5 digits of precision), and then print the result using printf from the C standard library.

We’ll use the x87 floating-point system for this example, although because printf expects its floating-point arguments to be in the xmm registers, we will have to move the value being calculated from the x87 stack into memory, and from there into xmm0.

In C, the function we’re going to write looks like

double pi() {
    double p = 0.0;   // Current pi approximation
    double pp = 0.0;  // Previous pi approximation
    double s = 1;     // Sign: +1 or -1
    double d = 1;     // Denominator: 1,3,5,...

    do {
        pp = p;        
        p += s / d;
        s *= -1;
        d += 2;        
    } while(abs(p - pp) > 0.000001);

    return 4 * p;
}

(Note that p - pp == s / d.)

Although it’s possible, with some cleverness, to write this procedure so that it runs entirely on the x87 stack, for maximum speed, the easiest way to write it is to store the variables on the stack (in the red zone), and then load/store them as needed. We will of course, have to save the floating point constants 2 and 0.000001 in the .data section:

section .data

ZERO:       dq      0.0
ONE:        dq      1.0
TWO:        dq      2.0
EPSILON:    dq      0.000001

section .text

pi:
    push    rbp
    mov     rbp, rsp

    ; In the red zone
    ; [rsp - 8]  = p = 0.0
    ; [rsp - 16] = s = 1
    ; [rsp - 24] = d = 1
    ; [rsp - 32] = s / d

    mov     rax, qword [ZERO]
    mov     qword [rsp - 8], rax

    mov     rax, qword [ONE]
    mov     qword [rsp - 16], rax
    mov     qword [rsp - 24], rax

.begin_loop:
    ; Update p += s / d
    fld     qword [rsp - 16]       ; = s
    fdiv    qword [rsp - 24]       ; = s / d
    fst     qword [rsp - 32]       ; Save s / d for later
    fadd    qword [rsp - 8]        ; = s / d + p
    fstp    qword [rsp - 8]        ; Write back 
    ffree   st0

    ; Update s = -s
    fld     qword [rsp - 16]        ; st0 = s
    fchs                            ; st0 = -st0
    fst     qword [rsp - 16]        ; Write back
    ffree st0

    ; Update d += 2
    fld     qword [TWO]
    fld     qword [rsp - 24]
    fadd    st0, st1                ; st0 += 2
    fstp    qword [rsp - 24]        ; Write back
    ffree   st0

    fld     qword [rsp - 32]
    fabs
    fld     qword [EPSILON]         ; st0 = 0.000001, st1 = |s / d|
    fucomi  st1                     ; Compare st0, st1
    ffree   st0                     ; Remove st0, st1 from the stack
    ffree   st1

    jb .begin_loop                  ; loop if 0.0000001 > |s / d|

    ; Multiply result by 4
    fld qword [TWO]
    fld qword [rsp - 8]
    fmul st0, st1
    fmul st0, st1
    fstp qword [rsp - 8]
    ffree st0    

    ; Copy last p into xmm0
    movsd xmm0, qword [rsp - 8]

    pop rbp
    ret

A note about clearing (ffree) stack elements: it’s important to ensure that the x87 stack never has more than 8 elements on it. If you push more than 8 things without popping, ST(8) will eventually wrap around to ST(0) and the stack will get corrupted. This will often manifest as values “magically” becoming NaNs. E.g., if we didn’t perform the ffree instructions at the end of the loop, the size of the stack would grow unbounded until it eventually overwrote itself.

XMM Floating-point instructions

The modern floating point unit on the CPU, with its dedicated registers, shares resources with the vector (SIMD) unit (and with the old x87 unit). In fact, most modern floating-point is actually vectorized floating-point code. It’s rare for real code to operate on one value at a time, as we will do.

Registers

There are 16 floating-point registers, named xmm0 through xmm15. The size of these registers is actually 128 bits each, but we will use only the 32- or 64-bit portions, corresponding to the floating-point size we are working with. Generally speaking, the suffix s is used to indicate single-precision (float), while d is used to indicate double-precision (double).

xmm0 through xmm7 are used for passing floating-point arguments. xmm8-15 are temporary (caller-saved) registers.

Moving floating-point values

movss dest, src         ; Move floats
movsd dest, src         ; Move doubles

Two dedicated mov instructions are used to move floating-point values around. As usual, both operands must be of the same size, and the size must match the instruction. The destination and source can be memory or (xmm) register, but both cannot be memory at the same time. Different from the normal mov instruction, the source cannot be an immediate; to load a constant floating point value into a register, it must exist in memory already. Note that register operands must be floating-point registers.

Storing floating-point values in `.data`

Use dd (double-word) to store a 32-bit floating-point value, and use dq (quad-word) to store a 64-bit value. E.g.,

section .data

pi:     dq      3.14159

Floating-point conversions

Instructions exists to convert between the 32- and 64-bit formats, and to convert integers to floating point values, and vice versa.

Instruction	Description
`cvtss2sd dest, src`	Convert 32-bit to 64-bit float `src` can be memory, float register, or a general-purpose register
`cvtsd2ss dest, src`	Convert 64-bit to 32-bit float
`cvtss2si dest, src`	Convert 32-bit float to 32-bit signed integer `dest` can be a float or general purpose register
`cvtsd2si dest, src`	Convert 64-bit float to 32-bit integer
`cvtsi2ss dest, src`	Convert 32-bit signed integer to 32-bit float `dest` must be a float register
`cvtsi2ss dest, src`	Convert 32-bit signed integer to 64-bit float

Arithmetic

addss dest, src         ; dest += src (float)
addsd dest, src         ; dest += src (double)
subss dest, src         ; dest -= src (float)
subsd dest, src         ; dest -= src (double)
mulss dest, src         ; dest *= src (float)
mulsd dest, src         ; dest *= src (double)
divss dest, src         ; dest /= src (float)
divsd dest, src         ; dest /= src (double)

Three-operand versions of all of these are available with a v prefix:

vaddss dest, src1, src2  ; dest = src1 + src2
vaddsd dest, src1, src2  ; dest = src1 + src2
vsubss dest, src1, src2  ; dest = src1 + src2
vsubsd dest, src1, src2  ; dest = src1 + src2
vmulss dest, src1, src2  ; dest = src1 + src2
vmulsd dest, src1, src2  ; dest = src1 + src2
vdivss dest, src1, src2  ; dest = src1 + src2
vdivsd dest, src1, src2  ; dest = src1 + src2

The dest and src operands must be xmm registers; src2 can be register or memory. Sizes of all operands must be the same.

A number of more advanced mathematical operations are provided as instructions:

Instruction	Description
`sqrtss dest, src` `sqrtsd dest, src`	`dest = sqrt(src)`
`rcpss dest, src`	`dest = 1/src`
`rsqrtss dest, src`	`dest = 1/sqrt(src)`
`maxss dest, src` `maxsd dest, src`	`dest =`maximum of `dest, src`
`minss dest, src` `minsd dest, src`	`dest =`minimum of `dest, src`
`roundss dest, src, mode`	Round `src` into `dest` using `mode` `mode = 0` ties go to even `mode = 1` round down `mode = 2` round up `mode = 3` round toward 0

(But note that the trigonometric functions which were available using x87 are not available here.)

Floating-point comparisons

A special floating-point compare instruction is used to compare two floating-point operands, however, the results are written into the normal flags register, as if for an unsigned comparison. Thus, the conditional jump instructions je, jne, ja, jae, jb, and jbe can be used to jump if equal, not equal, greater-than, greater-than-or-equal, less-than, and less-than-or-equal, respectively.

ucomiss dest, src           ; Compare dest and src (dest must be float reg.),
ucomisd dest, src           ; setting rif as for an unsigned comparison

The floating-point system possesses its own flags/status register called mxcsr. The flags in here, rather than being set by floating point instructions, are set by us before floating-point instructions and control global behavior such as rounding modes, whether divide-by-zero should produce an exception or just infinity, etc.

As with x87, we are performing an unordered comparison here, meaning that any comparison in which one of the operands is Not-A-Number will give a true result.

Floating-point arguments to functions

The System-V calling convention specifies that the first eight floating-point arguments to a function are passed in registers xmm0 through xmm7. The result, if any, is returned in xmm0. All of the floating-point registers are caller-preserved, which means you must push them onto the stack before calling any function.

An important thing to note: if you’re using printf, the %f format specifier actually requires a double (64-bit) argument, and not a float. When we use printf in the example, we’ll have to use cvtss2sd to convert our float value to double for printing.

Functions like printf that take a variable number of arguments require an additional change: the number of arguments in xmm registers must be set in al.

Floating-point example: computing π

Here we will use the well-known series approximation for π:

$$\pi = 4 \sum_{i=0} \frac{-1^i}{2i + 1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots$$

We’ll continue the series until the difference between successive values is less than 0.000001 (i.e., 5 digits of precision), and then print the result using printf from the C standard library.

In C, the function we’re going to write looks like

float pi() {
    float p = 0.0;   // Current pi approximation    
    float s = 1;     // Sign: +1 or -1
    float d = 1;     // Denominator: 1,3,5,...
    float rd;        // 1 / d

    do {
        rd = 1 / d;
        p += s * rd;
        s *= -1;
        d += 2;        
    } while(abs(sd) > 0.000001);

    return 4 * p;
}

Unlike the x87 version, there are no instructions to flip the sign of a xmm register, or to take the absolute value of an xmm register. Hence, we jump through a few hoops:

Flipping the sign corresponds to multiplying by -1.
Instead of taking the absolute value, we compute 1/d and save its value for the comparison, before multiplying by the alternating sign.

Because the xmm registers are used for both integer and floating point math, it’s possible to fake both abs and a sign-flip by directly modifying the bits in the register:

SIGN_BIT:   equ         (1 << 63)

    pxor    xmm0, qword [SIGN_BIT]      ; Flip sign bit
    pandn   xmm0, qword [SIGN_BIT]      ; Clear sign bit

Translating this into assembly gives us a straightforward loop:

section .data

zero:   dd      0.0
one:    dd      1.0
two:    dd      2.0
four:   dd      4.0
negone: dd      -1.0
limit:  dd      0.000001


format: db      "%f", 10, 0

section .text

extern printf

global main
main:

    push rbp
    mov rbp, rsp

    ;; Compute pi    
    call compute_pi
    ; Return value in xmm0  

    ;; Print result
    mov rdi, format
    mov al, 1
    cvtss2sd xmm0, xmm0 ; Convert to double for printf
    call printf

    mov rax, 0
    pop rbp
    ret

compute_pi:
    push rbp
    mov rbp, rsp

    movss xmm7, dword [one]  ; 1.0
    movss xmm0, dword [zero] ; p = 0
    movss xmm1, xmm7   ; s = 1
    movss xmm2, xmm7   ; d = 1
    ; xmm3 = t

.loop:
    movss xmm3, xmm7                ; t = 1
    divss xmm3, xmm2                ; t /= d
    vmulss xmm4, xmm1, xmm3         ; xmm4 = s * t
    addss xmm0, xmm4                ; p += s * t
    mulss xmm1, dword [negone]      ; s *= -1
    addss xmm2, dword [two]         ; d += 2

    ucomiss xmm3, dword [limit]     ; while(t > limit)
    ja .loop

    ; Result is in xmm0
    mulss xmm0, dword [four]

    pop rbp
    ret

When we call printf we have to make one small adjustment: functions that take a variable number of arguments (like printf) require that we set al to the number of arguments passed in xmm registers. It doesn’t have to be exact, but al should be ≥ the number of xmm registers used for arguments. Until now, we’ve never passed anything in xmm, so this didn’t matter. Now, however, we have to set al = 1 before calling printf, so that printf will know how many to use.

More Floating-point

We focus on the two floating-point formats which correspond to float (32-bit) and double (64-bit).

FP Sizes: Most FP instructions exist in ss (single-precision, i.e., float) and sd (double-precision, double) forms. E.g., addition is addss for 32-bit, addsd for 64-bit. We have to explicitly specify the size we want to operate on, because the registers are all the same size (as opposed to rax, eax, ax, etc.). 32-bit corresponds to a dword, while 64-bit is a qword.

Immediates: None of the FP instructions take immediate values. To get constant values into a FP register, you will have to store it in your .data section and then movss/movsd it into the register. Alternatively, you could store an integer immediate into one of the general-purpose registers, and then use the cvtsi2** instructions to convert it to floating-point.

Registers: Floating-point values are stored and manipulated exclusively in the floating-point registers xmm0-xmm15. Almost all floating-point instructions require at least the destination operand to be a FP register. All FP registers are 128 bits wide.

Data movement: movss,movsd are used to move data into, between, and out of the xmm registers. Both operands must be the size specified in the instruction (ss or sd). Memory-to-memory moves are not allowed.

Conversion: Conversion between single- and double-precision, as well as between both and integers, is accomplished by the cvt* family of instructions:

Instruction	Description
`cvtss2sd dest, src`	Convert 32-bit to 64-bit float `src` can be memory, float register, or a general-purpose register
`cvtsd2ss dest, src`	Convert 64-bit to 32-bit float
`cvtss2si dest, src`	Convert 32-bit float to 32-bit signed integer `dest` can be a FP or general purpose register
`cvtsd2si dest, src`	Convert 64-bit float to 32-bit integer
`cvtsi2ss dest, src`	Convert 32-bit signed integer to 32-bit float `dest` must be a FP register
`cvtsi2ss dest, src`	Convert 32-bit signed integer to 64-bit float

The source and destination registers can be the same; you can cvtss2sd xmm0, xmm0 to promote the value in xmm0 from float to double.

(Note: if you’re trying to printf a FP register, printf only accepts double-precision inputs, so you’ll have to cvtss2sd if you’re using single-precision.)

Arithmetic: All arithmetic operations follow the pattern OPss dest, src and have the interpretation dest = dest OP src. There’s no weirdness about multiplication or division.

addss dest, src         ; dest += src (float)
addsd dest, src         ; dest += src (double)
subss dest, src         ; dest -= src (float)
subsd dest, src         ; dest -= src (double)
mulss dest, src         ; dest *= src (float)
mulsd dest, src         ; dest *= src (double)
divss dest, src         ; dest /= src (float)
divsd dest, src         ; dest /= src (double)

All of these are also available in three-operand forms:

vaddss dest, src1, src2  ; dest = src1 + src2
vaddsd dest, src1, src2  ; dest = src1 + src2
vsubss dest, src1, src2  ; dest = src1 - src2
vsubsd dest, src1, src2  ; dest = src1 - src2
vmulss dest, src1, src2  ; dest = src1 * src2
vmulsd dest, src1, src2  ; dest = src1 * src2
vdivss dest, src1, src2  ; dest = src1 / src2
vdivsd dest, src1, src2  ; dest = src1 / src2

Other operations: Square-roots, minimum/maximum, reciprocal, and rounding are available as instructions:

Instruction	Description
`sqrtss dest, src` `sqrtsd dest, src`	`dest = sqrt(src)`
`rcpss dest, src`	`dest = 1/src`
`rsqrtss dest, src`	`dest = 1/sqrt(src)`
`maxss dest, src` `maxsd dest, src`	`dest =`maximum of `dest, src`
`minss dest, src` `minsd dest, src`	`dest =`minimum of `dest, src`
`roundss dest, src, mode`	Round `src` to integer into `dest` using `mode` `mode = 0` ties go to even `mode = 1` round down `mode = 2` round up `mode = 3` round toward 0

Reciprocal, and reciprocal-of-square-root are only available for 32-bit operands. The rounding mode must be an immediate.

Comparisons: Floating-point comparisons are done using ucomiss, ucomisd. These write the results of the comparison into the normal flags registers as if for a unsigned integer comparison. I.e., equality is the e condition code, greater-than is the a condition code, etc. Normal conditional jumps are used. (There are no conditional moves for floating-point values.)

Functions: When calling a function which takes FP arguments, place the first eight arguments in xmm0-xmm7. Functions that return a FP result should place it in xmm0. Variadic functions like printf expect the number of floating-point arguments to be stored in rax.

Note that I misunderstood the emms/femms instruction: it is only required if you are using MMX register/instructions, not the newer xmm registers. So you don’t have to worry about that.

Floating-point example: computing π

Here we will use the well-known series approximation for π:

$$\pi = 4 \sum_{i=0} \frac{-1^i}{2i + 1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots$$

We’ll continue the series until the difference between successive values is less than 0.000001 (i.e., 5 digits of precision), and then print the result using printf from the C standard library.

In C, the function we’re going to write looks like

float pi() {
    float p = 0.0;   // Current pi approximation
    float pp = 0.0;  // Previous pi approximation
    float s = 1;     // Sign: +1 or -1
    float d = 1;     // Denominator: 1,3,5,...

    do {
        pp = p;        
        p += s / d;
        s *= -1;
        d += 2;        
    } while(abs(p - pp) > 0.000001);

    return 4 * p;
}

I’m going to make a small change to eliminate the call to abs. The difference pp - p is just the most recent term in the sum, so I’m going to save it in another variable:

float pi() {
    float p = 0.0;   // Current pi approximation
    float s = 1.0;   // Sign: +1 or -1
    float d = 1.0;   // Denominator: 1,3,5,...
    float t;         // Last term in the sum

    do {       
        t = 1 / d;      
        p += s * t;
        s *= -1;
        d += 2;        
    } while(t > 0.000001);

    return 4 * p;
}

Translating this into assembly gives us a straightforward loop:

section .data

zero:   dd      0.0
one:    dd      1.0
two:    dd      2.0
four:   dd      4.0
negone: dd      -1.0
limit:  dd      0.000001


format: db      "%f", 10, 0

section .text

extern printf

global main
main:

    push rbp
    mov rbp, rsp

    ;; Compute pi    
    call compute_pi
    ; Return value in xmm0  

    ;; Print result
    mov rdi, format
    mov al, 1
    cvtss2sd xmm0, xmm0 ; Convert to double for printf
    call printf

    mov rax, 0
    pop rbp
    ret

compute_pi:
    push rbp
    mov rbp, rsp

    movss xmm7, dword [one]  ; 1.0
    movss xmm0, dword [zero] ; p = 0
    movss xmm1, xmm7   ; s = 1
    movss xmm2, xmm7   ; d = 1
    ; xmm3 = t

.loop:
    movss xmm3, xmm7          ; t = 1
    divss xmm3, xmm2          ; t /= d
    vmulss xmm4, xmm1, xmm3   ; xmm4 = s * t
    addss xmm0, xmm4          ; p += s * t
    mulss xmm1, dword [negone]      ; s *= -1
    addss xmm2, dword [two]         ; d += 2

    ucomiss xmm3, dword [limit]     ; while(t > limit)
    ja .loop

    ; Result is in xmm0
    mulss xmm0, dword [four]

    pop rbp
    ret

In-class group exercise

Now that we’ve implemented the easy way to compute π, I’ll let you work in groups to compute it faster, using a different approximation series:

$$\pi = 3 + \sum_{i=0}^\infty (-1)^i \frac{4}{(2i + 2) (2i+3) (2i+4)} = 3 + \frac{4}{2 \times 3 \times 4} - \frac{4}{4 \times 5 \times 6} + \cdots$$

This series converges much faster than the traditional one: while the traditional series takes nearly 500,00 iterations to get just five digits of precision, this series gets there in about 12. The other change we’re going to make is for the approximation limit to be a parameter to the function, rather than a fixed constant, and for the function to printf its results after each iteration, so we can see the convergence happening. Here’s a skeleton program for you:

section .data

zero:   dq      0.0
one:    dq      1.0
two:    dq      2.0
four:   dq      4.0
negone: dq      -1.0
limit:  dq      0.000001

format: db      "%f", 10, 0

section .text

extern printf

global main
main:

    push rbp
    mov rbp, rsp

    ;; Compute pi 
    movsd xmm0, qword [limit]   
    call compute_pi
    ; Return value in xmm0  

    ;; Print result
    mov rdi, format
    mov al, 1
    call printf

    mov rax, 0
    pop rbp
    ret

compute_pi:
    push rbp
    mov rbp, rsp

    ; xmm0 = Approximation limit
    ; Return result in xmm0

    ; YOUR CODE HERE

    pop rbp
    ret

Note that I’ve changed all the constants to qwords (i.e., doubles), so don’t forget to use the double-precision versions of the various operations.

It should be possible to change the limit in the .text section to get more (or less) precise approximations.

Floating-point

Floating point representation

Representing 0

Binary representation

Mantissa

Exponent

Normalization

Infinity, NaN, and other special values

64-bit representations

x87 Floating Point Instructions

Initialization

Floating-point register stack

Writing to memory

Arithmetic operations

Floating-point function arguments

Floating-point comparisons

Mathematical functions

Switching between x87 and XMM modes

Floating-point example: computing π

XMM Floating-point instructions

Registers

Moving floating-point values

Storing floating-point values in .data

Floating-point conversions

Arithmetic

Floating-point comparisons

Floating-point arguments to functions

Floating-point example: computing π

More Floating-point

Floating-point example: computing π

In-class group exercise

Storing floating-point values in `.data`