IEEE 754

IEEE 754 is a standard for floating-point arithmetic implemented in most hardware. The term floating-point refers to the fact that the radix point of these numbers can "float": it can sit anywhere relative to the significant digits, trading range against precision. In the abstract, we could represent a rational number \(x\) by

\[x = s \cdot a \cdot b^{c}\]

for the sign \(s \in \{-1, 1\}\), the mantissa or significand \(a \in \mathbb{Z}_{\geq 0}\), the base \(b > 1\), and the exponent \(c \in \mathbb{Z}\). In IEEE 754 the base is taken to be 2. In particular, a double-precision float consists of a total of 64 bits:

  • 1 bit for the sign of the number.

    A value of zero is used for positive numbers while a value of one is used for negative numbers; note that the zero value is, in fact, signed.

  • 11 bits for the exponent.

    The exponent is stored as a binary integer, but is interpreted as having a value 1023 less than stated. Thus the value 01111111111, which ordinarily represents 1023, indicates an exponent of zero. The minimum exponent that can be represented is 00000000001, representing -1022; the value 00000000000 is used to represent a signed zero value when the mantissa is zero, and subnormal numbers when the mantissa is nonzero. The maximum exponent that can be represented is 11111111110, representing 1023; the value 11111111111 is used to represent signed infinities when the mantissa is zero, and not-a-number values when the mantissa is nonzero.

  • 52 bits for the significand or mantissa.

    Since we assume the leading digit is a 1, we actually have 53 bits, only 52 bits of which are stored. The interpretation of the significand is a fractional binary number of the form \(1.b_{0}b_{1}\ldots b_{51}\), where \(b_0, b_1, \ldots, b_{51}\) are the digits of the significand.
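
This layout can be inspected directly by reinterpreting a double's bytes as a 64-bit integer. The following sketch uses Python's struct module; the helper name fp_parts is our own, not part of any standard API.

```python
import struct

def fp_parts(x):
    """Split a double into its sign, biased exponent, and stored significand bits."""
    # Reinterpret the 8 bytes of a big-endian double as a 64-bit unsigned integer.
    (bits,) = struct.unpack('>Q', struct.pack('>d', x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 stored significand bits
    return sign, exponent, mantissa

# -2.5 = -1 * 1.25 * 2**1, so the biased exponent is 1 + 1023 = 1024
# and the stored fraction 0.25 occupies the top two significand bits.
sign, exponent, mantissa = fp_parts(-2.5)
print(sign, exponent - 1023, 1 + mantissa / 2**52)  # 1 1 1.25
```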

Most languages use IEEE 754 doubles as their basic data type for rational numbers (C and C++ via double, Python via float). It is important to keep the limitations of floating-point arithmetic in mind.

Rounding error

Floating point numbers have finite precision and are represented in base 2. Finite precision implies that there is a smallest granularity we can represent; recall that our exponent can get no smaller than -1022. Base 2 implies that certain numbers cannot be represented exactly. Take for example the number 1/3. In base 10 it cannot be represented with finitely many digits; we often write 1/3 = 0.333333... In base 3, however, it would be represented exactly as 0.1. We run into similar problems in base 2; in particular, the decimal number 0.1 has no exact binary representation:

>>> print('{:.17f}'.format(0.1))
0.10000000000000001
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> 0.3 - 0.1 - 0.1 - 0.1
-2.7755575615628914e-17

As a general rule, you should not compare floats using equality unless you know exactly what you’re doing. For many problem domains it is appropriate to model two numbers that are “close enough” as being equal, e.g. using the following function to compare floating point numbers.

>>> def within(eps, a, b):
...     return abs(a - b) <= eps
>>> within(1e-16, 0.3, 0.1 + 0.1 + 0.1)
True
>>> within(1e-17, 0.3, 0.1 + 0.1 + 0.1)
False
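
The standard library ships a comparison helper along these lines: math.isclose, which combines a relative tolerance with an absolute one. A relative tolerance scales with the operands, which behaves better than a fixed epsilon across very different magnitudes.

```python
import math

# By default math.isclose uses a relative tolerance of 1e-9,
# which easily absorbs the ~5.6e-17 error in this sum.
print(math.isclose(0.3, 0.1 + 0.1 + 0.1))  # True

# With the relative tolerance disabled, an absolute tolerance of
# 1e-17 is tighter than the actual error, so the comparison fails.
print(math.isclose(0.3, 0.1 + 0.1 + 0.1, rel_tol=0, abs_tol=1e-17))  # False
```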

A more pernicious example is the following.

>>> a = float(2**53)
>>> b = 1
>>> a == a + b
True

Recall that the significand stores 52 bits, giving 53 bits of precision with the implicit leading 1. Since the exponent of a is 53, the least significant bit of the significand represents a value of \(2^{-52}\) relative to the leading 1, i.e. \(2^{-52} \cdot 2^{53} = 2\) in absolute terms. The value we are adding, \(1 = 2^{-53} \cdot 2^{53}\), falls below this spacing, so it does not modify the significand and the sum rounds back to a.
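
We can check the spacing directly with math.ulp (available in Python 3.9 and later), which returns the gap between a float and the next representable value of the same sign:

```python
import math

a = float(2**53)
# At this magnitude doubles are spaced 2 apart.
print(math.ulp(a))  # 2.0
print(a + 1 == a)   # True: 1 is at most half the spacing, so it rounds away
print(a + 2 == a)   # False: a + 2 is exactly representable
```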

Errors of this form can break “proofs” of correctness for algorithms that rely on exact arithmetic when deriving their correctness. A classic example would be an algorithm that relies on the product of a matrix and its inverse being the identity matrix. The matrix

\[\begin{split}A = \left[\begin{array}{cc}1 & 10000 \\1/9999 & 1\end{array}\right]\end{split}\]

has a condition number of 999900019995, and if we ask NumPy to compute \(A \cdot A^{-1}\) we find

\[\begin{split}A \cdot A^{-1} = \left[\begin{array}{cc}1 & 0 \\ 1.1102230246251565 \cdot 10^{-16} & 0.99999999999818101\end{array}\right].\end{split}\]
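
This computation can be reproduced as follows, assuming NumPy is available; the exact entries of the product may vary slightly across BLAS implementations, but the product will not be exactly the identity.

```python
import numpy as np

A = np.array([[1.0, 10000.0],
              [1.0 / 9999.0, 1.0]])

product = A @ np.linalg.inv(A)
# The product is numerically close to the identity, but not exactly it.
print(np.array_equal(product, np.eye(2)))  # False
print(np.allclose(product, np.eye(2)))     # True
```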

Overflow and underflow

Overflow occurs when a number is larger in magnitude than can be represented (i.e. very far from zero), while underflow occurs when a nonzero number is smaller in magnitude than can be represented (i.e. very close to zero).

As an example of overflow, consider the following.

>>> a = 1e308
>>> a == (a + a) - a
False
>>> a + a
inf

Before considering underflow, we need to understand subnormal numbers.

Subnormal numbers

If we use our existing definition of floating point numbers, we cannot represent numbers smaller in magnitude than \(2^{-1022}\) because of the implicit leading 1 in the significand. Thus the first three representable non-negative numbers are \(0\), \(2^{-1022}\), and \(2^{-1022} + 2^{-1022 - 52}\); the final value is found by flipping the least significant bit of the significand. Note that the gap between the first two numbers is a factor of \(2^{52}\) larger than the gap between the second two numbers. This means that we suddenly lose a huge amount of precision once values get close to zero. To avoid this, IEEE 754 defines subnormal numbers. In a subnormal number the leading digit of the significand is assumed to be a zero; thus by adding additional leading zeros we can represent smaller numbers than we otherwise could. This is only done when the exponent is at its smallest value.
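
With subnormals in hand we can return to underflow. The smallest positive subnormal double is \(2^{-1074}\) (the minimum exponent of -1022 combined with only the last significand bit set); halving it has nothing left to round to except zero:

```python
import sys

# The smallest positive *normal* double is 2**-1022.
print(sys.float_info.min)  # 2.2250738585072014e-308
smallest = 2**-1074        # the smallest positive subnormal double
print(smallest)            # 5e-324
# Halving it underflows to zero.
print(smallest / 2)        # 0.0
```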

Catastrophic cancellation

This refers to the scenario in which an operation on two numbers introduces relative error that is large compared to the absolute error. This can occur, for example, if we compute two values of the same magnitude and then subtract them. Suppose we perform calculations that on paper yield \(x\) and \(y\), but in floating point yield \(x + \epsilon\) and \(y + \delta\) due to floating point error. The difference \(z = x - y\) may be close to zero, but we instead compute \(z + \epsilon - \delta\). If \(z\) is small compared to \(\epsilon\) or \(\delta\), then the relative error \((\epsilon - \delta) / z\) can be extremely large.
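
As an illustration, consider evaluating \((1 - \cos x)/x^2\) for small \(x\), whose true value approaches \(1/2\). The subtraction \(1 - \cos x\) cancels nearly all significant digits; rewriting via the half-angle identity \(1 - \cos x = 2\sin^2(x/2)\) avoids the subtraction entirely. (This particular example and rewriting are our illustration, not part of the discussion above.)

```python
import math

x = 1e-8
# cos(x) rounds to exactly 1.0 at this magnitude, so the subtraction
# cancels every significant digit and the naive result is 0.
naive = (1 - math.cos(x)) / x**2
# The algebraically equivalent form avoids subtracting nearly equal
# numbers and recovers the correct answer.
stable = 2 * math.sin(x / 2)**2 / x**2
print(naive)   # 0.0
print(stable)  # approximately 0.5
```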