IEEE 754

“There are 10 kinds of people in this world: those who understand binary numerals, and those who don’t.”

—Ian Stewart

IEEE 754 is a standard for floating-point arithmetic implemented in most hardware. The term floating-point refers to the fact that these numbers can move the decimal point in a rational number to adjust the precision. In the abstract, we could represent a rational number \(x\) by

$$x = s a b^c$$

for the sign \(s \in \{-1, 1\}\), the significand \(a \in [1,b)\), the base \(b > 1\), and the exponent \(c \in \mathcal{Z}\). In IEEE 754 we take the base of the exponent to be 2. In particular, a double-precision float consists of a total of 64 bits:

  • 1 bit for the sign of the number.

    A value of zero is used for positive numbers while a value of one is used for negative numbers; note that the zero value is, in fact, signed.

  • 11 bits for the exponent.

    The exponent is stored as a binary integer, but is interpreted as having a value 1023 less than stated. Thus the value 01111111111, which ordinarily represents 1023, indicates an exponent of zero. The minimum exponent that can be represented is 00000000001, representing -1022; the value 00000000000 is used to represent a signed zero value when the mantissa is zero, and subnormal numbers when the mantissa is nonzero. The maximum exponent that can be represented is 11111111110, representing 1023; the value 11111111111 is used to represent signed infinities when the mantissa is zero, and not-a-number values when the mantissa is nonzero.

  • 52 bits for the mantissa

    Since we assume the leading digit is a 1, we actually have 53 bits, only 52 bits of which are stored. The interpretation of the significand is a fractional binary number of the form \(1.b_{0}b_{1}\ldots b_{51}\), where \(b_0, b_1, \ldots, b_{51}\) are the digits of the mantissa.

Most languages use IEEE754 doubles as their basic data type for rational numbers (C and C++ via double, Python via float). In addition to representing floating point values, IEEE 754 also allows representation of two other numeric data types.


There is a systematic way to represent infinity by using floating point. This is done by setting all the exponent bits to 1, and all the mantissa bits to 0. The sign bit can take either 0 or 1, and this allows one to represent both positive and negative infinity. IEEE 754 specifies that 1.0/0.0 should return positive infinity, and -1.0/0.0 should return negative infinity. Some languages follow this, while other languages (e.g. Python) do something different, so it is important to know in advance how these corner cases will be handled. Regardless, floating point infinity follows all the usual conventions.

>>> a = float('inf')
>>> a
>>> a + 1
>>> 2 * a
>>> -a
>>> 1/a
>>> a == float('inf')

The last line shows that it is easy to check for positive infinity using a simple conditional statement. Even though the behavior is fairly predictable, it is worth mentioning that none of these statements will throw an exception. Generally, it is better for code to fail in a more noticeable manner, so unless you want to explicitly deal with infinity, it is worth checking for any functions that may return them (for example, logarithmic functions).


NaN stands for Not a Number, and is used to represent a numerically undefined value. This is done by setting all exponent bits to 1 (as with infinity), and having at least one bit in the significand set to 1. IEEE 754 specifies that 0.0/0.0 should return NaN. In addition to this, abusing floating point infinity can lead to NaN showing up in your numerical values.

>>> a = float('inf')
>>> a/a
>>> a - a

NaN can be particularly bad because it is the result of numerical code quietly failing, and it can have disasterous consequences for any remaining computations.

>>> b = float('nan')
>>> b
>>> b + 1
>>> b - b
>>> 4 * b
>>> b / b
>>> b == b
>>> b != b
>>> b == float('nan')

The last three lines show another problem: detectability. IEEE 754 states that NaN == NaN should always return False, and this is one (clumsy) way to check for NaN. Usually, there is a language-specific function that checks for NaN. In the case of Python, one can use NumPy.

>>> import numpy as np
>>> b = float('nan')
>>> b
>>> np.isnan(b)

Depending on the language, any mathematical function that is undefined for a specific input will usually return NaN when implemented. For example, one can expect log(-1.0) and sqrt(-1.0) to return NaN unless there is some built-in capability for imaginary values. To deal with NaN, catch a numerical problem before it happens, and write code that throws exceptions instead of returning NaN.

Rounding error

It is important to remember the limitations of floating-point arithmetic. Floating point numbers have finite precision and are represented in base 2. Finite precision implies that there is a smallest granularity we can represent; recall that our exponent can get no smaller than -1022. Base 2 implies that certain numbers cannot be represented exactly. Take for example the number 1/3. In base 10 this cannot be represented exactly; we often write 1/3 = 0.333333... However, in base 3 this would be represented exactly as 0.1. We run into similar problems in base 2; in particular,

>>> print('{:.17f}'.format(0.1))
>>> 0.1 + 0.1 + 0.1 == 0.3
>>> 0.3 - 0.1 - 0.1 - 0.1

As a general rule, you should not compare floats using equality unless you know exactly what you’re doing. For many problem domains it is appropriate to model two numbers that are “close enough” as being equal, e.g. using the following function to compare floating point numbers.

>>> def within(eps, a, b):
        return abs(a - b) <= eps
>>> within(1e-16, 0.3, 0.1 + 0.1 + 0.1)
>>> within(1e-17, 0.3, 0.1 + 0.1 + 0.1)

A more pernicious example is the following.

>>> a = float(2**53)
>>> b = 1
>>> a == a + b

Recall that the mantissa has 52 bits and 53 bits of precision. Since the exponent of a has a value of 53, the least significant digit of the mantissa represents a value of \(2^{-52}\), so the value we are adding — \(1 = 2^{-53} \cdot 2^{53}\) — does not modify the mantissa.

Errors of this form can break “proofs” of correctness for algorithms that rely on exact arithmetic when deriving their correctness. A classic example would be an algorithm that relies the product of a matrix and its inverse being the identity matrix. The matrix

\[\begin{split}A = \left[\begin{array}{cc}1 & 10000 \\1/9999 & 1\end{array}\right]\end{split}\]

has a condition number of 999900019995, and if we ask Numpy to compute \(A \cdot A^{-1}\) we find

\[\begin{split}A \cdot A^{-1} = \left[\begin{array}{cc}1 & 0 \\ 1.1102230246251565 \cdot 10^{-16} & 0.99999999999818101\end{array}\right].\end{split}\]

Overflow and underflow

Overflow is defined as when a number is larger in magnitude that can be represented (e.g. very far from zero), while underflow is when a number is smaller in magnitude that can be represented (e.g. very close to zero).

As an example of overflow, consider the following.

>>> a = 1e 308
>>> a == a + (a - a)
>>> a == (a + a) - a
>>> a + a

This is another example of how finite precision can break mathematical theorems taken for granted, such as commutivity of addition. Before considering underflow, we need to understand subnormal numbers.

Subnormal numbers

If we use our existing definition of floating point numbers, we cannot represent numbers smaller in magnitude than \(2^{-1022}\) because of the implicit leading 1 in the significand. Thus the first 3 representable non-negative numbers are \(0\), \(2^{-1022}\), and \(2^{-1022} + 2^{-1022 - 52}\); the final value is found by flipping the least significant bit of the significand. Note that the gap between the first two numbers is 52 orders of magnitude larger than the gap between the second two numbers. This means that we suddenly lose a huge amount of precision once values get close to zero. To avoid this, IEEE defines subnormal numbers. In a subnormal number the leading digit of the significand is assumed to be a zero; thus by adding additional leading zeros we can represent smaller numbers that we otherwise could. This is only done when the exponent is its smallest value. With the inclusion of subnormal numbers, the first three non-negative representable values are \(0\), \(2^{-1022-52}\), and \(2^{-1022-51}\).


If at any point during the calculation an intermediate value is less than \(2^{-1022-52}\) in magnitude, the value is rounded to zero. For example, consider

>>> b = 2**(-1022-52)
>>> b
>>> b == b/2 * 2
>>> b/2

Again, one must be mindful of underflow and overflow influencing intermediate calculations in a routine.

Numerical analysis

Since floating point only offers a finite amount of precision, it is important to be able to analyze how small perturbations in the input of a numerical function can affect its output. This field is called numerical analysis, and it is studied widely in many different subject areas pertaining to numerical computation.

There are two ways of measuring the quantity of error in numerical computations. The first is called absolute error, which refers to the difference

$$ \text{Err}(\hat{x}, x) = \hat{x} - x, $$

and the second is called relative error, defined as

$$ \text{Rel}(\hat{x}, x) = \frac{\hat{x} - x}{x}. $$

Usually, relative error is more informative about the quantity of error since it measures error proportionally to the true quantity. One goal of numerical analysis is for a given function \(f\) to write \(\text{Rel}(f(\hat x),f(x))\) (this is called propagated error) in terms of \(\text{Rel}(\hat{x},x)\). A well conditioned function is one where a small perturbation in its input results in a small perturbation in its output, whereas a poorly conditioned function is just the opposite.

This can also be done for operators. Suppose \(\hat{x} = x + \delta\) and \(\hat{y} = y + \epsilon\). How does the product \(\hat{x}\hat{y}\) compare with the true answer \(xy\)? We can show

$$\begin{align} \hat{x}\hat{y} &= (x+\delta)(y+\epsilon) \\ &= xy + \epsilon x + \delta y + \delta \epsilon, \end{align}$$

Subtracting and dividing both sides by \(xy\) gives us

$$ \text{Rel}(\hat{x}\hat{y},xy) = \text{Rel}(x,\hat{x}) + \text{Rel}(y,\hat{y}) + \text{Rel}(x,\hat{x})\cdot\text{Rel}(y,\hat{y}). $$

If the initial relative errors of \(\hat{x}\) and \(\hat{y}\) are very small in magnitude, then the relative error of their product is approximately additive and will not grow large. Similarly, one can derive for division

$$ \text{Rel}(\hat{x}/\hat{y},x/y) = \frac{\text{Rel}(\hat{x},x) - \text{Rel}(\hat{y},y)}{1-\text{Rel}(\hat{y},y)}. $$

Usually relative error is very small in magnitude, and therefore, the relative error of the quotient is approximately the difference between that of the dividend and divisor. Thus, division does not cause relative error to explode. Numerically speaking, multiplication and division are relatively safe operations.

Catastrophic cancellation

The same cannot be necessarily said for addition and subtraction. It is relatively straightforward to show that the absolute error of \(\hat{x} + \hat{y}\) is additive, or in other words,

$$ \text{Err}(\hat{x}\pm\hat{y},x\pm y) = \text{Err}(\hat{x},x)\pm\text{Err}(\hat{y},y). $$

The story for relative error can be different. Adding two numbers of the same sign (or substracting those with different signs) is numerically safe, and will not cause the relative error of the result to grow large. However, adding two numbers with opposite sign (or subtracting two numbers with the same sign) can lead to a large increase in relative error.

For example, consider the function

$$ f(x) = \frac{e^x - 1}{x}. $$

This function is difficult to compute for small values of \(x\), since \(e^x\) is very close to 1 in this range. When the subtraction is performed, many of the first bits in the mantissa will agree and will be subtracted off. But there are only 53 bits of precision in the significand, and we cannot magically regain precision for the bits that are now zero in the beginning of the mantissa. This phenomenon is called catastrophic cancellation, and it can wreak havoc on the relative error in numerical calculations.

Let’s see how this affects the result. We can construct a poor implementation of this function in Python using numpy.exp. But as with all things in life, there is a better solution. The function numpy.expm1 calculates the numerator to avoid catastrophic cancellation.

>>> import numpy as np
>>> ##
>>> bad_fn  = lambda x: (np.exp(x)-1)/x
>>> good_fn = lambda x: (np.expm1(x))/x
>>> rel_err = lambda x: (bad_fn(x) - good_fn(x))/good_fn(x)
>>> ##
>>> print("{:.17f}".format(bad_fn(1e-10))
>>> print("{:.17f}".format(good_fn(1e-10))
>>> rel_err(1e-10)

The relative error means that bad_fn is only accurate to 8 digits. Double precision offers a relative error of \(2^{-52}\), which is about 16 decimal digits. Using a bad implementation costed half of the precision available to us, and the error can add up over time.

How can errors like these be avoided? Using built-in functions is the best option, but if that is not an option, there are other ways. Sometimes, these errors can be avoided exactly by reformulating a function you are trying to compute (for example, how can you compute \(\log(x+1) - \log(x)\) for large \(x\)?). If that is not possible, then there are numerically stable ways to approximate these functions using Taylor expansions. For instance, for \(x < 10^{-8}\), what is an effective way to compute \(f(x)\) from above?


Although algorithms may change over time, numerical analysis has been relatively consistent for the past 40 years. I highly suggest An Introduction to Numerical Analysis (2nd Edition) by Kendall Atkinson, which covers numerical analysis in a variety of contexts. Of course, there are entire courses that cover this in many different applications, from computations involving linear algebra and matrices to solving differential equations and minimizing functions.