Computer architecture

At a high level, a computer consists of a processor that executes instructions, memory to store data and the results of computations, and some capability for input and output. The operating system handles most of the low-level details of the actual architecture, so programmers can generally avoid worrying about such details when writing code. However, it is still important to be aware of the hardware your code is running on.

For our purposes a processor (CPU) is the piece of hardware that performs the arithmetic and logical operations that drive a computer. A CPU contains a number of registers, which are locations for storing data used in computations inside the processor. A CPU has a defined architecture, such as X86 or X86-64, which defines the set of instructions that the CPU is capable of executing; a computer program is a sequence of such instructions. We will focus our attention on X86-64, the 64-bit version of the X86 architecture that is commonly found in contemporary computers.

Memory hierarchy

There are roughly five levels of memory available to a CPU. When the processor needs to utilize a value stored in memory, it first checks to see if that value is stored in a register. If it isn’t, it checks the next layer in the hierarchy, the L1 cache. This process is repeated until it is found. Note that finding values in registers, cache, and RAM is handled by the hardware, while accessing disk drives or the network is controlled by the operating system. These levels, in order of increasing access time, are as follows.

  1. CPU registers.

  2. Cache memory. Small memory banks, ranging from tens of kilobytes (L1) to tens of megabytes (L3).

    The cache is often split into levels L1, L2, and L3, with L1 being the fastest (and smallest) and L3 being the largest (and slowest) memory.

  3. RAM, or main memory.

  4. Disk drives and related storage. This includes hard disk drives, solid state drives, and even tape archives. The speed of accessing these storage types varies.

  5. Networked storage. This includes requesting data from a database over the internet.

Each level of the memory hierarchy takes more time to read and write data than the one before it. The table below gives rough estimates of how long each level of the hierarchy takes to read or write data.

Memory type                                                  Access time
Register                                                     1 ns
L1 cache                                                     1 ns
L2 cache                                                     4 ns
RAM access                                                   100 ns
SSD read                                                     16,000 ns
HDD seek                                                     4,000,000 ns
Read 1,000,000 bytes from HDD                                2,000,000 ns
Network packet from California to the Netherlands and back  150,000,000 ns


From this table we can see that memory operations in cache are far faster than those in RAM, which are far faster than those reading or writing from disk or the network.

Cache efficiency

A cache efficient algorithm or program is one that utilizes the memory hierarchy of the machine efficiently. To understand how to do this we need to first understand the hierarchy of memory that a machine has available, and then understand how one particular level of that hierarchy, the cache, behaves.

At a high level, data can be stored in each layer of the cache. All data are replicated in each lower level; that is, if some data are in the L1 cache, they are also in the L2 cache, and so on. When the CPU requests data, each level of the cache is searched in turn, and RAM is checked if all cache checks fail. When the data are eventually found they are stored in every cache level and handed to the CPU. Because the data now reside in the cache, if the CPU asks for them again some time later there is a chance they can be retrieved from the L1, L2, or L3 cache without having to access main memory.

Of course, each level of the cache can only contain so much data. When a value needs to be stored in a cache layer, some existing data must be replaced. Choosing which data to replace to give efficient read performance is a classic research problem called the paging problem. We discuss strategies for this problem in the replacement strategies section.

The following sections discuss how the cache stores, writes, and retrieves data in more detail. It is important to be aware of this information when designing cache efficient programs, but note that this behavior cannot be controlled by your code.

The cache

The cache is actually a number of closely-related memory banks. The L1 cache is the smallest and fastest of these banks, followed by layers L2, L3, ... . However, the behavior of each cache layer can be modeled in the same fashion, and so for the remainder of this section we focus on any given layer of the cache.

Consider a cache layer of size \(M\) bytes. The cache is broken into \(M / L\) cache lines of length \(L\). Data is always stored and retrieved by cache line; we cannot read or write a particular byte from the cache without reading or writing the surrounding cache line. A cache line always holds data from a block of sequential memory addresses. Thus when we access a memory address, we are generally able to access the next several addresses at no additional cost. For this reason algorithms that read from sequential memory addresses typically outperform algorithms that access memory in a non-sequential order.
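
To see the effect of sequential access, consider summing the entries of an \(n \times n\) matrix stored in row-major order. The following sketch (illustrative, not from any particular library) traverses the same data twice: once along rows, touching consecutive addresses, and once along columns, jumping \(n\) elements at a time:

#include <cstddef>
#include <vector>

// Sum an n-by-n matrix stored in row-major order, visiting consecutive addresses.
auto sum_row_major(const std::vector<long>& a, std::size_t n) -> long
{
    long total = 0;
    for (std::size_t i = 0; i < n; i++) {
        for (std::size_t j = 0; j < n; j++) {
            total += a[i * n + j];  // consecutive addresses: each cache line is used in full
        }
    }
    return total;
}

// Sum the same matrix column by column, striding n elements between accesses.
auto sum_column_major(const std::vector<long>& a, std::size_t n) -> long
{
    long total = 0;
    for (std::size_t j = 0; j < n; j++) {
        for (std::size_t i = 0; i < n; i++) {
            total += a[i * n + j];  // stride of n elements: a new cache line per access for large n
        }
    }
    return total;
}

Both functions compute the same sum, but for large \(n\) the row-major traversal is typically several times faster, because every byte of each cache line fetched from memory is used before the line is evicted.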

Note that while the cache is small compared to main memory, even an 8 megabyte cache would be too large to search exhaustively to see if the requested data is in the cache. Additionally, when we want to store a value in the cache we need to choose where to put it. To handle this, one of several mapping strategies is employed.

  1. Direct mapping

    In this strategy each memory address can be assigned to exactly one cache line. You can imagine using the trailing bits of an address to determine the offset of the address in a cache line, and then some additional bits to determine which cache line we choose. For example, if the memory address is \(n\), then we could use \(n \bmod L\) as the offset within the cache line and \(\lfloor n / L \rfloor \bmod (M / L)\) as the cache line index.

  2. k-way associativity

    In a k-way associative cache each memory address can be assigned to k distinct cache lines. We call such a set of k cache lines a cache set, so the cache contains \(M / (k L)\) cache sets. Thus each memory address is mapped to a single cache set, which we could compute for a memory address \(n\) as \(\lfloor n / L \rfloor \bmod \frac{M}{k L}\). When reading from the cache we must check each of the k cache lines in the set for the value we want. When writing to the cache we must decide which of the existing k cache lines is replaced. Here we use a replacement strategy to decide which line is removed.
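
As a concrete sketch (the sizes below are made up purely for illustration), the index computations for a 32 KiB cache with 64-byte lines and 8-way associativity look like this:

#include <cstdint>
#include <cstdio>

int main()
{
    const std::uint64_t M = 32 * 1024;  // total cache size in bytes (illustrative)
    const std::uint64_t L = 64;         // cache line size in bytes
    const std::uint64_t k = 8;          // associativity

    std::uint64_t n = 0x7ffd12345678;   // an arbitrary memory address

    std::uint64_t offset     = n % L;                    // byte position within the cache line
    std::uint64_t line_index = (n / L) % (M / L);        // direct mapping: the single possible line
    std::uint64_t set_index  = (n / L) % (M / (k * L));  // k-way: the single possible set of k lines

    std::printf("offset=%llu line=%llu set=%llu\n",
                (unsigned long long)offset,
                (unsigned long long)line_index,
                (unsigned long long)set_index);
}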

Replacement strategies

A replacement strategy is a policy for deciding which cache line to remove from a cache set to make room for a new cache line. Two natural policies are as follows.

  1. First in, first out.

    Here we track the order in which lines are added to the cache, and always overwrite the oldest cache line.

    This strategy is very simple to implement, and works well when data is loaded into the cache and referenced only a few times before no longer being needed.

  2. Least recently used.

    Here we track the order in which the lines have been accessed, and always overwrite the least recently accessed cache line.

    This strategy requires more bookkeeping, but allows a single cache line to live in the cache for a very long time before being replaced if it is accessed very often.

There are of course many more strategies that could be employed.
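
To make the bookkeeping concrete, below is a minimal sketch of a single k-way cache set with least recently used replacement. Real hardware uses cheaper approximations of LRU, but the idea is the same:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// One k-way cache set with least recently used (LRU) replacement.
class CacheSet {
public:
    explicit CacheSet(std::size_t k) : k_(k) {}

    // Record an access to the line with the given tag. Returns true on a hit.
    // On a miss, the least recently used line is evicted if the set is full.
    bool access(std::uint64_t tag)
    {
        auto it = std::find(lines_.begin(), lines_.end(), tag);
        if (it != lines_.end()) {
            lines_.erase(it);       // hit: move the tag to the most recently used position
            lines_.push_front(tag);
            return true;
        }
        if (lines_.size() == k_) {
            lines_.pop_back();      // miss with a full set: evict the least recently used line
        }
        lines_.push_front(tag);
        return false;
    }

private:
    std::size_t k_;
    std::deque<std::uint64_t> lines_;  // front = most recently used, back = least recently used
};

A first in, first out set would differ only in the hit case: the tag would keep its position in the queue instead of moving to the front.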

Binary and hexadecimal

All data in a computer are represented as binary digits. Each architecture has a word size describing the natural size of data used in most instructions. We will concentrate on X86-64, which has a 64-bit word size. Hexadecimal is often used to represent binary data because it is more compact than writing binary; in particular, a 64-bit binary number can be represented with just 16 hexadecimal digits.

A binary number is a base-2 number, using the digits 0 and 1 to represent multiples of powers of 2; a hexadecimal number is a base-16 number, using the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F to represent multiples of powers of 16. (The digits A, B, C, D, E, and F represent 10, 11, 12, 13, 14, and 15, respectively.) Throughout this section we will disambiguate the radix of numbers by adding a \(b\) subscript to binary numbers, a \(d\) subscript to decimal numbers, and an \(h\) subscript to hexadecimal numbers. In the absence of a subscript we assume the number is decimal.

To begin, we represent the decimal number \(123_d\) in each format.

To convert between bases, we simply write the number as a sum of multiples of the appropriate base. Consider \(123_d\). The largest power of 2 less than 123 is 64, so we can write

\[\begin{split}\begin{align*} 123_d & = 64 + 32 + 16 + 8 + 2 + 1 \\ & = 1 \cdot 2^6 + 1 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0 \\ & = 1111011_b \end{align*}\end{split}\]
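
Similarly, \(123_d = 7 \cdot 16 + 11 = 7{\texttt{B}}_h\). These conversions are easy to check programmatically, for example with the C++ standard library:

#include <bitset>
#include <iostream>

int main()
{
    unsigned long long x = 123;
    std::cout << std::bitset<8>(x) << '\n';                // 01111011 (binary, padded to 8 bits)
    std::cout << std::hex << std::uppercase << x << '\n';  // 7B (hexadecimal)
}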

Below are some more examples of conversions between bases.

Binary Decimal Hexadecimal
\(1111011_b\) \(123_d\) \(7{\texttt{B}}_h\)
\(10000000000000000_b\) \(65536_d\) \(10000_h\)
\(1111111111111111_b\) \(65535_d\) \(\texttt{FFFF}_h\)

An important consequence of the word size of an architecture is the maximum value of an integer that can be supported with one word. Since we are limited to 64 bits, the largest integer we can represent naturally with a word is \(2^{64} - 1\), represented by 64 consecutive ones in binary. However, we often want to represent negative integers as well. For this we use two’s complement.

Two’s complement

We consider \(N\)-bit numbers. The two’s complement of an \(N\)-bit binary number \(x\) is \(2^N - x\). We represent positive integers as their natural form from above, and represent negative integers as the two’s complement of the absolute value of that number. As an example, take \(N = 4\) and \(x = 5\). Then we represent \(x\) as \(0101_b\) and \(-x\) as \(2^4 - x = 10000_b - 00101_b = 01011_b\). Since we only have 4 bits, we truncate the leading 0 and find \(-x\) is represented as \(1011_b\).

This representation allows us to express the integers between \(-2^{N - 1}\) and \(2^{N - 1} - 1\). Note that all negative numbers have a leading one bit while the non-negative numbers have a leading zero bit. Additionally, you can think of a negative number \(-x\) as a leading 1 bit followed by its offset from \(-2^{N - 1}\); i.e. \(1011_b\) is \(-2^{4 - 1} + 2 + 1 = -5_d\).

To see why two’s complement is a useful representation of the integers, consider the operations of addition, subtraction, and multiplication. Ignoring problems with overflow (i.e. operations that would produce values outside the range \([-2^{N - 1}, 2^{N - 1} - 1]\)), these operations can be performed directly on the bit patterns in their natural fashion and produce the correct result.
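
For example, the 4-bit case above carries over directly to 8-bit machine integers (here \(N = 8\)). The following sketch prints the bit patterns of 5 and -5 and shows that adding them as ordinary integers produces the correct result:

#include <bitset>
#include <cstdint>
#include <iostream>

int main()
{
    std::int8_t x = 5;
    std::int8_t neg_x = -5;

    // The bit patterns show the two's complement representation.
    std::cout << std::bitset<8>(static_cast<std::uint8_t>(x))     << '\n';  // 00000101
    std::cout << std::bitset<8>(static_cast<std::uint8_t>(neg_x)) << '\n';  // 11111011

    // Adding the two values with ordinary integer arithmetic gives 0,
    // exactly as the representation promises.
    std::cout << static_cast<int>(x) + static_cast<int>(neg_x) << '\n';     // 0
}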

CPU

The central processing unit (CPU) is the component of a computer that actually executes program instructions. A CPU has many components, but some important ones are

  • Control unit (CU)

    The control unit controls the execution of program instructions.

  • Arithmetic logic units (ALU)

    A digital circuit capable of performing arithmetic and bitwise operations on integer binary numbers.

  • Floating-point units (FPU)

    A digital circuit capable of performing arithmetic on floating point numbers.

  • Various registers

    A register is a storage location for binary data. These might be used to store data or addresses of instructions to be executed.

The basic life cycle of a program being run on a CPU is as follows. We maintain a program counter (PC) or instruction pointer (IP) in a dedicated register. The PC contains the address of the next instruction to be executed after the current instruction. If the current instruction is a branch, function call, or return instruction the PC might be overwritten as the result of the current instruction. In the case of a function call the current PC will be stored somewhere (e.g. on the stack).
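
The following toy model (purely illustrative, not how real hardware is built) shows how the PC is updated: by default it advances to the next instruction, a jump overwrites it, and a call saves the return address before overwriting it:

#include <cstddef>
#include <vector>

// A toy instruction set, just to illustrate how the program counter changes.
enum class Op { Work, Jump, Call, Return, Halt };

struct Instruction {
    Op op;
    std::size_t target;  // destination used by Jump and Call
};

void run(const std::vector<Instruction>& program)
{
    std::size_t pc = 0;                   // program counter: index of the next instruction
    std::vector<std::size_t> call_stack;  // saved return addresses

    while (pc < program.size()) {
        const Instruction inst = program[pc];
        pc += 1;                                             // by default the PC simply advances
        switch (inst.op) {
        case Op::Work:                              break;   // ordinary instruction
        case Op::Jump:   pc = inst.target;          break;   // branch: overwrite the PC
        case Op::Call:   call_stack.push_back(pc);           // save the return address...
                         pc = inst.target;          break;   // ...then jump to the function
        case Op::Return: pc = call_stack.back();             // restore the saved PC
                         call_stack.pop_back();     break;
        case Op::Halt:   return;
        }
    }
}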

Instruction pipeline

The instructions a CPU can execute are determined by its instruction set. In X86-64, instructions include: data instructions, such as move and copy; flow-control instructions, such as branching, looping, or calling a function; arithmetic instructions, such as add, subtract, multiply, and divide; and many others. A simple view of the CPU is that it executes a single instruction each clock cycle. However, this is not true in modern machines. For example, if our instruction is to add two numbers and one of them isn’t already in a CPU register, we have to wait many clock cycles for that data to be fetched from cache (or RAM). Additionally, modern CPUs implement pipelining.

Instruction pipelining is a technique for executing more than one instruction at the same time. If the CPU has several ALUs, for example, it can execute several add instructions per clock cycle. To allow for this, the CPU may load many instructions at once into its pipeline, executing instructions in the pipeline in parallel or even out of order. There are of course problems this could cause (e.g. reordering reads and writes of the same location is not always safe), but we let the CPU and compiler worry about this. However, there are still problems we do need to worry about. Consider the following C++ code.

auto f(long data[], long n) -> long
{
    auto result = 0L;  // deduce long, not int, so the sum is not truncated
    for (auto i = 0L; i < n; i++) {
        if (data[i] > 0) {
            result += data[i];
        } else {
            result -= 1;
        }
    }
    return result;
}

Suppose we were to try to load the instructions inside one loop iteration into the pipeline. There is a branch arising from the if-statement, and so we don’t know whether to load the instructions from the true branch or the instructions from the false branch. There are several choices the CPU can make: simply load a shorter pipeline, waiting until the result of the if-statement conditional is known and then loading the appropriate instruction sequence; guess which branch will be taken, and load that one; or load both branches and discard the results of the untaken branch later. In the first case we introduce extra latency, since we need to spend time loading a new pipeline. In the second case we might have the wrong instructions loaded, and we need to halt the pipeline, load the correct branch, and continue. In the third case both branches might not even fit inside the pipeline, and we’re left choosing between the first two options. Many CPUs choose option 2, called branch prediction.

Branch prediction

Branch prediction is the process of guessing which of two possible conditional branches will be loaded into the instruction pipeline. A correct guess results in fast execution of the code, while an incorrect prediction introduces latency as the pipeline must be cleared and reloaded. Thus it is important for the CPU to accurately predict which path is taken. (Note that we really only worry about this inside of a loop.)

The simplest approach we could consider is to always predict that a branch will follow the same path it did last time. The CPU could maintain a single bit from the last branch executed, where a value of 1 means it took the true path and 0 means it took the false path. When loading an instruction pipeline with a branch, we make our prediction based on this bit. For some inputs, this simple strategy is quite good. Consider the code from above running on a sorted array. Here some prefix of the data will all result in the false branch, while every entry after that prefix results in a true branch. Our branch predictor will predict all but one branch correctly after seeing the first result:

Result     FFFF...FTTTT...T
Prediction ?FFF...FFTTT...T
Miss       ?       X

However, if the data follow a simple pattern such as “true, true, false”, our predictions will be terrible:

Result     TTFTTFTTFTTFTTFTTF
Prediction ?TTFTTFTTFTTFTTFTT
Miss       ? XX XX XX XX XX X

Here we miss 2/3 of all predictions, worse than random guessing! A more reasonable approach is to use a saturating counter, which keeps a running tally of how often the branch has recently been true compared to false, up to some limiting value. With a 2-bit saturating counter we can do much better on the last example. Here we add one to the count (to a maximum of 3) whenever a branch is true and subtract one (to a minimum of 0) whenever it is false, so the counter never overflows. If the value is 0 or 1 we predict false, and if the value is 2 or 3 we predict true. Starting the counter at 1, on the sorted data we get:

Result     FFFF...FTTTT...T
Count      1000...00123...3
Prediction FFFF...FFFTT...T
Miss               XX

On the patterned data, we get:

Result     TTFTTFTTFTTFTTFTTF
Count      123233233233233233
Prediction FTTTTTTTTTTTTTTTTT
Miss       X X  X  X  X  X  X

Here we will eventually correctly predict 2/3 of the branches, twice as many as the simple predictor.
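
Both predictors are easy to simulate. The sketch below counts mispredictions for any sequence of outcomes, following the same conventions as the tables above: the last-outcome predictor has no information for the first branch, and the 2-bit counter starts at 1, predicting false for the values 0 and 1 and true for 2 and 3:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Mispredictions of a "predict whatever happened last time" strategy.
int misses_last_outcome(const std::vector<bool>& outcomes)
{
    int misses = 0;
    for (std::size_t i = 1; i < outcomes.size(); i++) {
        if (outcomes[i - 1] != outcomes[i]) misses++;  // the first branch is a free guess
    }
    return misses;
}

// Mispredictions of a 2-bit saturating counter (0, 1 -> predict false; 2, 3 -> predict true).
int misses_saturating_counter(const std::vector<bool>& outcomes)
{
    int misses = 0;
    int count = 1;  // same starting value as in the tables above
    for (bool taken : outcomes) {
        bool prediction = count >= 2;
        if (prediction != taken) misses++;
        count = taken ? std::min(count + 1, 3) : std::max(count - 1, 0);
    }
    return misses;
}

int main()
{
    // The repeating "true, true, false" pattern from the example above.
    std::vector<bool> pattern;
    for (int i = 0; i < 6; i++) {
        pattern.push_back(true);
        pattern.push_back(true);
        pattern.push_back(false);
    }

    std::cout << "last outcome:       " << misses_last_outcome(pattern)       << " misses\n";
    std::cout << "saturating counter: " << misses_saturating_counter(pattern) << " misses\n";
}

On this pattern the counter mispredicts only the false outcomes, about one third of the branches, which matches the table above.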

More complicated methods can be employed to detect these and other patterns.

Avoiding branch prediction failure

There are many ways to help your code avoid branch prediction failure. One way is to re-structure code so that conditional statements don’t appear inside of loops. This can be done with clever bitwise operations or simple code re-writes, e.g. changing

for x in data:
    if f(z):
        g(x)
    else:
        h(x)

into

if f(z):
    for x in data:
        g(x)
else:
    for x in data:
        h(x)

This is somewhat cumbersome because we are duplicating logic (i.e. the for-loop) but even this can be abstracted away:

def loop_with(data, func):
    for x in data:
        func(x)

loop_with(data, g if f(z) else h)
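
Another technique, hinted at above, is to use arithmetic or bitwise operations to remove the branch entirely. As a sketch (the function name is ours, and whether this is actually faster than the branching version depends on the compiler and the data, so it should be measured), the C++ function from the pipelining section could be rewritten as:

auto f_branchless(long data[], long n) -> long
{
    auto result = 0L;
    for (auto i = 0L; i < n; i++) {
        // positive is 1 when data[i] > 0 and 0 otherwise, so exactly one of
        // the two terms below contributes on each iteration.
        long positive = data[i] > 0;
        result += positive * data[i] - (1 - positive);
    }
    return result;
}

Modern compilers will often perform this kind of transformation, or emit a conditional move instruction, on their own, which is another reason to measure before and after rewriting.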