Divide and conquer algorithms

Some problems have special structure that allows one to break them down into several smaller subproblems and combine the results. This is the approach behind divide and conquer algorithms. They are typically easy to analyze and implement, and the ability to split a potentially large problem into many smaller ones makes this scheme well suited to parallelization.

Typically, the mathematical tool for analyzing divide and conquer algorithms is the recurrence relation. Given an input size \(n\), one can usually express the run time or memory usage \(T(n)\) in terms of smaller instances of the problem. For example, consider merge sort, which sorts a list by recursively sorting its two halves and merging the results. Its running time can be expressed as \(T(n) = 2T(n/2) + n\), with the first term capturing the time required to sort the two halves, and the second term representing the number of operations required to merge them. The base case for this recurrence could be a singleton list, giving \(T(1) = 1\). There are also more sophisticated choices for base cases: for instance, for lists of at most \(k\) elements one could switch to any preferred sorting algorithm, which would provide values of \(T(j)\) for all \(j \le k\).
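Unrolling this recurrence shows where its solution comes from. Assuming for simplicity that \(n\) is a power of two,

$$ T(n) = 2T(n/2) + n = 4T(n/4) + 2n = \cdots = 2^k T(n/2^k) + kn, $$

so after \(k = \log_2 n\) levels the recursion reaches the base case, and \(T(n) = n T(1) + n \log_2 n = O(n \log n)\).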

Master theorem

The master theorem is a mathematical result that gives the asymptotic behavior of \(T(n)\) when it is defined by a recurrence of a particular form. Suppose \(T(n)\) is defined as

$$ T(n) = aT(n/b) + f(n), $$

where \(a \ge 1\) and \(b > 1\). Then we have the following three cases.

  • If \(f(n) = O(n^c)\) where \(c < \log_b a\), then

    $$ T(n) = O(n^{\log_b a}). $$

  • If \(f(n) = O(n^c \log^k n)\) for some \(k \ge 0\) and \(c = \log_b a\), then

    $$ T(n) = O(n^c \log^{k+1} n). $$

  • If \(f(n) = \Omega(n^c)\) where \(c > \log_b a\), and additionally there exists \(p < 1\) such that \(a f(n/b) \le p f(n)\) for all sufficiently large \(n\), then

    $$T(n) = O(f(n)).$$

From our earlier merge sort example, \(a=2\), \(b=2\), \(c=1\), and \(k=0\). The second case applies, implying that \(T(n) = O(n \log n)\), as expected of an efficient sorting algorithm.
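The other two cases arise just as naturally; here are two standard examples not drawn from the text above. Karatsuba multiplication satisfies \(T(n) = 3T(n/2) + O(n)\), so \(a = 3\), \(b = 2\), and \(c = 1 < \log_2 3\); the first case gives

$$ T(n) = O(n^{\log_2 3}) \approx O(n^{1.585}). $$

Conversely, a recurrence such as \(T(n) = 2T(n/2) + n^2\) has \(c = 2 > \log_2 2 = 1\), and the regularity condition holds with \(p = 1/2\) because \(2(n/2)^2 = n^2/2\); the third case gives \(T(n) = O(n^2)\).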

Recursive calls

It is natural to write divide and conquer algorithms using explicit recursion. However, this is not always the best choice. Consider a naive implementation of a function that calculates terms of the Fibonacci sequence.

def fib_bad(n):
    # Base cases: the first two terms of the sequence are both 1.
    if n <= 1:
        return 1
    # Recompute both subproblems from scratch on every call.
    return fib_bad(n - 1) + fib_bad(n - 2)

The program outputs the correct result, but it takes a long time to run even for moderate values of \(n\). Using a profiler to investigate provides insight into why this is the case.

>>> import cProfile
>>> cProfile.run('fib_bad(30)')
         2692540 function calls (4 primitive calls) in 0.577 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
2692537/1    0.576    0.000    0.576    0.576 <stdin>:1(fib_bad)
        1    0.000    0.000    0.576    0.576 <string>:1(<module>)
        1    0.000    0.000    0.577    0.577 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Running fib_bad to find the 30th term requires 2,692,537 calls to fib_bad in total. This is because the program recalculates earlier terms of the sequence many, many times. To be precise, when we call fib_bad(30), the interpreter sees that it needs to find fib_bad(29) and fib_bad(28). It starts computing fib_bad(29), and adds fib_bad(28) to the call stack, which keeps track of the routines that remain to be executed. When the interpreter is done finding fib_bad(29), it pops the stack and executes fib_bad(28). The problem is that the program keeps no record of the work it just did to find the 29th term (which itself required finding the 28th), and therefore performs all of the same operations again. To make matters worse, this duplication happens at every level of the recursion (think about how many times fib_bad(5) is recalculated).

Even worse than that, there is memory overhead associated with pushing a function call onto the stack. If the call stack grows too large, the program can run out of memory and crash; this phenomenon is called a stack overflow. For this reason, many languages cap the maximum size of the call stack, and if the threshold is exceeded, the program is terminated.
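Python is one such language. CPython caps the recursion depth at 1000 frames by default, and exceeding the cap raises a RecursionError rather than crashing the process (traceback abbreviated below):

>>> import sys
>>> sys.getrecursionlimit()
1000
>>> def f(n): return f(n + 1)
...
>>> f(0)
Traceback (most recent call last):
  ...
RecursionError: maximum recursion depth exceeded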

To analyze this, we use the same approach as before and define the run time recursively as \(T(n) = T(n-1) + T(n-2) + 1\), where the final term accounts for the addition. Ironically, it takes more operations to find the \(n\)th term of the sequence this way than the magnitude of the \(n\)th term itself, and the Fibonacci sequence already grows exponentially.
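In fact, with base cases \(T(0) = T(1) = 1\), a short induction shows that \(T(n) = 2F(n) - 1\), where \(F(n)\) is the \(n\)th term of the sequence itself. This matches the profile above exactly: fib_bad(30) returns 1346269, and

$$ 2 \cdot 1346269 - 1 = 2692537 $$

is precisely the number of calls reported.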

The way around this is to write the program so that it reuses the work done previously. This usually involves a looping construct, or higher-order functions in functional programming languages. Doing this in Python is relatively straightforward.

def fib_loop(n):
    # Keep only the two most recent terms of the sequence.
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

It is straightforward to verify its correctness, and it consists only of a loop iterating \(n\) times. Using fib_loop, we can calculate the 250,000th term of the sequence in the time it took fib_bad to calculate the 30th. This does not mean that explicit recursion is always a poor choice, but rather that there are better implementations in some instances.
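If one prefers to keep the recursive structure, memoization achieves the same reuse of work. Below is a minimal sketch using functools.lru_cache from the standard library (fib_memo is our own name for it):

from functools import lru_cache

@lru_cache(maxsize=None)   # remember every previously computed term
def fib_memo(n):
    if n <= 1:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

Each term is now computed only once, but the recursion depth still grows with \(n\), so the call stack cap discussed above still applies for large inputs.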

Merge sort

Merge sort is a divide and conquer algorithm for sorting a list. The following is a Haskell implementation of merge sort.

mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = merge (mergeSort ls) (mergeSort rs)
  where
    n  = length xs
    ls = take (n `div` 2) xs
    rs = drop (n `div` 2) xs

merge []     rs     = rs
merge ls     []     = ls
merge (l:ls) (r:rs) | l <= r    = l : merge ls     (r:rs)  -- taking the left element first keeps the sort stable
                    | otherwise = r : merge (l:ls) rs

Like any asymptotically optimal comparison-based sorting algorithm, merge sort takes \(O(n \log n)\) time to sort \(n\) elements. What makes merge sort especially useful, however, is its ability to sort enormous lists that don’t fit in RAM.

Sorting large lists

Suppose we want to sort a list that cannot fit in RAM; for example, we might want to sort a 100 gigabyte file on our hard drive. Merge sort handles this situation quite naturally. We choose a block size \(B\) that fits comfortably in RAM, and process the file in chunks of size \(B\). We sort each \(B\)-sized chunk and write it to its own file. Once all of the chunks are sorted, we begin merging them into a single sorted file. Pseudo-code is below.

LARGE-MERGE-SORT(infile, outfile)
  // Sorting phase
  let fcount = 0
  while infile still has contents
    let x = the next B items from infile
    let y = SORT(x)
    write y to file fcount
    let fcount = fcount + 1

  // Merging phase
  for i = 0 to fcount - 1
    let f(i) be a file handle to file i
  K-WAY-MERGE([f(0), f(1), ... f(fcount - 1)], outfile)

K-WAY-MERGE(files, outfile)
  while length files > 0
    let i = arg min { peek next element of f | for all f in files }
    let x = read next element of files[i]

    if files[i] exhausted
      remove element i from files

    write x to outfile

Here we’ve relied on some in-memory sorting algorithm SORT. This can be any sorting algorithm you like, such as merge sort or quicksort.

The correctness of this algorithm is readily apparent, since its structure exactly follows merge sort. Moreover, we never use more than \(O(B)\) memory at any given point: the sorting phase holds one block at a time, and the merging phase holds only one element per chunk file (assuming the number of chunks does not exceed \(B\)).
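As a concrete illustration, here is a minimal Python sketch of the same scheme, assuming the input file holds one integer per line; large_merge_sort and block_size are our own names, and heapq.merge from the standard library plays the role of K-WAY-MERGE:

import heapq
import os
import tempfile

def large_merge_sort(infile, outfile, block_size=100_000):
    # Sorting phase: read B-sized blocks, sort each in memory,
    # and write it out to its own temporary chunk file.
    chunk_paths = []
    with open(infile) as f:
        while True:
            block = [int(line) for _, line in zip(range(block_size), f)]
            if not block:
                break
            block.sort()                      # any in-memory SORT works here
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as chunk:
                chunk.writelines(f"{x}\n" for x in block)
            chunk_paths.append(path)

    # Merging phase: heapq.merge lazily k-way-merges the sorted
    # chunk files, holding only one element per file in memory.
    chunks = [open(p) for p in chunk_paths]
    try:
        with open(outfile, "w") as out:
            streams = (map(int, c) for c in chunks)
            out.writelines(f"{x}\n" for x in heapq.merge(*streams))
    finally:
        for c in chunks:
            c.close()
        for p in chunk_paths:
            os.remove(p)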

Parallelizing the sort

The LARGE-MERGE-SORT algorithm described above has natural variants which can easily be parallelized. First, we can simply rely on an implementation of SORT which is itself parallel. However, we can do better. Disk I/O can be quite slow, so if we choose \(B\) such that two \(B\)-sized blocks can fit in main memory, we can be calling SORT on one block while reading the second block into memory. Frameworks such as Hadoop take this a step further, distributing large chunks of the file to difference machines to be sorted in parallel, and then handles the merge back into a single file.