Merge sort

Merge sort is a divide-and-conquer style algorithm for sorting a list. The following is a Haskell implementation of merge sort.

mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = merge (mergeSort ls) (mergeSort rs)
  where
    n  = length xs
    ls = take (n `div` 2) xs
    rs = drop (n `div` 2) xs

merge []     rs     = rs
merge ls     []     = ls
merge (l:ls) (r:rs) | l <= r    = l : merge ls     (r:rs)
                    | otherwise = r : merge (l:ls) rs

Merge sort sorts \(n\) elements in \(O(n \log n)\) time, which is optimal for a comparison-based sorting algorithm. However, what makes merge sort especially useful is its ability to sort enormous lists that don’t fit in RAM.

Sorting large lists

Suppose we want to sort a list that cannot fit in RAM — for example, we might want to sort a 100 gigabyte file on our hard drive. Merge sort handles this situation quite naturally. We choose a block size \(B\) that fits comfortably in RAM, and process the file in chunks of size \(B\). We sort each \(B\)-sized chunk and write it out to its own temporary file. Once we have sorted all \(B\)-sized chunks, we merge the chunk files into a single sorted file. Pseudo-code is below.

LARGE-MERGE-SORT(infile, outfile)
  // Sorting phase
  let fcount = 0
  while infile still has contents
    let x = the next B items from infile
    let y = SORT(x)
    write y to file fcount
    let fcount = fcount + 1

  // Merging phase
  for i = 0 to fcount - 1
    let f(i) be a file handle to file i
  K-WAY-MERGE([f(0), f(1), ... f(fcount - 1)], outfile)

K-WAY-MERGE(files, outfile)
  while length files > 0
    let i = arg min { peek next element of f | for all f in files }
    let x = read next element of files[i]

    if files[i] exhausted
      remove element i from files

    write x to outfile
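To make the merging phase concrete, here is a sketch of K-WAY-MERGE in Haskell that works over in-memory sorted lists rather than file handles (the name kWayMerge is ours):

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Repeatedly peek at the head of every non-exhausted list, emit the
-- smallest, and drop lists as they empty -- mirroring K-WAY-MERGE.
kWayMerge :: Ord a => [[a]] -> [a]
kWayMerge lists =
  case filter (not . null) lists of
    [] -> []
    ls -> let i = fst (minimumBy (comparing (head . snd)) (zip [0 ..] ls))
              (before, (x : rest) : after) = splitAt i ls
          in x : kWayMerge (before ++ rest : after)
```

For example, kWayMerge [[1,3],[2]] evaluates to [1,2,3]. A production version would use a priority queue of heads instead of a linear minimum scan, but the linear scan keeps the sketch short.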

Here we’ve relied on some in-memory sorting algorithm SORT. This can be any sorting algorithm you like, such as merge sort or quicksort.
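As an illustration, the whole two-phase algorithm can be modeled in memory in Haskell, using Data.List’s sort as SORT and a simple pairwise merge in place of K-WAY-MERGE (the names and the chunking helper are ours; real file I/O is elided, with chunk lists standing in for the temporary files):

```haskell
import Data.List (sort)

-- Sorting phase: split into B-sized chunks and sort each one.
-- Merging phase: repeatedly 2-way merge the sorted chunks.
externalSortModel :: Ord a => Int -> [a] -> [a]
externalSortModel b = mergeAll . map sort . chunksOf b
  where
    chunksOf _ [] = []
    chunksOf n xs = let (c, rest) = splitAt n xs in c : chunksOf n rest

    mergeAll []   = []
    mergeAll [xs] = xs
    mergeAll xss  = mergeAll (mergePairs xss)

    mergePairs (a : b' : rest) = merge2 a b' : mergePairs rest
    mergePairs xs              = xs

    merge2 [] rs = rs
    merge2 ls [] = ls
    merge2 (l:ls) (r:rs)
      | l <= r    = l : merge2 ls (r:rs)
      | otherwise = r : merge2 (l:ls) rs
```

Each round of mergePairs halves the number of chunk "files", just as a real external sort might merge temporary files in passes when too many are open at once.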

The correctness of this algorithm is readily apparent, since its structure exactly follows merge sort. Moreover, we never use more than \(O(B)\) memory at any given point.

Parallelizing the sort

The LARGE-MERGE-SORT algorithm described above has natural variants which can easily be parallelized. First, we can simply rely on an implementation of SORT which is itself parallel. However, we can do better. Disk I/O can be quite slow, so if we choose \(B\) such that two \(B\)-sized blocks fit in main memory, we can call SORT on one block while reading the next block into memory. Frameworks such as Hadoop take this a step further, distributing large chunks of the file to different machines to be sorted in parallel, and then handling the merge back into a single file.
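As a sketch of the first variant, the sorting phase can hand each chunk to its own lightweight thread using base’s Control.Concurrent (the function name and structure are ours; actual parallel speedup requires compiling with -threaded and running with +RTS -N):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.List (sort)

-- Sort each chunk on its own thread, forcing the work on that thread,
-- then collect the sorted chunks back in their original order.
sortChunksConcurrently :: [[Int]] -> IO [[Int]]
sortChunksConcurrently chunks = do
  vars <- mapM spawn chunks
  mapM takeMVar vars
  where
    spawn c = do
      v <- newEmptyMVar
      _ <- forkIO $ do
        let s = sort c
        length s `seq` putMVar v s  -- force the sort before publishing
      return v
```

Collecting results via one MVar per chunk preserves chunk order, so the merging phase can consume the outputs exactly as in the sequential version.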