Merge sort¶
Merge sort is a divide-and-conquer algorithm for sorting a list. The following is a Haskell implementation of merge sort.
mergeSort :: Ord a => [a] -> [a]
mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = merge (mergeSort ls) (mergeSort rs)
  where
    n  = length xs
    ls = take (n `div` 2) xs
    rs = drop (n `div` 2) xs

merge :: Ord a => [a] -> [a] -> [a]
merge [] rs = rs
merge ls [] = ls
merge (l:ls) (r:rs)
  | l <= r    = l : merge ls (r:rs)
  | otherwise = r : merge (l:ls) rs
Merge sort takes \(O(n \log n)\) time to sort \(n\) elements, which is optimal for a comparison-based sort. However, what makes merge sort especially useful is its ability to sort enormous lists that don't fit in RAM.
Sorting large lists¶
Suppose we want to sort a list that cannot fit in RAM: for example, a 100 gigabyte file on our hard drive. Merge sort handles this situation quite naturally. We choose a block size \(B\) small enough that a block fits comfortably in RAM, and process the file in chunks of size \(B\). We sort each \(B\)-sized chunk and write it to its own file. Once we have sorted all \(B\)-sized chunks, we merge the files into a single sorted file. Pseudo-code is below.
LARGE-MERGE-SORT(infile, outfile)
    // Sorting phase
    let fcount = 0
    while infile still has contents
        let x = the next B items from infile
        let y = SORT(x)
        write y to file fcount
        let fcount = fcount + 1
    // Merging phase
    for i = 0 to fcount - 1
        let f(i) be a file handle to file i
    K-WAY-MERGE([f(0), f(1), ..., f(fcount - 1)], outfile)

K-WAY-MERGE(files, outfile)
    while length files > 0
        let i = arg min { peek next element of f | for all f in files }
        let x = read next element of files[i]
        if files[i] exhausted
            remove files[i] from files
        write x to outfile
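As an illustration, the K-WAY-MERGE step can be sketched in the same Haskell style as before. Here each sorted run is an ordinary list standing in for a file of sorted output; the name kWayMerge and the list-of-lists representation are ours, not part of the pseudo-code above.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Merge k sorted lists into one sorted list. Each inner list stands in
-- for a sorted run file: repeatedly pick the run whose head is smallest,
-- emit that head, and drop the run once it is exhausted.
kWayMerge :: Ord a => [[a]] -> [a]
kWayMerge runs = go (filter (not . null) runs)
  where
    go [] = []
    go rs =
      let i        = fst (minimumBy (comparing (head . snd)) (zip [0 ..] rs))
          (x : xs) = rs !! i
          rs'      = take i rs ++ [xs | not (null xs)] ++ drop (i + 1) rs
      in  x : go rs'
```

This sketch scans all \(k\) runs to find the minimum on every step; a production external sort would instead keep the next element of each run in a priority queue, finding the minimum in \(O(\log k)\) time.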
Here we’ve relied on some in-memory sorting algorithm SORT. This can be any sorting algorithm you like, such as merge sort or quicksort.
The correctness of this algorithm is readily apparent, since its structure exactly follows merge sort. Crucially, we never use more than \(O(B)\) memory at any given point, no matter how large the input file is.
Parallelizing the sort¶
The LARGE-MERGE-SORT algorithm described above has natural variants which can easily be parallelized. First, we can simply rely on an implementation of SORT which is itself parallel. However, we can do better. Disk I/O can be quite slow, so if we choose \(B\) such that two \(B\)-sized blocks fit in main memory, we can call SORT on one block while reading the next block into memory. Frameworks such as Hadoop take this a step further, distributing large chunks of the file to different machines to be sorted in parallel, and then handling the merge back into a single file.
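The overlap between sorting and reading can be sketched in Haskell with a background thread per block. This is a simplified sketch under stated assumptions: blocks are plain in-memory lists rather than \(B\)-sized chunks read from disk, the name sortOverlapped is ours, and the sketch does not bound the number of in-flight blocks the way a careful double-buffering scheme would.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.List (sort)

-- Kick off a background sort of the current block, leaving the calling
-- thread free to fetch the next block while the sort runs. The result
-- of each sort is delivered through an MVar; `force` ensures the sort
-- actually happens on the worker thread rather than lazily at takeMVar.
sortOverlapped :: [[Int]] -> IO [[Int]]
sortOverlapped []             = return []
sortOverlapped (block : rest) = do
  done <- newEmptyMVar
  _    <- forkIO (putMVar done =<< evaluate (force (sort block)))
  -- ...meanwhile, the next block would be read from disk here...
  laterRuns <- sortOverlapped rest
  thisRun   <- takeMVar done
  return (thisRun : laterRuns)
```

Each sorted block would then be written out as a run file and fed to the merging phase exactly as in LARGE-MERGE-SORT.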