Week 4: Randomized algorithms
Part I: Sorting and selection
Last week, we introduced Mergesort, a recursive algorithm for sorting an array of \( n \) elements in time \( O(n \log n) \). Today we will examine another recursive algorithm whose worst-case running time can be quadratic, but whose average-case running time matches that of Mergesort. We will also introduce a closely related algorithm for the selection problem: finding the \(k\)th smallest element of an unsorted array (e.g. finding the median of a collection of numbers). Both algorithms are randomized and rely on a common partitioning procedure that we will analyze below.
Quicksort
The sorting algorithm is called quicksort and relies on the following idea: suppose we have a procedure to split an array \( A \) into two subarrays, \( A_{\ell} \) and \( A_h \), such that every element in \( A_{\ell} \) is less than every element in \( A_h \) (but \( A_{\ell}, A_h \) themselves may be unsorted). Then we can completely sort \( A \) by applying the same procedure to \( A_{\ell} \) and \( A_h \) recursively. Indeed, quicksort applies this idea to \( A \) as follows:
- Pick a "pivot" element \(x \in A\), and partition \(A\) into two subarrays:
  - \( A^{-} := \{ z \in A \mid z \leq x \} \)
  - \( A^{+} := \{ z \in A \mid z > x \} \)
- Recursively sort \( A^{-} \) and \( A^{+} \) in the same manner.
Some observations about the algorithm above are in order:
- The partitioning step can be done in \( O(n) \) time in place (no extra memory needed).
- Moreover, partitioning the array can be done via a sequence of swaps, which have efficient low-level implementations.
- Unless the choice of pivot is very bad on multiple occasions, the subproblems generated have similar sizes and result in a runtime asymptotically similar to mergesort.
The partition algorithm
To motivate the partition algorithm, let us examine a very small example: \( A = [1, 10, 9, 3, 4] \). Suppose that we choose the last element, \(4\), to become our pivot. Then, a very natural way of partitioning \( A \) into elements less than or equal to the pivot and the complement is the following:
- read the array "from left to right", and compare each element we read with the pivot \( x \);
- if the element is less than or equal to \( x \), "send" it to the front of the array;
- if the element is larger than \( x \), move on to the next element.
This is similar to the formal partition algorithm we will introduce below. At a high level, the algorithm maintains two indices \( i \) and \( j \), that satisfy (given \(A[1, \dots, n]\)):
- the elements from \(j\) to \(n\) have not been read yet;
- the elements from \(1\) to \(i\) have been found to be less than or equal to the pivot \(x\);
- the elements from \(i + 1\) to \(j - 1\) are bigger than the pivot \(x\).
We initialize \(i = 0\) and \(j = 1\), and increment \(j\) at every step, doing the following:
- if \(A[j] \leq x\), we "swap" \(A[i+1]\) with \(A[j]\), then increment both \(i := i + 1\) and \(j := j + 1\);
- if \(A[j] > x\), we just increment \(j := j + 1\).
Example
Here is an example of the partition algorithm in action:
The following block includes Python code for the partition algorithm. Here, we also include the start and end indices \(p\) and \(r\) (inclusive) at which the algorithm will operate, to avoid creating a copy of \(A\). It returns the index at which the pivot ends up: every element of \( A[p, \dots, r] \) before that index is less than or equal to the pivot, and every element after it is greater. The pivot below is always chosen as \(A[r]\), but it can be randomized by swapping a randomly chosen element with \(A[r]\) and then running the algorithm below.
```python
def partition(A, p, r):
    pivot = A[r]
    i = p - 1
    for j in range(p, r):  # stop at index r-1
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    i += 1
    A[i], A[r] = A[r], A[i]  # move the pivot to its final position
    return i
```
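As a quick sanity check, here is the function above run on the motivating example \( A = [1, 10, 9, 3, 4] \) (the function is repeated so that the snippet is self-contained):

```python
def partition(A, p, r):
    # Lomuto-style partition, exactly as above: the pivot is A[r]
    pivot = A[r]
    i = p - 1
    for j in range(p, r):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    i += 1
    A[i], A[r] = A[r], A[i]
    return i

A = [1, 10, 9, 3, 4]
q = partition(A, 0, len(A) - 1)
print(A, q)  # A becomes [1, 3, 4, 10, 9]; the pivot 4 lands at index q = 2
```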
With partition at hand, we are ready to give a recursive implementation of quicksort:
```python
def quicksort_aux(A, p, r):
    if p < r:
        q = partition(A, p, r)
        quicksort_aux(A, p, q - 1)
        quicksort_aux(A, q + 1, r)

def quicksort(A):
    quicksort_aux(A, 0, len(A) - 1)
```
Complexity of quicksort
The complexity of quicksort with a randomly chosen pivot is:
- \( O(n \log n) \) expected time complexity, which is what is typically observed in practice;
- \( O(n^2) \) worst-case time complexity, which rarely occurs in practice.
The memory/space complexity is \( \Theta(n) \). Importantly, and in contrast to mergesort, the above algorithm never allocates any additional arrays.
Quickselect
Before we introduce quickselect, we formally state the selection problem.
\( \texttt{Select}(A, k) \): given an unsorted array \(A\) with \(n\) elements, find its \(k\)th smallest element.
Examples:
- \(k = 1\): find the minimum element of \(A\);
- \(k = n\): find the maximum element of \(A\);
- \(k = \frac{n}{2}\): find the median of \(A\).
Finding the minimum and maximum elements of \(A\) takes time \(\Theta(n)\), by looping over the array and keeping track of the minimum / maximum encountered so far. The selection problem for arbitrary \( k \) does not admit such an obvious algorithm, but does admit a worst-case \(O(n)\) algorithm. Here, we will present an expected-time \(O(n)\) algorithm, using the same partitioning procedure that we built quicksort on.
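For reference, selection can always be solved in \(O(n \log n)\) time by sorting a copy of the array and indexing into it; a one-line baseline (the function name is ours, for illustration):

```python
def select_by_sorting(A, k):
    """Baseline selection: sort a copy of A and return its k-th smallest element."""
    return sorted(A)[k - 1]

print(select_by_sorting([1, 10, 9, 3, 4], 3))  # the 3rd smallest element is 4
```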
To demonstrate, suppose \(k = \frac{n}{2}\), so that we are looking for the median of the array. Again, pick a pivot \(x \in A\) and partition \(A\) into \(A^{-}\) and \(A^{+}\) like before. Then, observe the following:
- if \( |A^{-}| \geq \frac{n}{2} \), there are at least \( n / 2 \) elements \( \le x \), so the median must also be \(\le x\), and located in \(A^{-}\);
- if \( |A^{+}| > \frac{n}{2} \), there are more than \( n / 2 \) elements \(> x\), so the median must also be \(> x\), and located in \(A^{+}\).
You should take some time to check these properties yourselves. If you are convinced that they hold, note that the above hints towards a recursive algorithm. In the first case, we can solve the median-finding problem on \( A^{-} \); however, in the second case, since we will be discarding the \( |A^{-}| \) smallest elements of the original array, we are no longer looking for the median of \( A^{+} \), but rather for the \( (\frac{n}{2} - |A^{-}|) \)th smallest element of \( A^{+} \).
The Python code for quickselect follows below. As with quicksort, we need to know the start and end indices, p and r, of the subarray we are processing at each step.
```python
def quickselect_aux(A, p, r, k):
    if p == r:
        return A[p]
    i = partition(A, p, r)  # index of the split
    s = i - p + 1  # size of the lower half of the partition (pivot included)
    if k == s:
        return A[i]
    elif k < s:
        return quickselect_aux(A, p, i - 1, k)
    else:
        return quickselect_aux(A, i + 1, r, k - s)

def quickselect(A, k):
    return quickselect_aux(A, 0, len(A) - 1, k)
```
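To see the code in action, the snippet below finds the median of the running example (partition and quickselect are repeated from above so that the snippet runs standalone):

```python
def partition(A, p, r):
    # as defined earlier: Lomuto partition with pivot A[r]
    pivot = A[r]
    i = p - 1
    for j in range(p, r):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    i += 1
    A[i], A[r] = A[r], A[i]
    return i

def quickselect_aux(A, p, r, k):
    if p == r:
        return A[p]
    i = partition(A, p, r)
    s = i - p + 1
    if k == s:
        return A[i]
    elif k < s:
        return quickselect_aux(A, p, i - 1, k)
    else:
        return quickselect_aux(A, i + 1, r, k - s)

def quickselect(A, k):
    return quickselect_aux(A, 0, len(A) - 1, k)

print(quickselect([1, 10, 9, 3, 4], 3))  # the median of 5 elements: 4
```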
Complexity of quickselect
The quickselect algorithm requires a linear amount of memory, \(\Theta(n)\), and does not need to allocate any additional memory (just like quicksort). Its time complexity ranges between \(O(n)\) and \(O(n^2)\):
- Worst-case: \(O(n^2)\), even when \(k = 1\)!
- Average-case: \(O(n)\), since a sequence of bad pivots is very unlikely.
Note: It turns out that choosing the pivot in quickselect can be done in a principled way that guarantees a more-or-less even split into subproblems. The procedure is called median-of-medians, and the resulting algorithm bootstraps itself using quickselect. You can find more details in these slides.
Part II: Algorithms for data streams
We now consider algorithms for data streams. So far, all algorithms we have examined operate under the following assumptions:
- the input size (i.e. the number of "items" in the input) is bounded by some integer \(n\), and
- the aforementioned number \(n\) is small enough that it is possible to store the data in memory.
However, data streams often violate both of these assumptions. Formally, a data stream is a sequence of items \( x_1, x_2, x_3, \dots \) which are presented to us in a sequential fashion (by convention, the subscript \(i\) in \(x_i\) denotes the time step at which \(x_i\) becomes available). Moreover, even when the stream is finite, it is common to assume its size is too large to allow storing it in its entirety.
Examples: some common tasks on data streams follow below:
- Sampling: an important task is selecting \(k \ll n\) elements at random from a (huge) stream of length \(n\). This task is closely related to query optimization in databases.
- Counting: counting the number of events in a stream (e.g. the number of items \(x_i\) falling in an interval \([a, b]\)), or the number of distinct elements in a stream.
Because of the aforementioned limitations, memory / space complexity becomes important here!
Sampling elements from a stream
Let us try and motivate an important algorithm for sampling, by working our way up from the simplest approach.
A naive algorithm
Suppose that the stream has length \(n\) and that we are actually able to store it in its entirety. Then the obvious algorithm is to store the stream in an array \(A\), and generate \(k\) distinct indices in \(\{1, \dots, n\}\) at random. However, this suffers from two obvious limitations:
- as \(n \to \infty\), storage becomes more and more "expensive", even though \(k\) may remain fixed;
- the above algorithm does not efficiently produce "intermediate" solutions: if someone queries us for \(k\) random elements among the stream encountered so far, we must generate \(k\) "fresh" random indices. Instead, we would like to do minimal processing to return an intermediate solution, even if the stream is not over.
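The naive approach can be sketched in a couple of lines using Python's standard library (the function name is ours, for illustration):

```python
import random

def naive_sample(stream, k):
    """Store the entire stream, then pick k distinct positions at random."""
    A = list(stream)            # requires Theta(n) memory!
    return random.sample(A, k)  # k distinct elements, chosen uniformly

random.seed(0)
print(naive_sample(range(1000), 3))  # three distinct elements of the stream
```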
A better solution using random tags
To arrive at a better solution, we make the following observation: suppose that each element was instead a random number, uniformly distributed in \([0, 1]\). In that case, selecting \(k\) elements at random would be equivalent to selecting the \(k\) smallest elements, since every subset of size \(k\) has the same probability of being the subset of \(k\) smallest elements.
This general principle will be very useful for the forthcoming algorithm. Indeed, we adopt the following approach: we generate a random "tag" (a number uniformly distributed in \([0, 1]\)) for each element \(x_i\) in the stream at the time it becomes available. Since all tags are i.i.d., we have reduced our problem to maintaining the \(k\) smallest-tag elements of the stream over time. This is possible using a max-heap.
The algorithm is outlined below.
Algorithm:
1. Initialize a max-heap to hold elements, using their tags as keys.
2. At step \(i\), generate \(\mathsf{tag}_i \sim \mathrm{Unif}[0, 1]\).
3. For all \(i \in \{0, \dots, k - 1\}\), insert the elements \((\mathsf{tag}_i, x_i)\) into the heap.
4. For any \(i \geq k\), "peek" at the maximum-tag element of the heap. Then:
   - if the maximum tag is smaller than \(\mathsf{tag}_i\), leave the heap unchanged;
   - otherwise, remove the max-tag element from the heap and insert \((\mathsf{tag}_i, x_i)\).
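The outline above can be turned into a short Python sketch. The standard `heapq` module provides a min-heap, so we store negated tags in order to peek at and remove the maximum-tag element efficiently (the function name is ours):

```python
import heapq
import random

def sample_with_tags(stream, k):
    """Keep the k elements with the smallest random tags seen so far."""
    heap = []  # entries are (-tag, item): a min-heap on -tag acts as a max-heap on tag
    for x in stream:
        tag = random.random()                   # tag ~ Unif[0, 1]
        if len(heap) < k:
            heapq.heappush(heap, (-tag, x))
        elif tag < -heap[0][0]:                 # smaller than the current maximum tag
            heapq.heapreplace(heap, (-tag, x))  # pop the max-tag element, insert the new one
    return [item for _, item in heap]

random.seed(0)
print(sample_with_tags(range(100), 5))  # five uniformly chosen elements
```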
Example
Consider a stream `[1, 5, 10, 13, 9]` and suppose we want to sample \(k = 2\) elements at random. We go through the following steps:
- Initialize `heap = []`.
- Next element: 1, with \(\mathsf{tag}_0 = 0.1\). Insert into heap: `heap = [(0.1, 1)]`.
- Next element: 5, with \(\mathsf{tag}_1 = 0.71\). Insert into heap: `heap = [(0.71, 5), (0.1, 1)]`.
- Next element: 10, with \(\mathsf{tag}_2 = 0.27\). Since we already have \(k\) elements in the heap, run the check:
  - \(\mathsf{tag}_2 < 0.71\), the max tag in the heap. Therefore, we pop and insert: `heap = [(0.27, 10), (0.1, 1)]`.
- Next element: 13, with \(\mathsf{tag}_3 = 0.39\). The maximum tag in the heap is smaller, so we leave it unchanged.
- Next element: 9, with \(\mathsf{tag}_4 = 0.07\). Run the check:
  - \(\mathsf{tag}_4 < 0.27\), the max tag in the heap. Therefore, pop and insert: `heap = [(0.1, 1), (0.07, 9)]`.
- Return the two elements from the heap, since the stream is over: `[1, 9]`.
Let us now examine the complexity of the above algorithm.
- Time: \(O(\log k)\) work at each step, since the most costly operations are popping / inserting an item.
- Space: \(\Theta(k)\), since we are only storing the heap. This is optimal.
However, the \(O(\log k)\) time spent on each step looks a little bit awkward. After all, we are only replacing one element at a time! Indeed, there is an algorithm doing \(O(1)\) work at each step.
Reservoir sampling
The reservoir sampling algorithm avoids the use of random tags and attains the optimal time and space complexity.
Let us first illustrate the \(k = 1\) case. We initialize a temporary storage \(R = [x_1]\). When \(x_n\) arrives (\(n \geq 2\)):
- with probability \(\frac{1}{n}\), we replace the element in storage with \(x_n\): \(R = [x_n]\);
- with probability \(1 - \frac{1}{n}\), we throw \(x_n\) away.
Note
To simulate an event that happens with probability \(\frac{1}{n}\), we can use the following trick:
- Generate a random number \(Z \sim \mathrm{Unif}[0, 1]\).
- Return \(\mathbf{1}\{Z \leq \frac{1}{n}\}\).
The indicator function above will be \(1\) with probability \(\frac{1}{n}\).
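In Python, the trick is a one-liner, using `random.random()` as the source of \(\mathrm{Unif}[0, 1]\) draws (the function name is ours):

```python
import random

def happens_with_prob(p):
    """Return True with probability p, i.e. the indicator 1{Z <= p} for Z ~ Unif[0, 1]."""
    return random.random() <= p

random.seed(0)
freq = sum(happens_with_prob(1 / 4) for _ in range(100_000)) / 100_000
print(freq)  # should be close to 0.25
```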
The above admits a straightforward generalization to \(k\) elements. We initialize a temporary storage \(R\) using the first \(k\) elements of the stream. Then:
- when \(x_n\) arrives (for \(n > k\)), we replace one element in \(R\) with \(x_n\), with probability \(\frac{k}{n}\);
- the element in \(R\) that is replaced by \(x_n\) is chosen uniformly at random.
An example follows below.
Example
In the example below, \(k = 1\). Stream is shown on the left (blue indicates next element), and temporary storage is shown on the right.
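For completeness, here is a minimal Python sketch of reservoir sampling for general \(k\) (the function name is ours). A single uniform index in \(\{0, \dots, n-1\}\) covers both decisions at once: it is smaller than \(k\) with probability \(\frac{k}{n}\), and conditioned on that, it is uniform over the \(k\) slots of \(R\):

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform sample of k elements from a stream of unknown length."""
    R = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            R.append(x)                # the first k elements fill the reservoir
        else:
            j = random.randrange(n)    # uniform over {0, ..., n-1}
            if j < k:                  # happens with probability k/n
                R[j] = x               # the replaced slot is uniform over R
    return R

random.seed(0)
print(reservoir_sample(range(1000), 4))  # four uniformly chosen elements
```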
Why reservoir sampling works
We sketch the proof for the case \(k = 1\) below. We wish to prove that having read \(n\) elements from the stream, every element appears in our storage \(R\) with equal probability \(\frac{1}{n}\). The proof is by induction on \(n\), and generalizes in a straightforward manner for \(k > 1\).

Base case: \(n = 1\). This is trivial, since we always include the first element in the storage.

Inductive hypothesis: suppose the claim is true up to some arbitrary index \(n \in \mathbb{N}\), i.e. \(\mathbb{P}(R = [x_i]) = \frac{1}{n}, \; \forall i \leq n\).

Step: consider the next index \(n + 1\). Then:
- \(\mathbb{P}(R = [x_{n+1}]) = \frac{1}{n+1}\), since we always include \(x_i\) with probability \(\frac{1}{i}\);
- the probability of leaving \(R\) unchanged is \(1 - \frac{1}{n+1} = \frac{n}{n+1}\). By our inductive hypothesis, every other element appears in \(R\) at time step \(n\) with probability \(\frac{1}{n}\). Therefore, the probability that a fixed element \(x_i \neq x_{n+1}\) appears in \(R\) at step \(n + 1\) is \(\frac{n}{n+1} \cdot \frac{1}{n} = \frac{1}{n+1}\).
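One can also check the claim empirically: the sketch below (for \(k = 1\)) repeats the procedure many times on a short stream and verifies that each element is kept with frequency close to \(\frac{1}{n}\):

```python
import random
from collections import Counter

def reservoir_one(stream):
    """k = 1 reservoir sampling: the n-th element replaces R with probability 1/n."""
    R = None
    for n, x in enumerate(stream, start=1):
        if random.random() <= 1 / n:
            R = x
    return R

random.seed(0)
trials = 100_000
counts = Counter(reservoir_one(range(5)) for _ in range(trials))
for x in range(5):
    print(x, counts[x] / trials)  # each frequency should be close to 1/5
```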
Further reading
- Quicksort: Chapter 7 of "Introduction to Algorithms" (3rd ed.), Cormen et al. (2009).
- Quickselect: Section 9.2 of "Introduction to Algorithms" (3rd ed.), Cormen et al. (2009).