# Week 4: Randomized algorithms

## Part I: Sorting and selection

Last week, we introduced Mergesort, a recursive algorithm for sorting an array of $$n$$ elements in time $$O(n \log n)$$. Today we will examine another recursive algorithm whose worst-case running time is quadratic, but whose average-case running time matches that of Mergesort. We will also introduce a closely related algorithm for the selection problem: finding the $$k$$-th smallest element of an unsorted array (e.g. finding the median of a collection of numbers). Both algorithms are randomized and rely on a common partitioning procedure that we will analyze below.

### Quicksort

The sorting algorithm is called quicksort and relies on the following idea: suppose we have a procedure to split an array $$A$$ into two subarrays, $$A_{\ell}$$ and $$A_h$$, such that every element in $$A_{\ell}$$ is less than every element in $$A_h$$ (but $$A_{\ell}$$ and $$A_h$$ themselves may be unsorted). Then we can completely sort $$A$$ by applying the same procedure to $$A_{\ell}$$ and $$A_h$$ recursively. Indeed, quicksort applies this idea to $$A$$ as follows:

1. Pick a "pivot" element $$x \in A$$, and partition $$A$$ into two subarrays:
2. $$A^{-} := \{z \in A \mid z \leq x \}$$
3. $$A^{+} := \{ z \in A \mid z > x \}$$
4. Recursively apply the partitioning algorithm to $$A^{-}, A^{+}$$.

Some observations about the algorithm above are in order:

• The partitioning step can be done in $$O(n)$$ time in place (no extra memory needed).
• Moreover, partitioning the array can be done via a sequence of swaps, which have efficient low-level implementations.
• Unless the choice of pivot is very bad on multiple occasions, the generated subproblems have similar sizes, and the resulting runtime is asymptotically similar to that of mergesort.

#### The partition algorithm

To motivate the partition algorithm, let us examine a very small example: $$A = [1, 10, 9, 3, 4]$$. Suppose that we choose the last element, $$4$$, as our pivot. Then, a very natural way of partitioning $$A$$ into elements less than or equal to the pivot and the complement is the following:

• read the array "from left to right", and compare each element we read with the pivot $$x$$.
• if the element is less than or equal to $$x$$, "send" it to the front of the array
• if the element is larger than $$x$$, move to the next element.

This is similar to the formal partition algorithm we will introduce below. At a high level, the algorithm maintains two indices $$i$$ and $$j$$ that satisfy the following invariants (given $$A[1, \dots, n]$$):

• the elements from $$1$$ to $$i$$ have been found to be less than or equal to the pivot $$x$$.
• the elements from $$i + 1$$ to $$j - 1$$ are bigger than the pivot $$x$$.
• the elements from $$j$$ to $$n$$ have not been read yet.

We initialize $$i = 0$$ and $$j = 1$$, and increment $$j$$ at every step, doing the following:

1. if $$A[j] \leq x$$, we "swap" $$A[i+1]$$ with $$A[j]$$. Then we increment $$j := j + 1, i := i + 1$$.
2. if $$A[j] > x$$, we just increment $$j := j + 1$$.
###### Example

Here is an example of the partition algorithm in action, on the array $$A = [1, 10, 9, 3, 4]$$ from above with pivot $$x = 4$$:

• $$j = 1$$: $$A[1] = 1 \leq 4$$, so we swap $$A[i + 1] = A[1]$$ with itself and set $$i = 1$$. Array: $$[1, 10, 9, 3, 4]$$.
• $$j = 2$$ and $$j = 3$$: $$A[j] > 4$$, so we just advance $$j$$.
• $$j = 4$$: $$A[4] = 3 \leq 4$$, so we swap $$A[2]$$ with $$A[4]$$ and set $$i = 2$$. Array: $$[1, 3, 9, 10, 4]$$.
• finally, we swap the pivot $$A[5]$$ with $$A[i + 1] = A[3]$$. Array: $$[1, 3, 4, 10, 9]$$.

The following block includes Python code for the partition algorithm. Here, we also include the start and end indices $$p$$ and $$r$$ (inclusive) at which the algorithm will operate, to avoid creating a copy of $$A$$. It returns the index at which the pivot ends up: every element before that index in $$A[p, \dots, r]$$ is less than or equal to the pivot, and every element after it is greater. The pivot below is always chosen as $$A[r]$$, but the choice can be randomized by swapping a randomly chosen element with $$A[r]$$ and then running the algorithm below.

```python
def partition(A, p, r):
    pivot = A[r]                   # the pivot is the last element
    i = p - 1
    for j in range(p, r):          # stop at index r - 1
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    i += 1
    A[i], A[r] = A[r], A[i]        # move the pivot to its final position
    return i
```
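
To illustrate the randomized pivot choice mentioned above, here is a minimal sketch (the wrapper name `randomized_partition` is ours, not part of the original notes):

```python
import random

def randomized_partition(A, p, r):
    # choose a pivot uniformly at random among A[p..r], move it to
    # position r, and reuse the deterministic partition above
    s = random.randint(p, r)       # random.randint is inclusive on both ends
    A[s], A[r] = A[r], A[s]
    return partition(A, p, r)
```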


With partition at hand, we are ready to give a recursive implementation of quicksort:

```python
def quicksort_aux(A, p, r):
    if p < r:
        q = partition(A, p, r)     # after this call, A[q] is in its final place
        quicksort_aux(A, p, q - 1)
        quicksort_aux(A, q + 1, r)

def quicksort(A):
    quicksort_aux(A, 0, len(A) - 1)
```
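
As a quick sanity check, here is a small usage sketch:

```python
A = [1, 10, 9, 3, 4]
quicksort(A)                       # sorts A in place
print(A)                           # prints [1, 3, 4, 9, 10]
```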


#### Complexity of quicksort

The complexity of quicksort with randomly chosen pivot is:

• $$O(n \log n)$$ expected time complexity, which is often observed in practice
• $$O(n^2)$$ worst-case time complexity, which is rarely the case in practice
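
To see where these bounds come from, write $$T(n)$$ for the running time on $$n$$ elements; the following is a back-of-the-envelope sketch rather than a formal proof. Since partitioning costs $$O(n)$$, the runtime satisfies

$$T(n) = T(|A^{-}|) + T(|A^{+}|) + O(n).$$

If every pivot splits the array roughly in half, this becomes $$T(n) = 2 T(n/2) + O(n) = O(n \log n)$$, just like mergesort; if every pivot is the largest remaining element, it becomes $$T(n) = T(n - 1) + O(n) = O(n^2)$$.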

The memory/space complexity is $$\Theta(n)$$, dominated by the input array itself. Importantly, and in contrast to mergesort, the above algorithm sorts in place and never allocates any additional arrays.

### Quickselect

Before we introduce quickselect, we formally state the selection problem.

$$\texttt{Select}(A, k)$$: given an unsorted array $$A$$ with $$n$$ elements, find its $$k$$-th smallest element.

Examples:

1. $$k = 1$$: find the minimum element of $$A$$
2. $$k = n$$: find the maximum element of $$A$$
3. $$k = \frac{n}{2}$$: find the median of $$A$$.

Finding the minimum and maximum elements of $$A$$ takes time $$\Theta(n)$$, by looping over the array and keeping track of the minimum / maximum encountered so far. The selection problem for arbitrary $$k$$ does not admit such an obvious algorithm, but does admit a worst-case $$O(n)$$ algorithm. Here, we will present an expected-time $$O(n)$$ algorithm, using the same partitioning procedure that we built quicksort on.

To demonstrate, suppose $$k = \frac{n}{2}$$, so that we are looking for the median of the array. Again, pick a pivot $$x \in A$$ and partition $$A$$ into $$A^{-}$$ and $$A^{+}$$ as before. Then, observe the following:

• if $$| A^{-} | > \frac{n}{2}$$ , there are at least $$n / 2$$ elements $$\le x$$, so the median must also be $$\le x$$, and located in $$A^{-}$$.
• if $$| A^{+} | > \frac{n}{2}$$, there are at least $$n / 2$$ elements $$> x$$, so the median must also be $$> x$$, and located in $$A^{+}$$.

You should take some time to check these properties yourselves. If you are convinced that they hold, note that the above hints towards a recursive algorithm. In the first case, we can solve the median finding problem on $$A^{-}$$; however, in the second case, since we will be discarding the $$|A^{-}|$$ smallest elements of the original array, we are no longer looking for the median of $$A^{+}$$, but rather we are looking for the $$(\frac{n}{2} - |A^{-}|)$$-th smallest element of $$A^{+}$$.

The Python code for quickselect follows below. As with quicksort, we need to know the start and end indices $$p$$ and $$r$$ of the subarray we are processing at each step.

```python
def quickselect_aux(A, p, r, k):
    if p == r:
        return A[p]
    i = partition(A, p, r)          # index of the split
    s = i - p + 1                   # rank of the pivot within A[p..r]
    if k == s:
        return A[i]
    elif k < s:
        return quickselect_aux(A, p, i - 1, k)
    else:
        return quickselect_aux(A, i + 1, r, k - s)

def quickselect(A, k):
    return quickselect_aux(A, 0, len(A) - 1, k)
```
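
A brief usage sketch, reusing the array from the partition example:

```python
print(quickselect([1, 10, 9, 3, 4], 1))   # 1, the minimum
print(quickselect([1, 10, 9, 3, 4], 3))   # 4, the median
```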


#### Complexity of quickselect

The quickselect algorithm requires a linear amount of memory, $$\Theta(n)$$, dominated by the input array, and (just like quicksort) does not need to allocate any additional arrays. Its time complexity ranges between $$O(n)$$ and $$O(n^2)$$:

• Worst-case: $$O(n^2)$$ - even when $$k = 1$$!
• Average-case: $$O(n)$$ - sequence of bad pivots very unlikely.

Note: It turns out that choosing the pivot in quickselect can be done in a principled way that guarantees a more-or-less even split into subproblems. The procedure is called median-of-medians and the resulting algorithm bootstraps itself using quickselect. You can find more details in these slides.

## Part II: Algorithms for data streams

We now consider algorithms for data streams. So far, all algorithms we have examined operate under the following assumptions:

1. the input size (i.e. number of "items" in the input) is bounded by some integer $$n$$, and
2. the aforementioned number $$n$$ is small enough so that it is possible to store the data in memory.

However, data streams often violate both of these assumptions. Formally, a data stream is a sequence of items

$$x_0, x_1, x_2, \dots$$

which are presented to us in a sequential fashion (by convention, the subscript $$i$$ in $$x_i$$ denotes the time step at which $$x_i$$ becomes available). Moreover, even when the stream is finite, it is common to assume that it is too large to store in its entirety. Two representative tasks on data streams are:

• Sampling: an important task is selecting $$k \ll n$$ elements at random from a (huge) stream of length $$n$$. This task is closely related to query optimization in databases.
• Counting: counting the number of events in a stream (e.g. number of items $$x_i$$ falling in an interval $$[a, b]$$) or the number of distinct elements in a stream.

Because of the aforementioned limitations, memory / space complexity becomes important here!

### Sampling elements from a stream

Let us try and motivate an important algorithm for sampling, by working our way up from the simplest approach.

#### A naive algorithm

Suppose that the stream has length $$n$$ and that we are actually able to store it in its entirety. Then the obvious algorithm is to store the stream in an array $$A$$ and generate $$k$$ distinct indices in $$\{1, \dots, n\}$$ at random. However, this suffers from the following limitations:

1. as $$n \to \infty$$, storing becomes more and more "expensive", even though $$k$$ may remain fixed
2. the above algorithm does not efficiently produce "intermediate" solutions: if someone queries us for $$k$$ random elements among the stream encountered so far, we must generate $$k$$ "fresh" random indices. Instead, we would like to do minimal extra work to return an intermediate solution, even if the stream is not over.

#### A better solution using random tags

To arrive at a better solution, we make the following observation: suppose that each element was instead a random number, uniformly distributed in $$[0, 1]$$. In that case, selecting $$k$$ elements at random would be equivalent to selecting the $$k$$ smallest elements, since every subset of size $$k$$ has the same probability of being the subset of $$k$$ smallest elements.

This general principle will be very useful for the forthcoming algorithm. Indeed, we adopt the following approach: we generate a random "tag" (number uniformly distributed in $$[0, 1]$$) for each element $$x_i$$ in the stream at the time it becomes available. Since all tags are i.i.d., we have reduced our problem to maintaining the $$k$$ smallest elements of a stream over time. This is possible using a max-heap.

The algorithm is outlined below.

Algorithm:

1. Initialize a max-heap to hold elements, using their tags as keys.
2. At step $$i$$, generate $$\mathsf{tag}_i \sim \mathrm{Unif}[0, 1]$$.
3. For all $$i \in \{0, \dots, k - 1\}$$, insert the elements $$(\mathsf{tag}_i, x_i)$$ into the heap.
4. For any $$i \geq k$$, "peek" at the maximum-tag element of the heap. Then:
   • if the maximum tag is smaller than $$\mathsf{tag}_i$$, leave the heap unchanged.
   • otherwise, remove the max-tag element from the heap and insert $$(\mathsf{tag}_i, x_i)$$.
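
Here is a minimal Python sketch of this algorithm (our own, assuming the stream is any iterable); note that `heapq` implements a min-heap, so we store negated tags to simulate a max-heap:

```python
import heapq
import random

def sample_with_tags(stream, k):
    heap = []                          # entries are (-tag, element)
    for x in stream:
        tag = random.random()          # tag ~ Unif[0, 1]
        if len(heap) < k:
            heapq.heappush(heap, (-tag, x))
        elif tag < -heap[0][0]:        # tag smaller than current max tag
            heapq.heapreplace(heap, (-tag, x))
    return [x for _, x in heap]        # k uniformly sampled elements
```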

###### Example

Consider a stream [1, 5, 10, 13, 9] and suppose we want to sample $$k = 2$$ elements at random. We go through the following steps:

1. Initialize heap = [].
2. Next element: 1 - $$\mathsf{tag}_0 = 0.1$$. Insert into heap: heap = [(0.1, 1)].
3. Next element: 5 - $$\mathsf{tag}_1 = 0.71$$. Insert into heap: heap = [(0.71, 5), (0.1, 1)].
4. Next element: 10 - $$\mathsf{tag}_2 = 0.27$$. Since we already have $$k$$ elements in the heap, run the check: $$\mathsf{tag}_2 = 0.27 < 0.71$$, the max-tag in the heap. Therefore, we pop and insert: heap = [(0.27, 10), (0.1, 1)].
5. Next element: 13 - $$\mathsf{tag}_3 = 0.39$$. The maximum tag in the heap ($$0.27$$) is smaller, so we leave the heap unchanged.
6. Next element: 9 - $$\mathsf{tag}_4 = 0.07$$. Run the check: $$\mathsf{tag}_4 = 0.07 < 0.27$$, the max-tag in the heap. Therefore, we pop and insert: heap = [(0.1, 1), (0.07, 9)].
7. Return the two elements from the heap, since the stream is over: [1, 9].

Let us now examine the complexity of the above algorithm:

• Time: $$O(\log k)$$ work at each step, since the most costly operations are popping / inserting an item.
• Space: $$\Theta(k)$$, since we are only storing the heap. This is optimal.

However, the $$O(\log k)$$ time spent on each step looks a little bit awkward. After all, we are only replacing one element at a time! Indeed, there is an algorithm doing $$O(1)$$ work at each step.

#### Reservoir sampling

The reservoir sampling algorithm avoids the use of random tags and attains the optimal time and space complexity.

Let us first illustrate the $$k = 1$$ case. We initialize a temporary storage with the first element, $$R = [x_1]$$ (in this section we index the stream from $$1$$). When the $$n$$-th element $$x_n$$ arrives:

• with probability $$\frac{1}{n}$$, we replace the element in storage with $$x_n$$: $$R = [x_n]$$.
• with probability $$1 - \frac{1}{n}$$, we throw $$x_n$$ away.

Note

To simulate an event that happens with $$\frac{1}{n}$$ probability, we can use the following trick:

1. Generate a random number $$Z \sim \mathrm{Unif}[0, 1]$$.
2. Return $$\mathbf{1}\{Z \leq \frac{1}{n}\}$$.

The indicator function above will be $$1$$ with probability $$\frac{1}{n}$$.
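
For concreteness, a one-line Python version of this trick (the function name is ours):

```python
import random

def happens_with_prob_one_over(n):
    # random.random() ~ Unif[0, 1), so this is True with probability 1/n
    return random.random() <= 1.0 / n
```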

The above admits a straightforward generalization for $$k$$ elements (see the code sketch after the list). We initialize a temporary storage $$R$$ using the first $$k$$ elements of the stream. Then:

• when $$x_n$$ arrives, we replace one element in $$R$$ with $$x_n$$, with probability $$\frac{k}{n}$$.
• the element in $$R$$ that is replaced by $$x_n$$ is chosen uniformly at random.
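
Here is a minimal Python sketch of these rules (the function name is ours):

```python
import random

def reservoir_sample(stream, k):
    R = []
    for n, x in enumerate(stream, start=1):   # n = number of elements read
        if n <= k:
            R.append(x)                       # always keep the first k elements
        elif random.random() <= k / n:        # with probability k/n ...
            R[random.randrange(k)] = x        # ... overwrite a uniform slot
    return R
```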

An example follows below.

###### Example

(Figure omitted: the stream is shown on the left, with blue indicating the next element, and the temporary storage on the right; here $$k = 1$$.)

##### Why reservoir sampling works

We sketch the proof for the case $$k = 1$$ below. We wish to prove that having read $$n$$ elements from the stream, every element appears in our storage $$R$$ with equal probability $$\frac{1}{n}$$. The proof is by induction on $$n$$, and generalizes in a straightforward manner for $$k > 1$$.

1. Base case: $$n = 1$$. This is trivial, since we always include the first element in the storage.

2. Inductive hypothesis: suppose the claim is true up to some arbitrary index $$n \in \mathbb{N}$$, i.e. $$\mathbb{P}(R = [x_i]) = \frac{1}{n}, \; \forall i \leq n$$.

3. Step: consider the next index $$n + 1$$. Then:

   • $$\mathbb{P}(R = [x_{n+1}]) = \frac{1}{n+1}$$, since the algorithm includes $$x_{n+1}$$ with probability $$\frac{1}{n+1}$$.

   • the probability of leaving $$R$$ unchanged is $$1 - \frac{1}{n+1} = \frac{n}{n+1}$$. By our inductive hypothesis, every other element appears in $$R$$ at time step $$n$$ with probability $$\frac{1}{n}$$. Therefore, the probability that a fixed element $$x_i \neq x_{n+1}$$ appears in $$R$$ at step $$n + 1$$ is $$\frac{n}{n+1} \cdot \frac{1}{n} = \frac{1}{n+1}$$.

In both cases, every element appears in $$R$$ with probability $$\frac{1}{n+1}$$ after $$n + 1$$ elements, which completes the induction.
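
As a quick empirical sanity check of this claim (our own sketch, not part of the proof), we can simulate the $$k = 1$$ algorithm many times and verify that the counts come out roughly uniform:

```python
from collections import Counter
import random

def reservoir_one(stream):
    R = None
    for n, x in enumerate(stream, start=1):
        if random.random() <= 1.0 / n:    # replace with probability 1/n
            R = x
    return R

# each of 0..4 should be returned roughly 20% of the time
counts = Counter(reservoir_one(range(5)) for _ in range(100_000))
print(counts)
```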