# Week 4: Randomized algorithms

## Part I: Sorting and selection

Last week, we introduced Mergesort, a recursive algorithm for sorting an array of $$n$$ elements in time $$O(n \log n)$$. Today we will examine another recursive algorithm whose worst-case running time can be quadratic, but its average-case running time is at least as good as the runtime of Mergesort. We will also introduce a closely related algorithm for the selection problem: finding the $$k$$-th largest element of an unsorted array (e.g. finding the median of a collection of numbers). Both algorithms are randomized and rely on a common partitioning procedure that we will analyze below.

### Quicksort

The sorting algorithm is called quicksort and relies on the following idea: suppose we have a procedure to split an array $$A$$ into two subarrays, $$A_{\ell}$$ and $$A_h$$, such that every element in $$A_{\ell}$$ is less than every element in $$A_h$$ (but $$A_{\ell}, A_h$$ themselves may be unsorted). Then we can completely sort $$A$$ by applying the same procedure to the $$A_{\ell}, A_h$$ recursively. Indeed, quicksort applies this idea to $$A$$ as below:

1. Pick a "pivot" element $$x \in A$$, and partition $$A$$ into two subarrays:
2. $$A^{-} := \{z \in A \mid z \leq x \}$$
3. $$A^{+} := \{ z \in A \mid z > x \}$$
4. Recursively apply the partitioning algorithm to $$A^{-}, A^{+}$$.

Some observations about the algorithm above are in order:

• The partitioning step can be done in $$O(n)$$ time in place (no extra memory needed).
• Moreover, partitioning the matrix can be done via a sequence of swaps, which have efficient low-level implementations.
• Unless the choice of pivot is very bad on multiple occasions, the subproblems generated have similar sizes and result in a runtime asymptotically similar to mergesort.

#### The partition algorithm

To motivate the partition algorithm, let us examine a very small example: $$A = [1, 10, 9, 3, 4]$$. Suppose that we choose the last element, $$4$$, to become our pivot. Then, a very natural way of partitioning $$A$$ into elements less than or equal to the pivot and the complement is the following:

• read the array "from left to right", and compare each element we read with the pivot $$x$$.
• if the element is less than or equal to $$x$$, "send" it to the front of the array
• if the element is larger than $$x$$, move to the next element.

This is similar to the formal partition algorithm we will introduce below. At a high level, the algorithm maintains two indices $$i$$ and $$j$$, that satisfy (given $$A[1, \dots, n]$$):

• any elements from $$j$$ to $$n$$ have not been read yet.
• any elements from $$1$$ to $$i$$ have been found to be less than or equal to the pivot $$x$$.
• any elements between $$i + 1$$ and $$j$$ are bigger than the pivot $$x$$.

We initialize $$i = 0$$ and $$j = 1$$, and increment $$j$$ at every step, doing the following:

1. if $$A[j] \leq x$$, we "swap" $$A[i+1]$$ with $$A[j]$$. Then we increment $$j := j + 1, i := i + 1$$.
2. if $$A[j] > x$$, we just increment $$j := j + 1$$.
###### Example

Here is an example of the partition algorithm in action:

The following block includes Python code for the partition algorithm. Here, we also include the start and end indices $$p$$ and $$r$$ (inclusive) at which the algorithm will operate, to avoid creating a copy of $$A$$. It returns the index of the split, i.e., the last index in $$A[p, \dots, r]$$ containing an element less than the chosen pivot. The pivot below is always chosen as $$A[r]$$, but can be randomized by swapping a randomly chosen element with $$A[r]$$ and then running the algorithm below.

def partition(A, p, r):
pivot = A[r]
i = p - 1
for j in range(p, r):      # stop at index r-1
if A[j] <= pivot:
i += 1
A[i], A[j] = A[j], A[i]
i += 1
A[i], A[r] = A[r], A[i]    # bring pivot element to front
return i

With partition at hand, we are ready to give a recursive implementation of quicksort:

def quicksort_aux(A, p, r):
if p < r:
q = partition(A, p, r)
quicksort_aux(A, p, q-1)
quicksort_aux(A, q+1, r)

def quicksort(A):
quicksort_aux(A, 0, len(A) - 1)

#### Complexity of quicksort

The complexity of quicksort with randomly chosen pivot is:

• $$O(n \log n)$$ expected time complexity, which is often observed in practice
• $$O(n^2)$$ worst-case time complexity, which is rarely the case in practice

The memory/space complexity is $$\Theta(n)$$. Importantly, and in contrast to mergesort, the above algorithm never allocates any additional arrays.

### Quickselect

Before we introduce quickselect, we formally state the selection problem.

$$\texttt{Select}(A, k)$$: given an unsorted array $$A$$ with $$n$$ elements, find its $$k$$-th smallest element.

Examples:

1. $$k = 1$$: find the minimum element of $$A$$
2. $$k = n$$: find the maximum element of $$A$$
3. $$k = \frac{n}{2}$$: find the median of $$A$$.

Finding the minimum and maximum elements of $$A$$ takes time $$\Theta(n)$$, by looping over the array and keeping track of the minimum / maximum encountered so far. The selection problem for arbitrary $$k$$ does not admit such an obvious algorithm, but does admit a worst-case $$O(n)$$ algorithm. Here, we will present an expected-time $$O(n)$$ algorithm, using the same partitioning procedure that we built quicksort on.

To demonstrate, suppose $$k = \frac{n}{2}$$, so that we are looking for the median of the array. Again, pick a pivot $$x \in A$$ and partition $$A$$ into $$A^{-}$$ and $$A^{+}$$ like before. Then, observe the following:

• if $$| A^{-} | > \frac{n}{2}$$ , there are at least $$n / 2$$ elements $$\le x$$, so the median must also be $$\le x$$, and located in $$A^{-}$$.
• if $$| A^{+} | > \frac{n}{2}$$, there are at least $$n / 2$$ elements $$> x$$, so the median must also be $$> x$$, and located in $$A^{+}$$.

You should take some time to check these properties yourselves. If you are convinced that they hold, note that the above hints towards a recursive algorithm. In the first case, we can solve the median finding problem on $$A^{-}$$; however, in the second case, since we will be discarding the $$|A^{-}|$$ smallest elements of the original array, we are no longer looking for the median of $$A^{+}$$, but rather we are looking for the $$(\frac{n}{2} - |A^{-}|)$$-th smallest element of $$A^{+}$$.

The Python code for quickselect follows below. As with quicksort, we need to know the start and end indices, p and r, of the subarray we are processing at each step.

def quickselect_aux(A, p, r, k):
if p == r:
return A[p]
i = partition(A, p, r)          # index of the split
s = i - p + 1                   # size of lower half of partition
if k == s:
return A[i]
elif k < s:
return quickselect_aux(A, p, i - 1, k)
else:
return quickselect_aux(A, i+1, r, k - s)

def quickselect(A, k):
return quickselect_aux(A, 0, len(A) - 1, k)

#### Complexity of quickselect

The quickselect algorithm requires a linear amount of memory, $$\Theta(n)$$, and does not need to allocate any additional memory (just like quicksort). Its time complexity ranges between $$O(n)$$ and $$O(n^2)$$:

• Worst-case: $$O(n^2)$$ - even when $$k = 1$$!
• Average-case: $$O(n)$$ - sequence of bad pivots very unlikely.

Note: It turns out that choosing the pivot in quickselect can be done in a principled way that guarantees a more-or-less even split into subproblems. The procedure is called median-of-medians and the resulting algorithm bootstraps itself using quickselect. You can find more details in these slides.

## Part 2: Algorithms for data streams

We now consider algorithms for data streams. So far, all algorithms we have examined operate under the following assumptions:

1. the input size (i.e. number of "items" in the input) is bounded for some integer $$n$$, and
2. the aforementioned number $$n$$ is small enough so that it is possible to store the data in memory.

However, data streams often violate both of these assumptions. Formally, a data stream is a sequence of items

$x_0, x_1, x_2, \dots$

which are presented to us in a sequential fashion (by convention, the subscript $$i$$ in $$x_i$$ denotes the time step at which $$x_i$$ becomes available). Moreover, even when the stream is finite, it is common to assume its size is prohibitive to allow storing it in its entirety.

• Sampling: an important task is selecting $$k \ll n$$ elements at random from a (huge) stream of length $$n$$. This task is closely related to query optimization in databases.
• Counting: counting the number of events in a stream (e.g. number of items $$x_i$$ falling in an interval $$[a, b]$$) or the number of distinct elements in a stream.

Because of the aforementioned limitations, memory / space complexity becomes important here!

### Sampling elements from a stream

Let us try and motivate an important algorithm for sampling, by working our way up from the simplest approach.

#### A naive algorithm

Suppose that the stream has length $$n$$ and that we are actually able to store it in its entirety. Then the obvious algorithm is to store the stream in an array $$A$$, and generate $$k$$ indices in $$\{1, \dots, n\}$$ at random. However, this suffers from the obvious limitations:

1. as $$n \to \infty$$, storing becomes more and more "expensive", even though $$k$$ may remain fixed
2. the above algorithm does not efficiently produce "intermediate" solutions: if someone queries us for $$k$$ random elements among the stream encountered so far, we must generate $$k$$ "fresh" random indices. However, we would like to have to do minimal preprocessing to return an intermediate solution, even if the stream is not over.

#### A better solution using random tags

To arrive at a better solution, we make the following observation: suppose that each element was instead a random number, uniformly distributed in $$[0, 1]$$. In that case, selecting $$k$$ elements at random would be equivalent to selecting the $$k$$ smallest elements, since every subset of size $$k$$ has the same probability of being the subset of $$k$$ smallest elements.

This general principle will be very useful for the forthcoming algorithm. Indeed, we adopt the following approach: we generate a random "tag" (number uniformly distributed in $$[0, 1]$$) for each element $$x_i$$ in the stream at the time it becomes available. Since all tags are i.i.d., we have reduced our problem to maintaining the $$k$$ smallest elements of a stream over time. This is possible using a max-heap.

The algorithm is outlined below.

Algorithm: 1. Initialize a max-heap to hold elements, using their tags as keys. 1. At step $$i$$, generate $$\mathsf{tag}_i \sim \mathrm{Unif}[0, 1]$$. 1. For all $$i \in \{0, \dots, k - 1\}$$, insert the elements $$(\mathsf{tag}_i, x_i)$$ into the heap. 1. For any $$i \geq k$$, "peek" at the maximum-tag element of the heap. Then: * if the maximum tag is smaller than $$\mathsf{tag}_i$$, leave the heap unchanged. * otherwise, remove the max-tag element from the heap and insert $$(\mathsf{tag}_i, x_i)$$.

###### Example

Consider a stream [1, 5, 10, 13, 9] and suppose we want to sample $$k = 2$$ elements at random. We go through the following steps:

1. Initialize heap = [].
2. Next element: 1 - $$\mathsf{tag}_0 = 0.1$$. Insert into heap: heap = [(0.1, 1)].
3. Next element: 5 - $$\mathsf{tag}_1 = 0.71$$. Insert into heap: heap = [(0.71, 5), (0.1, 1)].
4. Next element: 10 - $$\mathsf{tag}_2 = 0.27$$. Since we already have $$k$$ elements in the heap, run check:
5. $$\mathsf{tag}_2 < 0.71$$, the max-tag in the heap. Therefore, we pop and insert: heap = [(0.27, 10), (0.1, 1)].
6. Next element: 13 - $$\mathsf{tag}_3 = 0.39$$. The maximum tag in the heap is smaller, so we leave it unchanged.
7. Next element: 9 - $$\mathsf{tag}_4 = 0.07$$. Run check:
8. $$\mathsf{tag}_4 < 0.27$$, the max-tag in the heap. Therefore, pop and insert: heap = [(0.1, 1), (0.07, 9)].
9. Return the two elements from the heap, since the stream is over: [1, 9].

Let us now examine the complexity of the above algorithm. * Time: $$O(\log k)$$ work at each step, since the most costly operations are popping / inserting an item. * Space: $$\Theta(k)$$, since we are only storing the heap. This is optimal.

However, the $$O(\log k)$$ time spent on each step looks a little bit awkward. After all, we are only replacing one element at a time! Indeed, there is an algorithm doing $$O(1)$$ work at each step.

#### Reservoir sampling

The reservoir sampling algorithm avoids the use of random tags and attains the optimal time and space complexity.

Let us first illustrate the $$k = 1$$ case. We initialize a temporary storage $$R = [x_0]$$. When $$x_n$$ arrives:

• with probability $$\frac{1}{n}$$, we replace the element in storage with $$x_n$$: $$R = [x_n]$$.
• with probability $$1 - \frac{1}{n}$$, we throw $$x_n$$ away.

Note

To simulate an event that happens with $$\frac{1}{n}$$ probability, we can use the following trick:

1. Generate a random number $$Z \sim \mathrm{Unif}[0, 1]$$.
2. Return $$\mathbf{1}\{Z \leq \frac{1}{n}\}$$.

The indicator function above will be $$1$$ with probability $$\frac{1}{n}$$.

The above admits a straightforward generalization for $$k$$ elements. We initialize a temporary storage $$R$$ using the first $$k$$ elements of the stream. Then:

• when $$x_n$$ arrives, we replace one element in $$R$$ with $$x_n$$, with probability $$\frac{k}{n}$$.
• the element in $$R$$ that is replaced by $$x_n$$ is chosen uniformly at random.

An example follows below.

###### Example

In the example below, $$k = 1$$. Stream is shown on the left (blue indicates next element), and temporary storage is shown on the right.

##### Why reservoir sampling works

We sketch the proof for the case $$k = 1$$ below. We wish to prove that having read $$n$$ elements from the stream, every element appears in our storage $$R$$ with equal probability $$\frac{1}{n}$$. The proof is by induction on $$n$$, and generalizes in a straightforward manner for $$k > 1$$.

1. Base case: $$n = 1$$. This is trivial, since we always include the first element in the storage.

2. Inductive hypothesis: suppose the claim is true up to some arbitrary index $$n \in \mathbb{N}$$, i.e. $$\mathbb{P}(R = [x_i]) = \frac{1}{n}, \; \forall i \leq n$$.

3. Step: consider the next index $$n + 1$$. Then:

4. $$\mathbb{P}(R = [x_{n+1}]) = \frac{1}{n+1}$$, since we always include $$x_i$$ with prob. $$\frac{1}{i}$$.

5. the probability of leaving $$R$$ unchanged is $$1 - \frac{1}{n+1} = \frac{n}{n+1}$$. By our inductive hypothesis, every other element appears in $$R$$ at time step $$n$$ with probability $$\frac{1}{n}$$. Therefore, the probability that a fixed element $$x_i \neq x_{n+1}$$ appears in $$R$$ at step $$n + 1$$ is $$\frac{n}{n+1} \cdot \frac{1}{n} = \frac{1}{n+1}$$.