# Reservoir sampling¶

Reservoir sampling is an algorithm to choose a random \(k\)-size subset of \(N\) elements, where \(N\) is very large and possibly unknown. Throughout we assume that \(k \le N\).

## Warmup: small \(N\)¶

Suppose we have a list of \(N\) elements from which we would like a random sample of size \(k\) drawn uniformly at random. If \(N\) is known and small enough that the list of \(N\) elements can fit into main memory, we could find our subset as follows. Generate a random permutation of the \(N\) elements using a Fisher-Yates shuffle, and then choose the first \(k\) elements from the list. Since we began with a random permutation of all \(N\) elements, the probability of any \(k\)-size subset being in the first \(k\) elements is the same. The Fisher-Yates shuffle is shown below.

```
SHUFFLE(xs)
let n = length xs
for i = 1 to n - 1
let j = draw uniformly at random from {i, i + 1, ..., n}
swap xs[i], xs[j]
```

This algorithm takes \(O(n)\) time to shuffle \(n\) elements. We can improve the runtime to \(O(k)\) where we only need to shuffle the first \(k\) elements as follows.

```
SHUFFLE(xs, k)
let n = length xs
for i = 1 to k
let j = draw uniformly at random from {i, i + 1, ..., n}
swap xs[i], xs[j]
```

The sampling algorithm proposed so far takes \(O(k)\) time and \(O(N)\) space, and requires knowledge of the value of \(N\).

## Large or unknown \(N\)¶

Our algorithm proceeds as follows. As we process each element of the list we maintain a reservoir of \(k\) elements which represents our current choice of a random size \(k\) subset should the list have no additional elements. Whenever we encounter a new element in the list, we replace an element of the reservoir with the new element with the appropriate probability. The pseudocode is as follows.

```
SAMPLE(k, list)
// Initialization; get the first k elements
for i = 1 to k
let r[i] = list[i]
// Inductive step
let i = k + 1
while list has an element at i
let x = list[i]
with probability k / i
let j = draw uniformly at random from {1, 2, ..., k}
let r[j] = x
let i = i + 1
```

To prove that the algorithm is correct, we maintain the following inductive invariant: after viewing \(i\) elements, our reservoir contains a size \(k\) subset of the first \(i\) elements drawn uniformly at random.

To begin, we note that the invariant holds for \(i = k\) because of our initialization routine. We now show that it holds for general \(i \ge k\). Let \(i > k\) be given, and assume the invariant has held up to processing element \(i - `\). Since there are \(i\) elements in total and we are selecting \(k\), the probability that any one element is in the subset is \(k / i\). Our algorithm clearly chooses the new element for inclusion with the appropriate probability. Additionally, each of the first \(i - 1\) elements are in the subset with probability \(k / (i - 1)\) by our invariant. Any such element is replaced by the new element with probability \((k / i) / k = 1 / i\). Thus the probability any one of the first \(i - 1\) elements are in the subset after this iteration is \(k / (i - 1) \cdot (1 - 1 / i) = k / i\), as required.

We have shown that the marginal probability of any one element being in the subset is correct; however, we need to show that the probability of any \(k\)-size subset being chosen is equal. Let any size \(k\) subset of the first \(i\) elements be given. We need that the probability of choosing this subset is \(\binom{i}{k}^{-1}\). There are two cases to consider: either element \(i\) is in the subset, or not. In the first case we have from our invariant that the probability of this subset being chosen is

as required. Otherwise, we need all but the \(i\text{th}\) element to be in the previous subset, which happens with probability \((i - k) \binom{i - 1}{k}^{-1}\) since there are \(i - k\) \(k\)-size subsets containing the first \(k - 1\) elements of our group. We then need to replace the one bad element with the new element, which happens with probability \((k / i) \cdot (1 / k)\). In total, the probability we have the given subset after the \(i\text{th}\) element has been processed is

as required.