# Reservoir sampling

Reservoir sampling is an algorithm to choose a random $$k$$-size subset of $$N$$ elements, where $$N$$ is very large and possibly unknown. Throughout we assume that $$k \le N$$.

## Warmup: small $$N$$

Suppose we have a list of $$N$$ elements from which we would like a random sample of size $$k$$ drawn uniformly at random. If $$N$$ is known and small enough that the list of $$N$$ elements can fit into main memory, we could find our subset as follows. Generate a random permutation of the $$N$$ elements using a Fisher-Yates shuffle, and then choose the first $$k$$ elements from the list. Since we began with a random permutation of all $$N$$ elements, the probability of any $$k$$-size subset being in the first $$k$$ elements is the same. The Fisher-Yates shuffle is shown below.

SHUFFLE(xs)
    let n = length xs
    for i = 1 to n - 1
        let j = draw uniformly at random from {i, i + 1, ..., n}
        swap xs[i], xs[j]
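The pseudocode above can be sketched in Python as follows (a minimal translation; the 0-indexed bounds are adjusted from the 1-indexed pseudocode):

```python
import random

def shuffle(xs):
    """In-place Fisher-Yates shuffle: every permutation of xs is equally likely."""
    n = len(xs)
    for i in range(n - 1):
        # Pick j uniformly from {i, i + 1, ..., n - 1}; xs[i] may swap with itself.
        j = random.randrange(i, n)
        xs[i], xs[j] = xs[j], xs[i]
```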


This algorithm takes $$O(n)$$ time to shuffle $$n$$ elements. Since we only need the first $$k$$ elements of the permutation, we can improve the runtime to $$O(k)$$ by shuffling only the first $$k$$ positions, as follows.

SHUFFLE(xs, k)
    let n = length xs
    for i = 1 to k
        let j = draw uniformly at random from {i, i + 1, ..., n}
        swap xs[i], xs[j]
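In Python, the truncated shuffle might look like this (a sketch; the function name `sample_small` is ours, and it returns the sampled prefix for convenience):

```python
import random

def sample_small(xs, k):
    """Return a uniform size-k sample of xs by shuffling only the first k slots.

    Mutates xs in place; the loop runs O(k) iterations.
    """
    n = len(xs)
    for i in range(k):
        # j uniform over {i, ..., n - 1}, so every remaining element is a candidate.
        j = random.randrange(i, n)
        xs[i], xs[j] = xs[j], xs[i]
    return xs[:k]
```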


The sampling algorithm proposed so far takes $$O(k)$$ time and $$O(N)$$ space, and requires knowledge of the value of $$N$$.

## Large or unknown $$N$$

Our algorithm proceeds as follows. As we process each element of the list we maintain a reservoir of $$k$$ elements which represents our current choice of a random size $$k$$ subset should the list have no additional elements. Whenever we encounter a new element in the list, we replace an element of the reservoir with the new element with the appropriate probability. The pseudocode is as follows.

SAMPLE(k, list)
    // Initialization; get the first k elements
    for i = 1 to k
        let r[i] = list[i]

    // Inductive step
    let i = k + 1
    while list has an element at i
        let x = list[i]
        with probability k / i
            let j = draw uniformly at random from {1, 2, ..., k}
            let r[j] = x
        let i = i + 1
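The reservoir pseudocode translates directly into Python operating on any iterable (a sketch; the name `reservoir_sample` is ours, and we assume the stream yields at least $$k$$ elements):

```python
import random

def reservoir_sample(stream, k):
    """Uniform size-k sample from an iterable of unknown length."""
    it = iter(stream)
    # Initialization: the first k elements fill the reservoir.
    reservoir = [next(it) for _ in range(k)]
    # Inductive step: element i (1-indexed) enters with probability k / i,
    # replacing a reservoir slot chosen uniformly at random.
    for i, x in enumerate(it, start=k + 1):
        if random.randrange(i) < k:  # true with probability k / i
            reservoir[random.randrange(k)] = x
    return reservoir
```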


To prove that the algorithm is correct, we maintain the following inductive invariant: after viewing $$i$$ elements, our reservoir contains a size $$k$$ subset of the first $$i$$ elements drawn uniformly at random.

To begin, we note that the invariant holds for $$i = k$$ because of our initialization routine. We now show that it holds for all $$i > k$$. Let $$i > k$$ be given, and assume the invariant held after processing element $$i - 1$$. Since there are $$i$$ elements in total and we are selecting $$k$$, the probability that any one element is in the subset should be $$k / i$$. Our algorithm clearly chooses the new element for inclusion with the appropriate probability. Additionally, each of the first $$i - 1$$ elements is in the subset with probability $$k / (i - 1)$$ by our invariant. Any such element is replaced by the new element with probability $$(k / i) \cdot (1 / k) = 1 / i$$. Thus the probability that any one of the first $$i - 1$$ elements is in the subset after this iteration is $$k / (i - 1) \cdot (1 - 1 / i) = k / i$$, as required.
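As an empirical sanity check of these marginals (a sketch, not part of the proof; `reservoir_sample` is our hypothetical implementation of the pseudocode), we can run many trials and verify that each element appears with frequency close to $$k / i$$:

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    it = iter(stream)
    reservoir = [next(it) for _ in range(k)]
    for i, x in enumerate(it, start=k + 1):
        if random.randrange(i) < k:  # keep element i with probability k / i
            reservoir[random.randrange(k)] = x
    return reservoir

# Sample k = 3 of i = 10 elements many times; each element should land in
# the reservoir with frequency near k / i = 0.3.
random.seed(0)
trials = 100_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), 3))
freqs = {x: counts[x] / trials for x in range(10)}
```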

We have shown that the marginal probability of any one element being in the subset is correct; however, we also need to show that every $$k$$-size subset is equally likely to be chosen. Let any size $$k$$ subset of the first $$i$$ elements be given. We need the probability of choosing this subset to be $$\binom{i}{k}^{-1}$$. There are two cases to consider: either element $$i$$ is not in the subset, or it is. In the first case, our reservoir after $$i - 1$$ elements must already equal the given subset, and element $$i$$ must then be rejected; by our invariant this happens with probability

$\binom{i - 1}{k}^{-1} (1 - k / i) = \frac{k! (i - k - 1)!}{(i - 1)!} \cdot \frac{i - k}{i} = \frac{k! (i - k)!}{i!} = \binom{i}{k}^{-1},$

as required. Otherwise, we need all but the $$i\text{th}$$ element to be in the previous reservoir, which happens with probability $$(i - k) \binom{i - 1}{k}^{-1}$$ since there are $$i - k$$ $$k$$-size subsets of the first $$i - 1$$ elements that contain the $$k - 1$$ old elements of our given subset. We then need the new element to replace the one element not in our given subset, which happens with probability $$(k / i) \cdot (1 / k)$$. In total, the probability we have the given subset after the $$i\text{th}$$ element has been processed is

$(i - k) \binom{i - 1}{k}^{-1} \frac{k}{i} \cdot \frac{1}{k} = \frac{(i - k)}{i} \cdot \frac{k! (i - 1 - k)!}{(i - 1)!} = \frac{k! (i - k)!}{i!} = \binom{i}{k}^{-1},$

as required.