Random Number Generators

There are many programming applications that make use of random numbers, including areas in statistics and simulation.

Pseudorandom number generators

Generating a truly random sequence of numbers is not possible without some sort of stochastic oracle. As a result, the goal is to generate a deterministic sequence of numbers that satisfies all the statistical properties of a random sequence. Routines that do this are referred to as pseudorandom number generators (RNGs).

There are many variants of RNGs, each with their own benefits and drawbacks. The simplest of these is a linear congruential generator (LCG). To generate a uniformly random \(b\)-bit integer, LCGs define the following recursion

\[X_{n+1} = ( aX_n + c ) \mod 2^b\]

for parameters \(a\) and \(c\), and an initial seed \(X_0\). Obviously, implementations like these are extremely efficient, but many suffer from autocorrelation problems if the parameters are not well-tuned. The Mersenne Twister is a more sophisticated RNG, and is among most prevalent.

Coding structure

Random number generators can complicate the structure and testing of one’s code. For instance, how can one test a function that outputs non-deterministic output? Here is a simple example that estimates \(\pi\) from estimating area ratio from a circle inscribed in a square.

import numpy as np

def monte_carlo_pi(iterations):

   total = 0
   for _ in range(iterations):
      vec = np.random.random(2)
      dst = np.linalg.norm(vec)
      total += (dst <= 1)

   return 4*total/iterations

Although this code is functional and short, it lacks modularity and testability. An alternative structure that has the same functionality is provided below.

import numpy as np

def inscribed(unifs):
   radii = np.linalg.norm(unifs,axis=1)
   bools = (radii <= 1)
   return bools

def pi_est(iterations,rng=None,seed=None):
    if rng is None:
       rng = np.random.RandomState(seed)

    unifs = rng.uniform(size=(iteration,2))
    bools = inscribed(unifs)

    return 4*np.mean(bools)

The updated version of the code contains numerous benefits compared to the previous version.

  • Using vector operations

    Given the option between loops and vector operations, the latter is almost always more efficient.

  • Using separate deterministic functions

    Modularity is king, and since deterministic functions are much easier to test, the best practice is to split code into as many smaller, deterministic subroutines as possible.

  • Passing the seed

    The most critical component to a valid non-deterministic routine is allowing the user to pass a seed to the random number generator. This allows the user to test the code more easily.

  • Passing the RNG object

    Sometimes, it is desirable to use a specific object to generate the pseudorandom variates. This allows the user the option to pass her own generator objects, giving the user the ability to fully control the sequence. For example, the user can pass a dummy object that always returns 0.2 when queried for a uniform random variable. This has the added benefit of avoiding the use of global variables.

These are basic guidelines for coding non-deterministic routines. Although this may only seem to work for smaller examples, these same rules can be applied on a much larger scale.

Random (sub)streams

Many times, there is not only one source of randomness, but many, and there should be a systematic way of dealing with these. To demonstrate a way to do this, we revisit the reservoir sampling example.

import numpy as np
from itertools import islice

def reservoir_sample(file_name,sample_size,replace_rng=None,
                     loc_rng=None,replace_seed=None,loc_seed=None):

   if replace_rng is None:
      replace_rng = np.random.RandomState(replace_seed)
   if loc_rng is None:
      loc_rng     = np.random.RandomState(loc_seed)

   sample = np.empty(sample_size,dtype=str)

   with open(file_name,'r') as infile:
      for k,line in enumerate(islice(infile,sample_size)):
         sample[k] = line

      for k,line in enumerate(islice(infile,sample_size,None)):
         replace_prob = replace_rng.uniform()
         loc_replace  = loc_rng.randint(sample_size)

         if replace_prob < sample_size/k:
            sample[loc_replace] = line

   return sample

This code uses two RNGs for two random variates. This is used for more control when testing. The cost for using two separate generators is ensuring that the pseudovariates satisfy the independence conditions of the two random variables. This is not an automatic guarantee, but there are pseudorandom generators that can guarantee this. In the simulation literature, this is equivalent to using two substreams. One example of a statistically safe implementation of random streams is RngStreams by Pierre L’Ecuyer.

Parallel computing

Any concerns regarding non-determinism in code should compound in the context of parallel computing. In many cases, the same non-deterministic code is evaluated multiple times to perform an estimate. However, independence between outcomes run in parallel can break down quickly.

For example, if the seed is not initialized by the user, it is usually initialized using the time on the local machine. Now imagine the difference between running non-deterministic code sequentially and running it in parallel. The sequential code will run as expected, but running the routine in parallel will result in nearly identical outputs because the initialized seeds will be nearly identical, due to the nearly identical start time.

To avoid situations like these, it is crucial to add a seed as input for non-deterministic code. Again, doing this for any RNG will not necessarily guarantee independence among variates. To handle this, a random stream should be passed to each instance of the routine. These random streams will have an independence guarantee. Further, any independent random variates in the routine should be granted a random substream from the passed random stream in each instance.