# ORIE 4520 - Stochastics at Scale

#### Course Description

We will study a collection of interesting stochastic algorithms and models, that serve to illustrate the following idea:

*Size is a critical consideration in the design of useful stochastic models and algorithms.*

A formal way to express this is through the notion of *scaling* - given a model/algorithm for some system, how does it behave when we grow some aspect of the system. This course will try to build intuition behind the importance of scaling by presenting examples where understanding scaling is crucial for good system design.

- Course Syllabus: pdf
- Introductory lecture

#### Homeworks

- Homework 1 (Due at 12 noon on Tuesday, 8th September)
- Homework 2 (Due at 12 noon on Friday, 18th September)
- Homework 3 (Due at 12 noon on Monday, 28th September)
- Homework 4 (Due at 12 noon on Wednesday, 7th October)
- Homework 5 (Due at 12 noon on Tuesday, 20th October)
- Homework 6 (Due at 12 noon on Wednesday, 18th November)
- Homework 7 (Due at 12 noon on Friday, 4th December)

#### Lectures

- Unit 1: Basic probability and randomized algorithms (Notes)

- Lecture 1: Introduction, linearity of expectation, indicator random variables, coupon collector
- Lecture 2: Birthday paradox, law of total probability and deferred decisions, verifying matrix multiplication (Freivald's algorithm)
- Recitation 1: More applications of linearity of expectation - Fixed points of random permutations (derangements), triangles in random graphs, k-length runs in DNA sequences
- Lecture 3: Quicksort, introduction to graph cuts
- Lecture 4: Simple randomized Max-cut, Min-cut (the CONTRACT algorithm)
- Lecture 5: Analysis of CONTRACT, the Karger-Stein algorithm

- Unit 2: Concentration inequalities

- Lecture 6: Markov's and Chebyshev's inequalities, union bound, concentration for coupon collector
- Lecture 7: Randomized selection
- Lecture 8: The Chernoff bound, basic “sampling theorem”
- Recitation 2: Completed analysis of randomized sampling, concentration for Quicksort
- Lecture 9: Chernoff bound for Rademacher variables (random walk on a line), randomized set balancing
- Lecture 10: Maximum loaded bin

- Unit 3: Randomized algorithms for big data

- Lecture 11: Hashing - introduction, chaining, fingerprinting, the Bloom filter
- Lecture 12: Hashing (contd) - Bloom filter analysis, perfect hashing, the FKS hashing scheme
- Lecture 13: Consistent Hashing, intro to distributed hash tables (Wikipedia articles: Bloom filter; FKS scheme; Consistent hashing)
- Lecture 14: Locality Sensitive Hashing - introduction and similarity metrics, LSH for Hamming distance
- Lecture 15: Locality Sensitive Hashing - Jaccard similarity, MinHash, probability amplification
- Lecture 16: Locality Sensitive Hashing for L
^{2}distance, the Johnson-Lindenstrauss Lemma
Some references:
- The chapter on similar items in the MMDS textbook has a good introduction to the topic - Section 3.6 onwards covers LSH. You can also see the slides from Jure Leskovec's course - part 1, part 2.
- Wikipedia articles: LSH; MinHash; JL Lemma.
- For the most recent information on LSH, see Alex Andoni's website on LSH and its applications.
- Lecture 17: Streaming - Reservoir sampling, Morris' approximate counting
- Lecture 18: Streaming - Frequency estimation, heavy-hitters (the Count-Min sketch)
- Lecture 19: Streaming - Moment estimation (the AMS sketch), distinct elements (Flajolet-Martin sketch) Some references:
- The chapter on streaming in the MMDS textbook, and associated slides - part 1, part 2. These are a great introduction, although there is some mismatch between the topics covered.
- Wikipedia articles: Streaming algorithms; Reservoir sampling; Morris counter; Count-Min sketch.
- Website on the Count-Min sketch and its applications.
- Lecture 20: F2 estimation via linear sketches. Introduced graph sketches
- Lecture 21: All-pairs distance sketch (from the Das Sarma et al. paper)

- Prelim exam (Solutions)
- Unit 4: Large-scale stochastic models and phase transitions

- Lecture 22: The Galton-Watson branching process
- Lecture 23: Phase transition in the GW branching process
- Lecture 24: DFS in the GW branching process. The giant component in the G(n,p) random graph
- Lecture 25: Revision of discrete-time Markov Chains. The Gambler's Ruin problem
- Lecture 26: Revision of the Poisson process and continuous-time Markov Chains
- Lecture 27: Rumor spreading in a clique (the SI epidemic model)
- Lecture 28: The SIS epidemic: introduction
- Lecture 29: The SIS epidemic: phase-transition on the clique
- Lecture 30: The SIS epidemic: phase-transition on general graphs (via coupling and graph conductance)
- Lecture 31-32: CTMC transition-rate matrix and Kolmogorov equations
- Lecture 33: Diff-eqn approximations to MCs (Phase transition in the G(n,N) graph)
- Lecture 34: Diff-eqn approximations to MCs (SIS epidemic threshold and network spectral radius)
- Lecture 35: The power of two choices
- Lecture 36: Oblivious routing on hypercubes - the Valiant scheme
- Lecture 37: Review

#### Course Information

**Lectures**: MWF 1.25pm-2.15pm, Phillips 403, Map**Instructor**: Siddhartha Banerjee, 229 Rhodes Hall, email**Teaching Assistant**: Anna Srapionyan, email

#### Course Material

There is no required textbook for the course. Different topics will be covered from different sources, and notes and links to the relevant material will be periodically posted here. In particular, the course will cover selected topics from the following textbooks:- There are two good textbooks for the first unit of the course:
*Randomized Algorithms*by Rajeev Motwani and Prabhakar Raghavan*Probability and Computing*by Michael Mitzenmacher and Eli Upfal

- A good introduction for the material in the second unit is
*Mining of Massive Datasets*by Jure Leskovec, Anand Rajaraman and Jeff Ullman. An online version is freely available on the book's webpage. - A good introduction for some of the material in the third unit is
*Networks, Crowds and Markets*(Sections V, VI) by David Easley and Jon Kleinberg. For a more technical treatment, see*Epidemics and Rumours in Complex Networks*by Moez Draief and Laurent Massoulie. Online versions are again available.

#### Prerequisites

Knowledge of basic probability (at the level of ORIE 3500): in particular, random variables, conditional probability and expectation, common probability distributions and their properties (binomial, geometric, exponential, Poisson, Gaussian). The latter part of the course (after the prelim) will require knowledge of stochastic processes, in particular, Markov chains (at the level of ORIE 3510). There will be a recitation session covering the essentials, and students may be able to manage without the required background. Prior exposure to algorithms and graph theory will also be useful, but not essential. Send a mail to the instructor if you are concerned about having the appropriate prerequisites.