# Multisection: Parallelized Bisection

### Effectively Using Asynchronous Information

Stephen Pallone
Peter Frazier
Shane Henderson

School of Operations Research and Information Engineering,
Cornell University

# Bisection Algorithm

### Goal: Find the zero of a deterministic, one-dimensional monotonic function.

• We can use the classical bisection method to locate the zero.
• Bisection can be viewed as an optimization algorithm (e.g. finding where the derivative is zero).
• If we had multiple processors, how could we accelerate the algorithm?
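As a point of reference, classical single-processor bisection can be sketched as follows (an illustrative implementation, not from the talk; we assume $$f$$ is monotone increasing with a sign change on the bracket):

```python
def bisect(f, lo, hi, tol=1e-10):
    """Classical bisection: f monotone increasing with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid (or at mid)
    return 0.5 * (lo + hi)
```

Each iteration uses one function evaluation and halves the interval, which is exactly the sequential baseline multisection tries to beat.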

# Multisection Algorithm

### Goal: Find the zero of a deterministic, one-dimensional monotonic function.

• Multisection: query several points simultaneously, reducing the size of the interval at each iteration.
• We allow for the possibility of asynchronous function evaluations.
• How can we find an optimal querying strategy and immediately make use of asynchronous information?
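In the fully synchronous case, one multisection round can be sketched as follows (an illustrative helper under our assumptions: $$f$$ monotone increasing, equally spaced queries); the surviving interval has length $$\tfrac{1}{n+1}$$ of the original:

```python
def multisection_step(f, lo, hi, n):
    """One synchronous multisection round: query n equally spaced
    interior points and return the sub-interval bracketing the root
    (f monotone increasing with f(lo) < 0 < f(hi))."""
    xs = [lo + (i / (n + 1)) * (hi - lo) for i in range(1, n + 1)]
    signs = [f(x) < 0 for x in xs]  # evaluated in parallel in practice
    new_lo, new_hi = lo, hi
    for x, neg in zip(xs, signs):
        if neg:
            new_lo = max(new_lo, x)   # root is to the right of x
        else:
            new_hi = min(new_hi, x)   # root is to the left of x
    return new_lo, new_hi
```

The asynchronous setting studied in the talk is harder: evaluations return one at a time, so the spacing of still-running queries matters.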

# Modeling Computational States

### Keeping track of unfinished jobs and assigning new queries

• Let $$m$$ denote the number of unfinished jobs, and $$n$$ the total number of processors.
• If there are $$m$$ points currently being queried, we want to allocate the remaining $$n-m$$ cores.

# Modeling Computational States

### Keeping track of unfinished jobs and assigning new queries

• State Variables: Current interval and jobs running:
$$S = (S^{(0)},\dots,S^{(m+1)})$$, where $$S^{(i)}\leq S^{(i+1)}$$ for all $$i$$.
• The current interval is denoted by $$(S^{(0)},S^{(m+1)})$$, and the other components denote $$m$$ jobs that are still running.
• Decision Variables: Allocation of all cores:
$$X = (X^{(0)},\dots,X^{(n+1)})$$, where $$X^{(i)}\leq X^{(i+1)}$$ for all $$i$$.
• The current interval is denoted by $$(X^{(0)},X^{(n+1)})$$, and the other components denote the $$n$$ jobs that will be running after the allocation is made.

# Modeling Transitions

### How allocations transition to computational states

• We assume a uniform prior on the location of the root $$X_*$$ in the interval $$[0,1]$$.
• At decision epoch $$j$$, one or more function evaluations will complete.
• If we find at decision epoch $$j$$ that $$X_* \in (X^{(\ell_j)},X^{(u_j)})$$, terminate all jobs $$i$$ where $$X^{(i)} \geq X^{(u_j)}$$ or $$X^{(i)} \leq X^{(\ell_j)}$$.
• Let $$S_j = \left(X_{j-1}^{(i)}:\, X_{j-1}^{(\ell_j)} \leq X_{j-1}^{(i)} \leq X_{j-1}^{(u_j)} \right)$$.
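This transition can be sketched as follows (an illustrative helper; the sign convention, $$f$$ monotone increasing, is our assumption, and `None` marks a query that is still running):

```python
def transition(x, signs):
    """Given a sorted allocation x[0..n+1] and, for each interior query i,
    signs[i] = -1 or +1 (sign of f at x[i]) if it has completed, or None
    if it is still running, return the surviving state S: the components
    of x lying in the tightest bracket around the root."""
    lo, hi = x[0], x[-1]
    for xi, s in zip(x[1:-1], signs):
        if s is None:
            continue          # job still running; no information yet
        if s < 0:
            lo = max(lo, xi)  # root to the right of xi
        else:
            hi = min(hi, xi)  # root to the left of xi
    return [xi for xi in x if lo <= xi <= hi]
```

Jobs outside the new bracket are the ones terminated; the survivors become the running components of $$S_j$$.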

# Defining the Objective Function

### Algorithmic Parameters

• Risk Parameter: $$r \in [0,1]$$
• Reward Function: $$R(S) = \|S\|^{-r}$$, where $$\|S\|$$ denotes the length of the interval.
• Random Exponentially Distributed Stopping Time $$\tau$$
• Computational Budget: $$\gamma = \mathbb{P}(\, \text{next arrival} < \tau\,) \in (0,1)$$

### Objective: given the risk preference and computational budget, maximize the expected terminal reward (equivalently, drive down the interval length).

$$\sup_\pi \mathbb{E}^\pi\left[ R(S_\tau) \right]$$

# Modeling Waiting Times for Queries

### Deterministic and Equal Waiting Times

• All function evaluations return simultaneously at each decision epoch $$j$$.
• We immediately allocate all $$n$$ cores afterwards.

### IID Exponential Waiting Times

• Exactly one function evaluation returns with probability one.
• Any one of the $$n$$ evaluations is equally likely to be the first to return a value.
• Memoryless Property ensures waiting times are exponentially distributed with same parameter.

# Dynamic Programming

### Value Function

• By conditioning on the event that $$\tau$$ comes before the next arrival, we can write the optimal value function recursively as
$$V(S) = (1-\gamma)\, R(S) + \gamma \max_{X \in \mathbb{X}(S)} \mathbb{E}\left[ V(S_1) \mid X_0 = X,\, S_0 = S \right].$$

### Intuition

• With probability $$(1-\gamma)$$, we run out of computational budget, and we earn reward $$R(S)$$.
• With probability $$\gamma$$, the next arrival comes before we run out of time, so we are able to reallocate all processors that finish.

# Theorem 1: Shifting and Scaling

Suppose that $$R(S) = \|S\|^{-r}$$ for $$r\in (0,1]$$. For $$c,h\in\mathbb{R}$$ with $$h>0$$, the value function $$V$$ has the following properties:

1. Shift Invariance: $$V(S+c) = V(S)$$
2. Inverse Scaling: $$V(h S) = h^{-r}\, V(S)$$

We can now focus solely on normalized states $$S$$ with $$S^{(0)}=0$$ and $$S^{(m+1)}=1$$.
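The normalization licensed by Theorem 1 is mechanical; a minimal sketch (our illustrative helper, not from the talk):

```python
def normalize(s):
    """Map a state s (sorted, with s[0] < s[-1]) onto the unit interval.
    Returns the normalized state and the scale h = ||s||; by shift
    invariance and inverse scaling, V(s) = h**(-r) * V(normalized s)."""
    lo, h = s[0], s[-1] - s[0]
    return [(x - lo) / h for x in s], h
```

This is what lets the analysis track only the pattern of interior queries rather than the raw interval.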

# Deterministic Waiting Times

### Key Observation: only one possible normalized state.

• For the case of deterministic waiting times, we reallocate all cores at every decision epoch because they finish simultaneously.

# Corollary 1: Equally Spaced Queries are Optimal

Suppose function evaluations take a deterministic and equal amount of time to compute, and let $$R(\cdot)$$ be defined as before. If $$X_j = (X_j^{(0)},X_j^{(1)},\dots,X_j^{(n+1)})$$ is defined as $X_j^{(i)} = A_j + \frac{i}{n+1} \left(B_j-A_j\right),$ then $$X_j$$ is an optimal allocation for $$S_j = (A_j,B_j)$$.
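A quick sanity check (our computation, assuming $$\gamma\,(n+1)^r < 1$$ so the series converges): under the equally spaced policy the normalized interval shrinks by a factor of $$n+1$$ at every epoch, and the number of completed epochs before $$\tau$$ is geometric with parameter $$\gamma$$, so

$$\|S_j\| = (n+1)^{-j}, \qquad \mathbb{E}\left[R(S_\tau)\right] = (1-\gamma)\sum_{j=0}^{\infty} \gamma^j\,(n+1)^{jr} = \frac{1-\gamma}{1-\gamma\,(n+1)^r}.$$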

# IID Exponential Waiting Times

• For IID Exponential Waiting Times, with probability one, exactly one query will complete at each decision epoch $$j$$.
• Define the two sub-states
$$X_+^{(i)} = \left(X^{(k)}:\, X^{(k)} \geq X^{(i)} \right), \qquad X_-^{(i)} = \left(X^{(k)}:\, X^{(k)} \leq X^{(i)} \right).$$
• Exactly $$2n$$ possible outcomes: cut to the left, or cut to the right, for all $$n$$ queries.
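Under this model the one-step expected reward of a normalized allocation can be computed directly; a sketch under our assumptions (uniform prior on $$[0,1]$$, reward $$R(S)=\|S\|^{-r}$$):

```python
def expected_next_reward(x, r):
    """One-step expected reward E[R(S_1)] for a sorted normalized
    allocation x = [0, x1, ..., xn, 1] under IID exponential waiting
    times: each of the n queries finishes first with probability 1/n,
    and under the uniform prior the root lies left of x[i] w.p. x[i]."""
    n = len(x) - 2
    total = 0.0
    for i in range(1, n + 1):
        # x[i] * x[i]**(-r) + (1 - x[i]) * (1 - x[i])**(-r)
        total += (x[i] ** (1 - r) + (1 - x[i]) ** (1 - r)) / n
    return total
```

Note that for $$r=1$$ this one-step quantity equals 2 for every allocation, so the immediate cut cannot distinguish policies; what matters is how the surviving queries are positioned for later epochs.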

# To Stack or Not to Stack

### Stacking Policies

• A stacking policy allows multiple processors to query the same point.
• Allows for allocations $$X$$ such that $$X^{(i)} = X^{(i+1)}$$ for some $$i=0,\dots,n$$.

### Why would anyone do this?

• Useful when waiting times are extremely variable (e.g. exponentially distributed).
• Incentive: ideally, all jobs focused on a single query can be terminated early as soon as the first of those cores finishes.
• Example: cloud computing, where core reliability and processing times are highly variable.

# Theorem 3: Midpoint Stacking Policy

When stacking is allowed, stacking all queries at the midpoint is the optimal allocation for independent exponentially distributed waiting times with a common rate. Equivalently, for states $$S_j = (A_j,B_j)$$, the optimal allocation $$X_j$$ is such that $X_j^{(i)} = A_j + \frac{1}{2}(B_j - A_j) \quad\quad \forall i=1,\dots,n.$
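Under midpoint stacking each completion halves the interval regardless of which side the root falls on, so a geometric-series computation gives (our calculation, assuming $$\gamma\,2^r < 1$$):

$$\|S_j\| = 2^{-j}, \qquad \mathbb{E}\left[R(S_\tau)\right] = (1-\gamma)\sum_{j=0}^{\infty} \gamma^j\,2^{jr} = \frac{1-\gamma}{1-\gamma\,2^r}.$$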

# Golden Section Policy (n=2)

### We can always choose allocation $$X_{GS}$$ by setting

$$X_{GS}^{(1)} = \frac{1}{2}(3 - \sqrt{5}) \quad\quad X_{GS}^{(2)} = \frac{1}{2}(\sqrt{5}-1).$$

### Is it an optimal non-stacking policy?

• For $$r=1$$ (risk-neutral), the Golden Section Policy is an optimal allocation.
• For $$r<1$$ (risk-averse), it may not be optimal, but performs extremely well.
• Reminiscent of Golden Section Search (Kiefer 1953).
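The self-similarity that makes this allocation stationary is easy to verify numerically (an illustrative check of the golden-section identities, not code from the talk):

```python
import math

# Golden-section allocation on [0, 1] (n = 2 interior queries).
x1 = (3 - math.sqrt(5)) / 2   # approx. 0.3820
x2 = (math.sqrt(5) - 1) / 2   # approx. 0.6180

# If the query at x2 reveals the root lies in (0, x2), the surviving
# query x1 renormalizes to x1 / x2 -- which is again x2.
assert abs(x1 / x2 - x2) < 1e-12

# Symmetrically, if the root lies in (x1, 1), x2 renormalizes to
# (x2 - x1) / (1 - x1) -- which is again x1.
assert abs((x2 - x1) / (1 - x1) - x1) < 1e-12
```

Either way the cut goes, the surviving query already sits at a golden-section point of the new interval, so only one fresh query needs placing.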

# Larger Numbers of Cores (n>2)

### Can we find policies similar to Golden Section for larger values of $$n$$?

• In general, we cannot rely on one allocation to define a policy.
• What if we instead look for a policy that visits a finite number of states?

### Consider the case n=3:

$$Y = \left(0,\frac{1}{3},\frac{1}{2},\frac{2}{3},1\right) \quad\quad Z = \left(0,\frac{1}{4},\frac{1}{2},\frac{3}{4},1 \right)$$
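One way to make the two-state idea concrete (an illustrative check with cuts chosen by us, not an exhaustive verification of the policy): a completion in $$Y$$ can renormalize to $$Z$$ once the freed core is reallocated, and vice versa.

```python
def renormalize(points, lo, hi):
    """Keep the points inside [lo, hi] and map them onto [0, 1]."""
    return [(p - lo) / (hi - lo) for p in points if lo <= p <= hi]

Y = [0.0, 1/3, 1/2, 2/3, 1.0]
Z = [0.0, 1/4, 1/2, 3/4, 1.0]

# In Y, the query at 1/3 completes and reveals the root is in (1/3, 1):
after_cut = renormalize(Y, 1/3, 1.0)      # one core is now free
candidate_Z = sorted(after_cut + [3/4])   # place the free core at 3/4
assert all(abs(a - b) < 1e-12 for a, b in zip(candidate_Z, Z))

# In Z, the query at 1/2 completes and reveals the root is in (1/2, 1):
after_cut = renormalize(Z, 1/2, 1.0)      # two cores are now free
candidate_Y = sorted(after_cut + [1/3, 2/3])
assert all(abs(a - b) < 1e-12 for a, b in zip(candidate_Y, Y))
```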

# Concluding Remarks

### Complications of Asynchronous Function Evaluations

• There is a trade-off between maintaining a desirable spread of queries and aggressively reducing the size of the current interval.
• Parameters such as risk-aversion and computational budget do not play a role in typical bisection, yet are important for determining optimal allocations for Multisection.
• Waiting time models can greatly affect optimal policy structure.

### Future Research

• Consider more realistic and complex waiting times.
• Define a framework for a similar algorithm for stochastic root-finding.