Convergence of Simulation-Based Policy Iteration

Mark E. Lewis
Industrial and Operations Engineering Department
University of Michigan
1205 Beal Avenue
Ann Arbor, MI 48109-2117

William L. Cooper
Department of Mechanical Engineering
University of Minnesota
111 Church Street S.E.
Minneapolis, MN 55455

Shane G. Henderson
Industrial and Operations Engineering Department
University of Michigan
1205 Beal Avenue
Ann Arbor, MI 48109-2117

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI uses simulation to estimate a solution to these equations. For unichain average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions under which, with probability one, the decision rules generated by SBPI eventually enter, and never again leave, the set of optimal decision rules. We analyze three simulation estimators of solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation runlengths that guarantee almost-sure convergence of the algorithm.
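To make the algorithmic structure concrete, the following is a minimal Python sketch of simulation-based policy iteration; it is an illustration only and does not reproduce the paper's specific estimators or runlength conditions. It assumes rewards r[s, a], transition probabilities p[s, a, j], a user-supplied simulator step(s, a, rng) that draws the next state, and a reference state that is recurrent under every decision rule encountered. The evaluation step estimates a solution (g, h) of the average evaluation equations g + h(s) = r(s, d(s)) + sum_j p(j | s, d(s)) h(j) by regenerative simulation, and the improvement step is the usual greedy update.

import numpy as np

def estimate_evaluation(r, step, d, ref_state, n_cycles, rng):
    # Crude regenerative estimates of a solution (g, h) of the average
    # evaluation equations under decision rule d, normalized so h(ref_state) = 0.
    n_states = r.shape[0]

    # Estimate the long-run average reward g from cycles between visits to ref_state.
    total_reward, total_length = 0.0, 0
    for _ in range(n_cycles):
        s = ref_state
        while True:
            total_reward += r[s, d[s]]
            total_length += 1
            s = step(s, d[s], rng)
            if s == ref_state:
                break
    g_hat = total_reward / total_length

    # Estimate h(s) as the expected sum of r - g accumulated from s until the
    # chain next visits ref_state (independent replications for each state).
    h_hat = np.zeros(n_states)
    for s0 in range(n_states):
        if s0 == ref_state:
            continue
        total = 0.0
        for _ in range(n_cycles):
            s = s0
            while s != ref_state:
                total += r[s, d[s]] - g_hat
                s = step(s, d[s], rng)
        h_hat[s0] = total / n_cycles
    return g_hat, h_hat

def sbpi(r, p, step, runlengths, ref_state=0, seed=0):
    # Simulation-based policy iteration: the evaluation step is replaced by the
    # simulation estimate above; the improvement step uses the known transition
    # probabilities p[s, a, j].  runlengths[k] is the number of regenerative
    # cycles simulated at iteration k.
    rng = np.random.default_rng(seed)
    n_states, n_actions = r.shape
    d = np.zeros(n_states, dtype=int)          # arbitrary initial decision rule
    for n_cycles in runlengths:
        g_hat, h_hat = estimate_evaluation(r, step, d, ref_state, n_cycles, rng)
        q = r + p @ h_hat                      # q[s, a] = r(s, a) + sum_j p(j | s, a) h_hat(j)
        d = np.argmax(q, axis=1)               # greedy improvement w.r.t. the estimated bias
    return d

As in exact policy iteration, the improvement step here is deterministic given the model; only the evaluation step is noisy, which is why the choice of estimator and the growth of the runlengths govern whether the generated decision rules settle in the optimal set.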