Mark E. Lewis
Industrial and Operations Engineering Department
University of Michigan
1205 Beal Avenue
Ann Arbor, MI 48109-2117
William L. Cooper
Department of Mechanical Engineering
University of Minnesota
111 Church Street S.E.
Minneapolis, MN 55455
Shane G. Henderson
Industrial and Operations Engineering Department
University of Michigan
1205 Beal Avenue
Ann Arbor, MI 48109-2117
Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations exactly, SBPI employs simulation to estimate a solution to these equations. For unichain average-reward Markov decision processes with finite state and action spaces, we provide easily verifiable conditions that ensure that SBPI almost surely eventually never leaves the set of optimal decision rules. We analyze three simulation estimators for solutions to the average evaluation equations. Using our general results, we derive simple conditions on the simulation run lengths that guarantee the almost-sure convergence of the algorithm.
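For orientation, recall that for a unichain MDP under a fixed decision rule $d$, the average evaluation equations are $g + h(s) = r(s, d(s)) + \sum_{s'} p(s' \mid s, d(s))\, h(s')$ for every state $s$, where $g$ is the gain (long-run average reward) and $h$ is a bias (relative value) function. The sketch below illustrates the overall SBPI scheme in Python; it is not the paper's implementation. It uses one crude regenerative estimator of $(g, h)$ (cycles that restart at a reference state assumed recurrent under every policy), and the growing cycle-count schedule `10 * (k + 1) ** 2` is an illustrative stand-in for the run-length conditions the paper derives. All function names here are hypothetical.

```python
import numpy as np

def estimate_evaluation(P, r, policy, s_ref, n_cycles, rng):
    """Crude regenerative estimate of the gain g and a bias vector h
    solving the average evaluation equations for a fixed decision rule.
    Cycles begin and end at s_ref (assumed recurrent under the policy).
    h(s) is estimated as the mean reward accumulated from the first visit
    to s in a cycle until the cycle ends, minus g times the mean number
    of steps over that same stretch; h(s_ref) is approximately 0."""
    n_states = P.shape[1]
    cyc_reward, cyc_length = [], []
    rew_from = np.zeros(n_states)    # reward from first visit to cycle end
    time_from = np.zeros(n_states)   # steps from first visit to cycle end
    visits = np.zeros(n_states)      # number of cycles visiting each state
    for _ in range(n_cycles):
        path, rewards = [], []
        s = s_ref
        while True:
            a = policy[s]
            path.append(s)
            rewards.append(r[a, s])
            s = rng.choice(n_states, p=P[a, s])   # sample next state
            if s == s_ref:
                break
        cyc_reward.append(sum(rewards))
        cyc_length.append(len(path))
        # Suffix sums give the reward from each position to the cycle end.
        suffix_r = np.cumsum(rewards[::-1])[::-1]
        seen = set()
        for i, state in enumerate(path):
            if state not in seen:                 # first visit only
                seen.add(state)
                rew_from[state] += suffix_r[i]
                time_from[state] += len(path) - i
                visits[state] += 1
    g = sum(cyc_reward) / sum(cyc_length)         # ratio estimator of gain
    h = np.zeros(n_states)
    pos = visits > 0
    h[pos] = (rew_from[pos] - g * time_from[pos]) / visits[pos]
    return g, h

def sbpi(P, r, s_ref=0, n_iters=30, seed=0):
    """Simulation-based policy iteration: replace exact policy evaluation
    with the simulation estimates above, then improve greedily.
    P has shape (n_actions, n_states, n_states); r has shape
    (n_actions, n_states). The cycle count grows with the iteration
    index, in the spirit of the paper's run-length conditions (the
    specific schedule below is an assumed illustrative choice)."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    for k in range(n_iters):
        n_cycles = 10 * (k + 1) ** 2              # assumed growing schedule
        g, h = estimate_evaluation(P, r, policy, s_ref, n_cycles, rng)
        # Policy improvement: greedy w.r.t. the estimated bias values,
        # q[a, s] = r(s, a) + sum_{s'} p(s' | s, a) h(s').
        q = r + P @ h
        policy = np.argmax(q, axis=0)
    return policy, g

# Tiny illustration on a random 4-state, 2-action MDP; Dirichlet rows are
# strictly positive, so the chain is irreducible and cycles terminate.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4), size=(2, 4))   # P[a, s] is a distribution
r = rng.uniform(0.0, 1.0, size=(2, 4))       # r[a, s]
print(sbpi(P, r))
```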