Marginalized off-policy evaluation for reinforcement learning
Motivated by the many real-world applications of reinforcement learning (RL) that require safe policy iteration, we consider the problem of off-policy evaluation (OPE): evaluating a new policy using historical data collected by different behavior policies, under the model of nonstationary episodic Markov decision processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from a variance that grows exponentially in the RL horizon $H$. To address this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution of the target policy at every step. MIS achieves a mean-squared error of $O\big(H^2 R_{\max}^2 \sum_{t=1}^{H} \mathbb{E}_{\mu}\big[w_{\pi,\mu}(s_t,a_t)^2\big]/n\big)$ for large $n$, where $w_{\pi,\mu}(s_t,a_t)$ is the ratio of the marginal distribution at the $t$-th step under $\pi$ and $\mu$, $H$ is the horizon, $R_{\max}$ is the maximal reward magnitude, and $n$ is the sample size. This result nearly matches the Cramér-Rao lower bound for DAG MDPs of Jiang and Li in most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence on $H$ and on the second moment of the importance weight. Beyond its theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.
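To make the recursive idea concrete, the following is a minimal tabular sketch of a marginalized importance sampling estimator, not the paper's exact algorithm: all MDP sizes, policies, and helper names below are illustrative assumptions. It recursively propagates an estimate of the target policy's state marginal $d_t^\pi$ forward through the horizon, then reweights each observed reward by the marginal state ratio times the per-step action ratio $\pi(a|s)/\mu(a|s)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical toy nonstationary MDP (all sizes/names are assumptions) ---
S, A, H, n = 4, 2, 10, 5000
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # P[t, s, a] -> next-state dist
R = rng.uniform(0, 1, size=(H, S, A))          # deterministic rewards in [0, 1]
mu = np.full((S, A), 1.0 / A)                  # uniform behavior policy mu(a|s)
pi = rng.dirichlet(np.ones(A), size=S)         # target policy pi(a|s)
d0 = rng.dirichlet(np.ones(S))                 # initial state distribution

def rollout(policy):
    """Sample one H-step episode; returns lists of states, actions, rewards."""
    s = rng.choice(S, p=d0)
    states, actions, rewards = [], [], []
    for t in range(H):
        a = rng.choice(A, p=policy[s])
        states.append(s)
        actions.append(a)
        rewards.append(R[t, s, a])
        s = rng.choice(S, p=P[t, s, a])
    return states, actions, rewards

data = [rollout(mu) for _ in range(n)]  # historical data from behavior policy

# --- Simplified MIS estimator ---
# Recursively estimate the target policy's state marginals d_t^pi, then
# reweight observed rewards by d_t^pi(s)/d_t^mu(s) * pi(a|s)/mu(a|s).
v_mis = 0.0
d_pi = d0.copy()  # at t = 1 the marginals under pi and mu coincide
for t in range(H):
    st = np.array([ep[0][t] for ep in data])
    at = np.array([ep[1][t] for ep in data])
    rt = np.array([ep[2][t] for ep in data])
    d_mu = np.bincount(st, minlength=S) / n                    # empirical d_t^mu
    w_state = np.divide(d_pi, d_mu, out=np.zeros(S), where=d_mu > 0)
    rho = w_state[st] * pi[st, at] / mu[st, at]                # per-sample ratio
    v_mis += np.mean(rho * rt)
    # Propagate marginals: d_{t+1}^pi(s') ~= (1/n) sum_i rho_i 1{s_{t+1}^i = s'}
    if t < H - 1:
        s_next = np.array([ep[0][t + 1] for ep in data])
        d_pi = np.bincount(s_next, weights=rho, minlength=S) / n

# Monte-Carlo reference value from on-policy rollouts under pi
v_true = np.mean([sum(rollout(pi)[2]) for _ in range(5000)])
print(f"MIS estimate: {v_mis:.3f}, on-policy Monte Carlo: {v_true:.3f}")
```

Because the reweighting at step $t$ uses only the marginal state ratio rather than the product of all $t$ per-step action ratios, the estimator avoids the exponential-in-$H$ variance of trajectory-wise importance sampling.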