Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. In this paper, we make progress towards a unified theory of RL using the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC) and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, respectively, thereby extending the original DEC, which captures only no-regret learning. We then design new unified sample-efficient algorithms for all three learning objectives. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a powerful and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves on the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that can be prohibitively large or designing problem-specific estimation subroutines. As an application, we recover existing sample-efficient learning results and obtain new ones for a wide range of tractable RL problems, using essentially a single algorithm. We also generalize the DEC to provide sample-efficient algorithms for all-policy model estimation, with applications to learning equilibria in Markov games. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms, based on posterior sampling and maximum likelihood estimation, and show that they enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC.
