Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. For a finite-horizon episodic Markov decision process with $S$ states, $A$ actions, and horizon length $H$, the regret of an optimal algorithm is known to scale on the order of $\sqrt{H^2SAT}$ (modulo logarithmic factors), where $T$ is the total number of samples. Several competing solution paradigms have been proposed to minimize regret, but they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4\,\mathrm{poly}(H)$ for existing model-free methods).
To overcome this large sample size barrier to efficient RL, we design a memory-efficient model-free algorithm that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves upon any prior memory-efficient algorithm that is asymptotically regret-optimal by a factor of at least $S^5A^3$. Leveraging a recently introduced variance reduction strategy (also known as {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences maintaining upper and lower confidence bounds. The design principle of early-settled variance reduction is likely of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
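To make the update structure concrete, the following is a minimal sketch (not the algorithm analyzed in this paper; the constants, exploration bonus, and settling threshold below are placeholders) of how two Q-learning sequences with upper and lower confidence bounds can be paired with a reference value that is frozen, i.e., {\em settled}, early:
\begin{verbatim}
import numpy as np

# Hedged sketch: two Q-learning sequences (UCB and LCB) plus a reference value
# used for variance reduction, frozen early once the two bounds are close.
# All constants and the bonus term are illustrative placeholders.

S, A, H = 5, 3, 10                 # states, actions, horizon (toy sizes)
SETTLE_GAP = 1.0                   # placeholder threshold for freezing the reference

Q_ucb = np.full((H, S, A), float(H))                               # optimistic Q estimates
Q_lcb = np.zeros((H, S, A))                                        # pessimistic Q estimates
V_ucb = np.concatenate([np.full((H, S), float(H)), np.zeros((1, S))])  # V[H] = 0 (terminal)
V_lcb = np.zeros((H + 1, S))
V_ref = V_ucb.copy()               # reference values; frozen once "settled"
settled = np.zeros((H, S), dtype=bool)
visits = np.zeros((H, S, A), dtype=int)

def q_update(h, s, a, r, s_next):
    """One transition's worth of UCB/LCB Q-learning with a reference term."""
    visits[h, s, a] += 1
    n = visits[h, s, a]
    lr = (H + 1) / (H + n)         # horizon-rescaled learning rate
    bonus = np.sqrt(H**3 / n)      # placeholder Hoeffding-style exploration bonus

    # Reference-advantage style target: reference value plus an advantage
    # correction (the lower-variance averaging of the reference term used in
    # the actual algorithm is omitted here for brevity).
    adv_up = V_ucb[h + 1, s_next] - V_ref[h + 1, s_next]
    adv_lo = V_lcb[h + 1, s_next] - V_ref[h + 1, s_next]
    Q_ucb[h, s, a] = (1 - lr) * Q_ucb[h, s, a] + lr * (r + V_ref[h + 1, s_next] + adv_up + bonus)
    Q_lcb[h, s, a] = (1 - lr) * Q_lcb[h, s, a] + lr * (r + V_ref[h + 1, s_next] + adv_lo - bonus)

    # Monotone value updates keep V_ucb an upper bound and V_lcb a lower bound.
    V_ucb[h, s] = min(V_ucb[h, s], Q_ucb[h, s].max())
    V_lcb[h, s] = max(V_lcb[h, s], Q_lcb[h, s].max())

    # Early settling: once the confidence interval at (h, s) is narrow, freeze
    # the reference instead of letting it keep tracking V_ucb.
    if not settled[h, s]:
        V_ref[h, s] = V_ucb[h, s]
        if V_ucb[h, s] - V_lcb[h, s] <= SETTLE_GAP:
            settled[h, s] = True
\end{verbatim}
Under this kind of rule the reference stops changing as soon as the gap between the upper and lower confidence bounds at a state is small, rather than after the value estimates have nearly converged; this early settling is the mechanism intended to keep the initial burn-in cost low.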