We study nonparametric estimation of the value function of an infinite-horizon, $\gamma$-discounted Markov reward process (MRP) from observations of a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0, 1)$ as special cases. Our bounds capture the dependence on the Bellman fluctuation, the mixing time of the Markov chain, model misspecification, and the choice of weighting function that defines the estimator itself, and they reveal some delicate interactions between mixing time and model misspecification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is comparable to that obtained from i.i.d. sample transition pairs; under misspecification, however, the temporal dependence in the data inflates the statistical error. This degradation can be mitigated by increasing the look-ahead. We complement our upper bounds by proving minimax lower bounds that establish the optimality of TD-based methods with appropriately chosen look-ahead and weighting, and that reveal some fundamental differences between value function estimation and ordinary nonparametric regression.
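To make the $K$-step look-ahead TD estimate concrete, here is a minimal tabular sketch run on a single trajectory. The paper's estimators are kernel-based and nonparametric; the tabular update rule, the toy two-state chain, the constant step size, and the function name `k_step_td` below are all illustrative assumptions, not the paper's actual method.

```python
def k_step_td(trajectory, K, gamma, alpha, n_states):
    """Tabular K-step look-ahead TD on a single trajectory.

    trajectory: list of (state, reward) pairs observed in order.
    The target for V(s_t) is the K-step return: the discounted sum of
    the next K rewards plus a bootstrapped tail value gamma^K * V(s_{t+K}).
    """
    V = [0.0] * n_states
    for t in range(len(trajectory) - K):
        s_t = trajectory[t][0]
        # Discounted K-step reward sum ...
        G = sum(gamma**i * trajectory[t + i][1] for i in range(K))
        # ... plus the bootstrapped value of the state K steps ahead.
        G += gamma**K * V[trajectory[t + K][0]]
        V[s_t] += alpha * (G - V[s_t])
    return V

# Toy deterministic two-state chain: 0 -> 1 -> 0 -> ..., reward = state index.
traj = []
s = 0
for _ in range(20000):
    traj.append((s, float(s)))
    s = 1 - s

gamma = 0.5
V = k_step_td(traj, K=3, gamma=gamma, alpha=0.05, n_states=2)
# For this chain the true values solve V(0) = 0.5*V(1), V(1) = 1 + 0.5*V(0),
# giving V(0) = 2/3 and V(1) = 4/3; the estimate converges to these.
```

Since the toy chain is deterministic, the TD fixed point coincides with the true value function; on stochastic trajectories the estimate carries the statistical error that the bounds above quantify.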


