Contextual bandits are a go-to tool for active, sequential experimentation in the tech industry. They involve online learning algorithms that adaptively learn, over time, policies mapping observed contexts $X_t$ to actions $A_t$ in order to maximize stochastic rewards $R_t$. This adaptivity poses interesting but difficult statistical inference problems, especially counterfactual ones. For example, it is often of interest to infer the properties of a hypothetical policy that differs from the logging policy used to collect the data, a problem known as “off-policy evaluation” (OPE). We present a comprehensive framework for OPE inference using modern martingale techniques. It relaxes many unnecessary assumptions made in past work and provides significant theoretical and empirical improvements. Our methods are valid in a very general setting: they can be used while the original experiment is still running (i.e. not necessarily after the fact), when the logging policy is itself changing (because it is learning), and when the distribution is drifting over time. More specifically, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, as well as confidence bands for the entire CDF of the off-policy reward distribution. All our methods (a) are valid at arbitrary stopping times, (b) make only nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of the reward and weight distribution. In summary, our methods enable anytime-valid off-policy inference from adaptively collected contextual bandit data.
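To make the off-policy setup concrete, here is a minimal Python sketch of the basic importance-weighted (IPW) point estimate of a target policy's mean reward from logged bandit data. It is an illustration only, not the confidence-sequence method described above. The function names (`ipw_value`, `simulate_logs`, `target_prob`), the two-context, two-action environment, and the fixed uniform logging policy are all assumptions made for this example.

```python
import random


def ipw_value(logs, target_prob):
    """Importance-weighted estimate of the target policy's mean reward.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy assigned to
          the action it actually took.
    target_prob: function (context, action) -> probability of that action
          under the hypothetical target policy.
    """
    total = 0.0
    for x, a, r, p_log in logs:
        w = target_prob(x, a) / p_log  # importance weight
        total += w * r
    return total / len(logs)


def simulate_logs(n, seed=0):
    """Toy logged data (assumed environment): contexts and actions in {0, 1}.

    The logging policy picks each action with probability 0.5. The reward
    is Bernoulli with mean 0.8 when the action matches the context and
    0.2 otherwise.
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        x = rng.choice([0, 1])
        a = 1 if rng.random() < 0.5 else 0
        p_log = 0.5
        mean = 0.8 if a == x else 0.2
        r = 1.0 if rng.random() < mean else 0.0
        logs.append((x, a, r, p_log))
    return logs


def target_prob(x, a):
    """Hypothetical target policy: always play the action equal to the context."""
    return 1.0 if a == x else 0.0


logs = simulate_logs(20000)
estimate = ipw_value(logs, target_prob)
print(f"IPW estimate of target policy value: {estimate:.3f}")
```

In this toy environment the target policy's true mean reward is 0.8 by construction, so the printed estimate should land close to that value. A point estimate like this carries no uncertainty guarantee on its own, and that guarantee is what the confidence sequences in the abstract are meant to supply.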
