We establish strong laws of large numbers (SLLNs) and central limit theorems (CLTs) for the regret of two of the most popular bandit algorithms: Thompson sampling and UCB. Our characterizations of the regret distribution complement the characterizations of its tail recently developed by Fan and Glynn (2021) (arXiv:2109.13595). The tail characterizations there are associated with atypical bandit behavior on trajectories where the optimal arm mean is underestimated, leading to misidentification of the optimal arm and large regret. In contrast, the SLLNs and CLTs here describe the typical behavior and fluctuation of regret on trajectories where the optimal arm mean is well estimated. We find that Thompson sampling and UCB satisfy the same SLLN and CLT, and that the asymptotics of both the SLLN and the (mean) centering sequence in the CLT match the asymptotics of the expected regret. Both the mean and the variance in the CLT grow at rate $\log(T)$ with the time horizon $T$. Asymptotically as $T \to \infty$, the variability in the number of plays of each suboptimal arm depends only on the rewards received for that arm, which indicates that each suboptimal arm contributes independently to the overall CLT variance.
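As a rough illustration of the quantity being studied, the sketch below simulates regret for a UCB-type algorithm on a small bandit instance. It is not taken from the paper: it assumes the textbook UCB1 index (empirical mean plus $\sqrt{2\log t / n}$), Bernoulli rewards, and arm means chosen only for the example; the function name `ucb1_regret` is likewise hypothetical.

```python
import math
import random


def ucb1_regret(means, horizon, seed=0):
    """Run textbook UCB1 on Bernoulli arms (an assumed setup, not the paper's).

    Returns (pseudo-regret, play counts per arm).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of plays of each arm
    sums = [0.0] * k   # total reward collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            # index = empirical mean + exploration bonus sqrt(2 log t / n)
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # gap of the arm that was played
    return regret, counts


if __name__ == "__main__":
    means = [0.5, 0.6]  # hypothetical arm means, chosen for illustration
    for horizon in (1_000, 10_000):
        runs = [ucb1_regret(means, horizon, seed=s)[0] for s in range(10)]
        avg = sum(runs) / len(runs)
        # columns: horizon, average regret, average regret / log(horizon)
        print(horizon, round(avg, 2), round(avg / math.log(horizon), 2))
```

The pseudo-regret here is the sum over suboptimal arms of the gap times the number of plays, which is why the abstract can state its results in terms of the number of plays of each suboptimal arm. The last printed column is only a rough guide to the $\log(T)$ scaling at horizons this short.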
