Policy learning from historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, and advertisements to send to customers, and selecting medications to prescribe to patients. However, the existing literature rests on the crucial assumption that the future environment in which the learned policy is deployed is the same as the past environment that generated the data. This assumption is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn distributionally robust policies from imperfect observational data. First, we present a policy evaluation procedure that assesses how well a policy performs under worst-case environment shifts. We then establish a central limit theorem-type guarantee for this policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that learns policies robust to adversarial perturbations and unknown covariate shifts, with performance guarantees based on the theory of uniform convergence. Finally, we empirically test the proposed algorithm on synthetic datasets and show that it provides the robustness that standard policy learning algorithms lack. We conclude by providing a comprehensive application of the method to real-world voting datasets.
