Causal Bandits: Online Decision-Making in Endogenous Settings
52 Pages · Posted: 22 Nov 2022 · Last revised: 10 May 2024
Date Written: November 16, 2022
Abstract
The deployment of Multi-Armed Bandits (MAB) has become commonplace in many marketing applications. However, regret guarantees for even state-of-the-art linear bandit algorithms rest on strong exogeneity assumptions on the arm covariates, i.e., that the covariates are uncorrelated with the unobserved random error. This assumption is frequently violated in practice, and using such algorithms can lead to sub-optimal decisions. Moreover, marketers are often also interested in the asymptotic distribution of the estimated parameters. To this end, we consider the problem of online learning in linear stochastic contextual bandit problems with endogenous covariates. We propose an algorithm, termed $\epsilon$-\textit{BanditIV}, that uses instrumental variables to correct for this bias, and we prove an $\tilde{\mathcal{O}}(k\sqrt{T})$ upper bound on the expected regret of the algorithm, where $k$ is the dimension of the instrumental variable and $T$ is the number of rounds. We further establish the asymptotic consistency and normality of the $\epsilon$-\textit{BanditIV} estimator. Extensive Monte Carlo simulations show that $\epsilon$-\textit{BanditIV} significantly outperforms existing methods in endogenous settings. Finally, using daily paid app download data from iOS and Real-Time Bidding (RTB) data, we demonstrate how $\epsilon$-\textit{BanditIV} can be used to simultaneously optimize online decision-making and estimate the causal impact of price and advertising, respectively, in these settings, and show that it performs favorably against alternative methods.
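To make the abstract's idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of an epsilon-greedy contextual bandit whose reward-parameter estimate is corrected with a two-stage least squares (IV) regression. The environment, the decaying exploration rate, and all names such as epsilon_bandit_iv_sketch are hypothetical stand-ins for the setting described above: arm covariates x are correlated with an unobserved error u, while the instrument z is correlated with x but not with u.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_stage_ls(Z, X, y):
    """Two-stage least squares: regress X on instruments Z, then y on fitted X."""
    # First stage: project the endogenous covariates onto the instruments
    first = np.linalg.lstsq(Z, X, rcond=None)[0]
    X_hat = Z @ first
    # Second stage: regress observed rewards on the fitted covariates
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def epsilon_bandit_iv_sketch(T=2000, n_arms=5, d=3, k=3, eps0=1.0):
    """Illustrative epsilon-greedy loop with an IV-corrected reward estimate."""
    beta_true = rng.normal(size=d)
    Z_hist, X_hist, y_hist = [], [], []
    beta_hat = np.zeros(d)
    total_reward = 0.0

    for t in range(1, T + 1):
        # Simulated arms: instrument z, confounder u, endogenous covariates x
        z = rng.normal(size=(n_arms, k))
        u = rng.normal(size=n_arms)
        x = z[:, :d] + 0.5 * u[:, None] + 0.1 * rng.normal(size=(n_arms, d))

        eps_t = min(1.0, eps0 / np.sqrt(t))   # decaying exploration rate
        if rng.random() < eps_t:
            a = int(rng.integers(n_arms))     # explore uniformly
        else:
            a = int(np.argmax(x @ beta_hat))  # exploit current IV estimate

        reward = x[a] @ beta_true + u[a] + 0.1 * rng.normal()
        total_reward += reward

        Z_hist.append(z[a]); X_hist.append(x[a]); y_hist.append(reward)
        if t > max(d, k):  # refit once enough observations have accumulated
            beta_hat = two_stage_ls(np.array(Z_hist), np.array(X_hist),
                                    np.array(y_hist))
    return beta_hat, total_reward
```

In this sketch, replacing the two-stage step with an ordinary least squares fit on (x, y) would absorb the confounder u into the parameter estimate, which is the bias the IV correction is meant to remove.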
Keywords: Multi-Armed Bandits, Causal Inference, Online Learning, Instrumental Variables