Musings on the Bound for Expected Regret for Thompson Sampling

I came across a talk at by Emilie Kaufman on Bayesian Bandits, and she mentioned the current upper bound on the expected regret for Thompson sampling, which I had not looked at in detail yet.  It was derived by Shipra Agrawal and Navin Goyal at Microsoft in this 2012 paper - see Theorem 2.  This bound is O(#ln(pulls)), but the bound is also inversely related to the difference between each sub-optimal bandit's probability of success and that of the optimal bandit.  This makes intuitive sense, and jives with what you see happen when the algorithm runs: when the optimal bandit is much better than the others, the algorithm converges very fast.  If some of the bandits have a probability of success close to that of the optimal bandit, it takes longer to converge, as it will try the (slightly) suboptimal bandits more frequently for a longer period, and hence regret increases.

If i is the difference between the ith bandit's probability of success and that of the optimal bandit, then in Agrawal and Goyal's paper the bound itself actually involves the square of the sum over 1/i2.   I have not looked in detail at the proof yet, but on its surface this specific dependence does not make as much sense to me, since if the probabilities are close to the optimal arm (and hence some of the i's are small), then there isn't much difference between the payoffs anyway.  Sure, it might take longer for the algorithm to find the optimal bandit, but it doesn't seem to be paying too much of a price to do so (assuming that the reward is proportional to the probability of success).  However, Agrawal and Goyal specifically state that "it seems harder to improve the dependence on the 's", so there are likely subtleties here.

Popular Posts