Note: The Tool is Online Here
I have been reading a bit recently about so-called Bayesian Bandits, as described by Ted Dunning. The problem is to pick a strategy for playing N slot machines some number of times, when the probability of winning on each machine is unknown. It has a number of interesting applications in display advertising, news article recommendation, and click-through-rate prediction (Agrawal and Goyal, 2012). As Dunning notes, any solution must effectively handle the explore/exploit trade-off. The implementation in this case - using Thompson sampling - is straightforward, and I can see how it could smoothly allow for quick updating as more data becomes available.
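To make the idea concrete, here is a minimal Thompson-sampling sketch in Python (illustrative only - the tool itself is JavaScript, and the payout probabilities below are made up). Each bandit keeps a Beta(wins + 1, losses + 1) posterior over its payout rate; on every pull we draw one sample from each posterior and play the bandit with the highest sample:

```python
import random

# Hidden payout probabilities (assumed values for this sketch).
true_probs = [0.10, 0.45, 0.25]
wins = [0] * len(true_probs)
losses = [0] * len(true_probs)

random.seed(42)
for _ in range(2000):
    # Draw one sample from each bandit's Beta posterior; play the best draw.
    samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
               for i in range(len(true_probs))]
    choice = samples.index(max(samples))
    # Pull the chosen bandit and record a hit or a miss.
    if random.random() < true_probs[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

pulls = [wins[i] + losses[i] for i in range(len(true_probs))]
```

The explore/exploit balance falls out for free: bandits with uncertain posteriors occasionally produce high samples (exploration), while the posterior of the best bandit increasingly dominates (exploitation).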
After seeing a video by Dunning showing an R simulation of this algorithm, and a simulation by Cam Davidson-Pilon for his online book Bayesian Methods for Hackers on GitHub, I put together a simple tool that lets you explore how the algorithm works. I started from the d3 code in Davidson-Pilon's example for the charts, although I use the jStat library for the beta distribution calculations, and web workers run the simulation itself to keep the interface responsive.
Here's a screencast demo on YouTube at http://www.youtube.com/watch?v=p09LI_jwAjs
An Online Tool To Explore Bayesian Bandits
You can watch as the algorithm "decides" which bandit to pick next, and you can see how it reacts when you change the probabilities mid-stream (either by entering new values or by dragging your mouse left/right in the input box).
Each Hit/Miss cell flashes and then fades from green/red as the corresponding bandit is tried and either pays off or not. This conveys whether the algorithm is "exploring" or "exploiting" at any given moment.
As Dunning notes, the algorithm can very quickly find the best bandit when the probabilities are disparate - it's impressive.
After many pulls, once the algorithm has homed in on the best bandit, it can be fairly slow to react to changes in the probabilities. A possible solution might be some mechanism that forces it to "forget" behavior past a certain (potentially rolling) point in time (a la Davidson-Pilon's suggested learning-rate approach). The same applies to bandits it may have ruled out after many pulls of poor performance, but which might actually have improved since they were last "given a chance". Graepel, Candela, Borchert, and Herbrich (2010) at Microsoft, while using a (much) more sophisticated scheme involving Gaussians for click-through-rate prediction in sponsored search (apparently based to some degree on the TrueSkill ranking system used to match players for Xbox Live games), use a correction term that makes the posterior distributions converge back to the prior given zero data and infinite time - i.e., poor performance is slowly forgotten, and a bandit gets better and better chances to "get back in the game" and try again.
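One simple way to sketch such forgetting - a hypothetical variant for illustration, not what the tool or the Microsoft paper implements - is to decay every bandit's Beta counts by a constant factor on each pull, so old evidence fades and each posterior drifts back toward the Beta(1, 1) prior in the absence of new data:

```python
import random

# Assumed decay rate: 1.0 recovers the ordinary (non-forgetting) update.
# With decay d, the total evidence is capped near 1 / (1 - d) pulls,
# so the algorithm stays responsive to mid-stream probability changes.
DECAY = 0.99

def pull_and_update(wins, losses, true_probs, rng):
    # Thompson step: sample each posterior, play the highest draw.
    samples = [rng.betavariate(wins[i] + 1, losses[i] + 1)
               for i in range(len(true_probs))]
    choice = samples.index(max(samples))
    hit = rng.random() < true_probs[choice]
    # Forgetting step: shrink all accumulated evidence before updating,
    # including bandits that were ruled out long ago.
    for i in range(len(wins)):
        wins[i] *= DECAY
        losses[i] *= DECAY
    if hit:
        wins[choice] += 1
    else:
        losses[choice] += 1
    return choice
```

Because every count decays toward zero, a long-ignored bandit's posterior widens again over time, which gives it a renewed chance of producing the winning sample - the same qualitative effect as the converge-back-to-the-prior correction described above.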
Feel free to let me know if you get a chance to play with this, and/or if you run into issues with the implementation.
- On July 10, 2013, I addressed the apparent underflow issues in the beta function calculation for the displayed charts (everything is now calculated in log space). This allows the simulation to run for much longer: the maximum number of pulls is now 150,000 instead of 1,500, which, since things update ten times a second, corresponds to a simulation of a little more than four hours of clock time. Though I'm not sure longer runs are needed for this tool in this context, the pull limit will likely be removed shortly, assuming no further issues appear at the larger values.
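For the curious, the log-space idea can be sketched as follows (in Python rather than the tool's actual jStat/JavaScript code): the normalizing constant B(a, b) underflows in direct form once the Beta counts reach the tens of thousands, but computed via log-gamma it stays well-behaved.

```python
import math

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x, computed entirely in log space."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm

# After ~150,000 pulls a bandit's counts can be enormous; the log-space
# density still exponentiates to a sensible chart value near the mode.
density_at_mode = math.exp(log_beta_pdf(0.5, 75000, 75000))
```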