Initial Thoughts on Multi-Arm Bandits and a Simple Solution

Note: I have implemented a simple online simulation tool here for exploring Bayesian Bandits.

I recently came across interesting applications of the so-called Multi-Arm Bandit problem.  This problem involves the challenge of picking a strategy of playing k slot-machines some number of times, where the probability of winning for each slot machine is unknown.  As this gels in my mind, I wanted to note some cool features of a solution for this... a solution that has been applied recently in google analytics for evaluating the benefits of changes to a web site, and as a suggested method to rank comments on the reddit site.

The multi-armed bandit problem itself seems focused on trying to figure out which particular bandit to play on your next turn, but as noted by James Neufeld, implicit in the solution is that there is also a ranking of bandits, and it is this ranking that can be utilized in other problem spaces.

The implementation - referred to as Thompson sampling or probability matching strategy - allows you to assume some prior distribution of the unknown success rates for each slot machine to begin with, and as each new play occurs, one obtains a more accurate estimate of the "true" success rate for the slot machine played.  The mathematics are simplified because the beta distribution is all that is needed in this particular case, and there is a simple and intuitive parametrization of the needed beta distributions based on the success/failure rate (either assumed, to start the process, or updated based on the successes seen so far).  At any time, we can obtain an estimate of the "true" success rate for each slot machine by sampling from its current beta distribution, and then rank them all by these values.  The one with the highest value would be the one you would "play next", but the ranking itself can be used to provide a snapshot ranking of all of the slot machines.  The random sampling from each slot machine that is performed each time we want to "see" the ranking results in a "jostling" that (I assume) prevents the ranking from getting too fixated: this is one way that the "explore" vs "exploit" dilemma is addressed.  It also appears that this approach very quickly pushes poor performers to the bottom of the list.

This all maps in a satisfying way to a couple of different problems
• For evaluating the effectiveness of changes to a web site by Google, each variation of the web site corresponds to a different slot machine.  Google uses (I assume) the conversion rate as the measure to find the "best" variant - Steven Scott at Google is one of the players here.
• For the case of ranking reddit comments on a post as suggested by James Neufeld, each comment itself corresponds to a different slot machine, and the probability of "hitting" that slot machine is estimated based on the number of up/down votes.  The ranking rather than just the "best" is of interest in this context
This recent blog post by Ted Dunning was the starting place for my current research.  He also has a sample in java (on github here) to show how easy it is to implement.  This article - Chapter 6 from Bayesian Methods for Hackers - includes more details on this approach, including a d3 visualization for a 3-bandit system.

There is much more to explore here.