Robert Floyd's Tiny and Beautiful Algorithm for Sampling without Replacement

During the big Powerball drawing last week, I put together a simple web app to let you see how long it might take for your numbers to win. While a simple thing, it is surprising to see how long it generally takes to hit the power ball.  At odds of about one in 175 million, playing twice a week, it takes roughly 1,700,000 years to win, on average.  Sure, someone's going to win (and did last week), but I imagine that there were probably at least 175 million tickets sold.

Anyway, for the little javascript web app, I of course needed to be able to simulate drawing the numbers for the lottery, many many times.

This is a remarkably elementary and basic thing to need to do, right? A few lines of code real quick on the way to getting something going. The small irritation is that this is sampling without replacement, so you need to make sure and deal with potential repeats. No big deal, just kind of irritating.

I came across an elegant little algorithm due to late Robert Floyd for choosing a set S of M unique random samples from a population of size N:
    initialize set S to empty
    for J := N-M + 1 to N do
      T := RandInt(1, J)
      if T is not in S then
        insert T in S
      else
        insert J in S
For the case of drawing the first five numbers for the powerball, this reduces to
    initialize set S to empty
    for J := 55 to 59 do
      T := RandInt(1, J)
      if T is not in S then
        insert T in S
      else
        insert J in S

Does this really work? You'll get five different numbers in the right range, but is it statistically equivalent to drawing five numbers without replacement? I saw a proof by induction and some other fairly technical proofs, but I think the basic idea is intuitive.

We will do five draws, and every number has 5 chances to be drawn. However, for some of the draws, the numbers above 55 have more than once chance to be drawn.

Let ni be the ith number drawn.
  • First draw (n1), a random number between 1 and 55
  • Second draw (n2), a random number between 1 and 56, with 56 having TWO chances to be drawn:
    • if 56 is drawn
    • if n1 is drawn
  • Third draw (n3), a random number between 1 and 57, with 57 having THREE chances to be drawn:
    • if 57 is drawn
    • if n1 is drawn
    • if n2 is drawn
  • Fourth draw (n4), a random number between 1 and 58, with 58 having FOUR chances to be drawn:
    • if 58 is drawn
    • if n1 is drawn
    • if n2 is drawn
    • if n3 is drawn
  • Fifth draw (n5), a random number between 1 and 59, with 59 having FIVE chances to be drawn:
    • if 59 is drawn
    • if n1 is drawn
    • if n2 is drawn
    • if n3 is drawn
    • if n4 is drawn
So, each number has five chances to be drawn:
  • The numbers at or below 55 have a chance to be drawn each of the five draws
  • The number 56 has two chances to be drawn on the second draw, and one for each of the three draws thereafter
  • The number 57 has three chances to be drawn on the third draw, and one for each of the two draws thereafter
  • The number 58 has four chances to be drawn on the fourth draw, and one the last draw
  • The number 59 has five chances to be drawn on the fifth and final draw
This is of course trivial to implement, too.

The tiny algorithm seems a wonderful and efficient thing, and I can see why it would be included in a Jon Bentley's Programming Pearls (which I discovered via this post on StackOverflow).

And Robert Floyd was an amazing person.  A child prodigy.  Without a Ph.D., he became chairman of the computer science department at Stanford in the 70's, and significantly influenced the likes of Don Knuth, who in turn had Robert Sedgewick as a doctoral student, who himself is now championing "Algorithms for the Masses" from Princeton via Coursera courses online.

It is cool what one can learn from implementing little projects.

No comments:

Post a Comment

Popular Posts