Sunday, April 15, 2018

Published 9:00 PM by with 0 comment

Sticker Collecting Problem

If you buy a box of cereal and get a sticker, there are 5 unique sticker types, and they occur with equal probability, how many boxes of cereal do you need to buy to get all 5 sticker types?
I started thinking about this problem this morning for whatever reason, and it ended up being harder than I'd guessed it would be. I still haven't actually solved parts of it that I wanted but I have enough that it's worth writing up.

First...what's the expected number of boxes to buy? There are many interpretations of that question, but the simplest is literally what's the expectation value? You have 5 stickers that you need. The first box of cereal will give you one sticker, and since you have none it will obviously be unique. The next box will give you a sticker that has a 4/5 chance of being unique. This will happen until you get a second sticker. The third will have a 3/5 chance of being unique. This repeats until you are finished.

Since each box's result is independent and the probability for each iteration of this (first sticker, second sticker, etc.) is constant throughout the iteration, each of these follow a geometric distribution. Thus...the expected time for each iteration is just 1/probability. From the numbers above, that's 5/5 for the first sticker (i.e., first box gets it), 4/5 for the second one, and so on. That gives a total number of boxes equal to 5/5 + 5/4 + 5/3 + 5/2 + 5/1. Factoring out the 5, that's 5*(1 + 1/2 + 1/3 + 1/4 + 1/5). Since 5 is number of stickers and this holds for any number of stickers, replacing 5 with n gives n*sum(1/i) for i = 1 to n as the general expectation value. The table below has this calculated for several values of n:

# of stickersaverage # of cereal boxes
35.5
511.4
718.2
1029.3
1549.8

For a different interpretation...what's the most likely number of boxes you have to buy? If this were a normal distribution, you'd expect that to be the same as the previous table. It isn't. Below is the distribution from 50,000 trials buying cereal boxes until you have all 5 stickers:



Another metric is 'what number of cereal boxes yields the full set of stickers for at least 50% of the trials?' and I've plotted that below for the same criteria:



Finally...here's the same thing for 7 stickers.




It's interesting to note that both of those are less than the expectation value. This is because the distribution is shifted towards lower numbers of cereal boxes and in some cases you end up having to buy a huge number of cereal boxes. Averaging in the cases with the huge numbers of boxes shifts the expectation value to the right of the median.

Also worth noting that you obviously can't get n stickers in m boxes if m < n, so the first one that has any successes is the nth box. The probability of that working is just n/n * (n-1)/n*...*1/n. Written more concisely, that's n!/n^n. For 5 stickers, that works out to ~3.8% which is what the figure above shows. For 7 stickers, that's ~0.61%.

If you want to play around with this, you can start with my code:

python code here



      edit

0 comments:

Post a Comment