# [Tutor] Birthday paradox (was: Re: Random number generator (was: Can anyone help me?))

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sun Oct 30 00:14:49 CEST 2005

```
On Fri, 28 Oct 2005, Johan Geldenhuys wrote:

> After I tested the previous code, I noticed that the odds is 1:49 that a
> duplicate number can be found in the 6 digit range (and it happended)
> and that 0 can also be found.

Hi Johan,

I'm not exactly sure how 1 in 49 are the odds of getting a duplication.

Just to make sure: the situation is that we're rolling a fifty-sided dice
six times, and we'd like to see what's the probability of having
duplicates in the mix.  We have 50**6 possible rolls, but there are only:

50 * 49 * 48 * 47 * 46 * 45

ways of making sure we don't hit duplicates.  The probability that we
don't hit any duplicates, then, is:

######
>>> (50 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0L
#####

Oops, forgot about integer division.  *grin*

Let me try that again:

######
>>> (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0.73224345599999996
######

Ok, so we have about a 73% chance of not having any duplicate numbers if
we roll a 50-digit dice six times.

Since we can either have no duplication or duplication, then he
probability that we will have duplication is:

(1 - probability of not having duplication)

######
>>> 1 - (50.0 * 49 * 48 * 47 * 46 * 45) / (50 **6)
0.26775654400000004
######

So we have about a 27% percent chance of getting duplication when we roll
a fifty sided dice six times.

But that's just probability theory.  Is this real?  Is this truly
probable?

Let's test this experimentally.  First, we need some way of generating a
roll of dice.  Here's a quick-and-dirty function to do that:

######
>>> def make_roll(n, r):
...     return [random.randrange(n) for i in range(r)]
...
>>> make_roll(50, 6)
[11, 48, 39, 43, 30, 24]
>>> make_roll(50, 6)
[6, 47, 0, 34, 9, 38]
######

We can test things experimentally by writing a function to see a roll has
duplication:

######
>>> def has_duplication(roll):
...     d = {}
...     for number in roll:
...         d[number] = d.get(number, 0) + 1
...     for count in d.values():
...         if count > 1:
...             return True
...     return False
...
>>> has_duplication([1, 2, 3])
False
>>> has_duplication([1, 2, 2])
True
######

Once we have this tool, then it becomes easy to do a trial run and see, if
we roll 10 times, what percentage of those are duplicates.

######
>>> count_duplicates = 0
>>> for i in range(10):
...     if has_duplication(make_roll(50, 6)):
...         count_duplicates = count_duplicates + 1
...
>>> count_duplicates
1
>>> 1.0 / 10
0.10000000000000001
######

Experimentally, we're seeing 1%, but that might just be a fluke.  Or maybe
it's just because the trial size is much too small.  *grin*

Let's formalize this test as a function, just to make it easier to retry:

######
>>> def do_trial(n):
...     """Returns the number of duplicates if we do a n-trial."""
...     count_duplicates = 0
...     for i in range(n):
...         if has_duplication(make_roll(50, 6)):
...             count_duplicates = count_duplicates + 1
...     return float(count_duplicates) / n
...
>>>
>>> do_trial(5)
0.40000000000000002
>>> do_trial(10)
0.29999999999999999
>>> do_trial(100)
0.27000000000000002
>>> do_trial(1000)
0.26900000000000002
>>> do_trial(10000)
0.2621
>>> do_trial(100000)
0.26676
######

So we're experimentally getting evidence that the probability of rolling
duplicates is about 26%, if we have a 50-sided dice six times.  Hey,
that's not bad: that's about the value we got from doing math.

As we scale our expeiment higher, we start seeing the Law of Large Numbers
taking place: when we use lots of trials, then our experimental results
start getting very close to the one we calculated with probability theory.

http://en.wikipedia.org/wiki/Law_of_large_numbers

Here's a punchline: the problem we've been looking at is another disguise

http://en.wikipedia.org/wiki/Birthday_problem

That article makes the assertion that if we have 23 people in a room, the
probability that we have at least two people with the same birthday is
pretty good (about 50%).  We can modify do_trial()  to experimentally see
that!

######
>>> def do_trial(n, x, y):
...     """Returns the number of duplicates if we do a n-trial."""
...     count_duplicates = 0
...     for i in range(n):
...         if has_duplication(make_roll(x, y)):
...             count_duplicates = count_duplicates + 1
...     return float(count_duplicates) / n
...
>>> do_trial(10000, 365, 23)
0.51390000000000002
>>> do_trial(10000, 365, 1)
0.0
>>> do_trial(10000, 365, 2)
0.0023
>>> for i in range(23):
...     print i, "people, probability of shared birthday:",
do_trial(10000, 365, i)
...
0 people, probability of shared birthday: 0.0
1 people, probability of shared birthday: 0.0
2 people, probability of shared birthday: 0.0035
3 people, probability of shared birthday: 0.0087
4 people, probability of shared birthday: 0.0181
5 people, probability of shared birthday: 0.0273
6 people, probability of shared birthday: 0.0436
7 people, probability of shared birthday: 0.0551
8 people, probability of shared birthday: 0.0736
9 people, probability of shared birthday: 0.0932
10 people, probability of shared birthday: 0.1201
11 people, probability of shared birthday: 0.1325
12 people, probability of shared birthday: 0.1654
13 people, probability of shared birthday: 0.1967
14 people, probability of shared birthday: 0.2241
15 people, probability of shared birthday: 0.2581
16 people, probability of shared birthday: 0.281
17 people, probability of shared birthday: 0.3169
18 people, probability of shared birthday: 0.3497
19 people, probability of shared birthday: 0.3843
20 people, probability of shared birthday: 0.4171
21 people, probability of shared birthday: 0.438
22 people, probability of shared birthday: 0.4817
######

The variable names 'x' and 'y' suck (I'll try to think of better ones next
time), but I hope that it's clear what we're doing: we're making
do_trial() more general so it can handle different dice rolls.  And
because they are parameters in our do_trial() function, we can then see
what the situation loosk like as we get more and more kids in the
classroom.