Python COM and distribution issues

David Bolen db3l at fitlinxx.com
Wed May 16 21:09:38 EDT 2001


hungjunglu at yahoo.com writes:

> There is a famous birthday problem for probability as taught in high 
> schools: "What is the typical size of a class for two persons to have 
> the same birthday?" Most people would say that since there are 365 
> days, they'd guess somewhere around half that number, that is, for a 
> class of around 180 students.
> 
> The actual number is more like the square root of 365, that is, for a 
> group of 20 people or so (order of magnitude), you'd already have a 
> pretty good chance of finding two people with the same birthday. 
> 
> For p students, the probability of finding NO students with the same 
> birthday is about 
> 
>    prob. for zero collision = exp[p(p-1)/(2N)] (where N=365)

I'm by no means a mathematician (so be gentle), but while this may be
true for birthday's or other random assignments I think applying it
literally to GUIDs is inappropriate, because the GUID generation
algorithm has both time sensitivity and globally unique components,
and thus collisions are not really probabilistic over the entire range
of values, but only among those GUIDs for which one of the individual
components are the same.

> 2^128 is about 3.4x10^39, square root of that is 10^19 or so. There 
> are 10^7 seconds in a year (order of magnitude), age of universe is 
> about 10^10 years. If you generate a billion (10^9) GUIDs per second, 
> I'd say you have about (7+10+9-19) = 7 orders of power of 10, that 
> is, tens of millions of GUID collisions.

I'm not sure I agree that you can handle the entire set of generated
GUIDs as a single set of numbers for which you can take the
probabilities.

If you know that the set of GUIDs generated at time T (which has a
resolution down to some number of nanoseconds) simply cannot collide
with GUIDs at any other time T, then your only concern for collisions
is a probability of collisions at time T, taken over the range of such
time Ts.

Then if you specifically design your selection of GUIDs to ensure
uniqueness at a given time T (say, by the MAC address, and/or some
volatile machine specific information), and then throw in a counter
just in case you need more than one GUID for each time T, you can do a
really really good job of avoiding collisions at that time.

In your birthday example, it would be like assigning a unique number
to each set of parents, and then combining that number with the
birthday for each child, along with a count of which child it was.
And if the parents forget their unique number, then one is created out
of a combination of various bits of information at the time of birth
(their ages, height, weight, and so on) and then a bit reserved so it
cannot possibly collide with any officially assigned family number.

So then your question is going to become not "who has the same
birthday" but "who has the same birthday, unique family identifier,
and child number".  And even though each family is operating
independently, your risk of collision is going to be unbelievably
small.

Now, assigning the unique family identifiers, and coming up with the
volatile information is critical to the approach, but if done right, I
think your estimate of collisions will be far too high.

> I did the math, years ago. Don't tell people to do the math when you 
> yourself cannot do the math. :)

I'll admit that I wouldn't have had the math at my fingertips, but as
much as the math, understanding the algorithm is critical.  We're not
talking about random numbers chosen out of a 2^128 space, but an
algorithm specifically designed to avoid collisions under a
distributed generation system, across both time and space (at least
for the next 1500 years or so).

An archived copy of an older I-D that described the layout and
algorithm can be found at:

    http://www.opengroup.org/dce/info/draft-leach-uuids-guids-01.txt

for example.

--
-- David
-- 
/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/



More information about the Python-list mailing list