Random from Dictionary

Tim Peters tim.one at home.com
Sun Oct 21 17:29:07 EDT 2001


[Tim]
> ...
> In 2.2 you could use a more memory-efficient scheme via iterating over
> d directly:
>
> def random_from_dict(d, x):
>     "Return and remove x random (k, v) pairs from d."
>     from random import random
>     n = float(len(d))  # float() so x/n later doesn't truncate to 0
>     if not 1 <= x <= n:
>         raise ValueError("go away")
>     result = []
>     for k in d:
>         # Choose this item with probability x/n.
>         if random() <= x/n:
>             result.append((k, d[k]))
>             x -= 1
>             if x == 0:
>                 break
>         n -= 1
>     for k, v in result:
>         del d[k]
>     return result

[Martin von Loewis]
> Does that give each key in the dictionary the same probability of
> being picked (assuming random() distributes uniformly)?

Yes.

> If yes, is there some easy way to "see" that this is the case?

Not an entirely convincing easy way that I know of.  I view it like this:
The first element is obviously chosen with probability x/n, and that's
obviously correct.  If it's not chosen, then we need to pick an
x-combination of the n-1 remaining elements at random, so decrementing n
alone "obviously" does the right thing to reduce the problem to this new
case.  OTOH, if we do choose the first element, then what remains is to
choose a random x-1 combination of the remaining n-1 elements, so
decrementing both x and n is "obviously" correct to reduce the problem to
that new case.  You can turn that into a formal proof without *much*
difficulty.
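For the skeptical, the claim is also easy to check empirically.  Here's a
quick simulation sketch (the function is the one quoted above, lightly
modernized; the 20000-trial count and 0.03 tolerance are arbitrary choices)
verifying that each key lands in the sample with probability x/n:

```python
from collections import Counter
from random import random

def random_from_dict(d, x):
    "Return and remove x random (k, v) pairs from d."
    n = float(len(d))  # float() so x/n later doesn't truncate to 0
    if not 1 <= x <= n:
        raise ValueError("go away")
    result = []
    for k in d:
        # Choose this item with probability x/n.
        if random() <= x / n:
            result.append((k, d[k]))
            x -= 1
            if x == 0:
                break
        n -= 1
    for k, v in result:
        del d[k]
    return result

# Count how often each of 5 keys lands in a random 2-subset.
trials = 20000
counts = Counter()
for _ in range(trials):
    d = dict.fromkeys("abcde", 0)
    for k, v in random_from_dict(d, 2):
        counts[k] += 1

# Each key should be picked with probability x/n = 2/5 = 0.4.
for k in "abcde":
    assert abs(counts[k] / trials - 0.4) < 0.03, (k, counts[k])
```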

"Algorithm S" in Knuth Vol 2 Ed 3 Section 3.4.2 is a more confusingly (to my
mind) coded version of this.  The text remarks that the algorithm appears
"to be incorrect", but leaves it to a series of detailed exercises to prove
it's unbiased.  Here's a start:  again, the first element is obviously
chosen with probability x/n.  It is or it isn't picked.  If it is, the 2nd
element is chosen with (conditional) probability x/n * (x-1)/(n-1); if it
isn't, with cond prob (n-x)/n * x/(n-1).  Add those together to get the prob
that the 2nd element is chosen:

    x*(x-1)   (n-x)*x   x*(x-1+n-x)   x*(n-1)   x
    ------- + ------- = ----------- = ------- = -
    n*(n-1)   n*(n-1)     n*(n-1)     n*(n-1)   n

Now wave your hands a lot for the rest, or break down and do some actual
math <wink>.
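Or let the machine do the math:  the identity above holds exactly, and exact
rational arithmetic (via the fractions module -- not something the original
code needs, just a checking device) confirms it for every small n and x
without roundoff worries:

```python
from fractions import Fraction as F

def prob_second_chosen(n, x):
    # P(2nd picked) = P(1st picked)  * P(2nd | 1st picked)
    #               + P(1st skipped) * P(2nd | 1st skipped)
    return F(x, n) * F(x - 1, n - 1) + F(n - x, n) * F(x, n - 1)

# The sum collapses to x/n for every feasible (n, x).
for n in range(2, 20):
    for x in range(1, n + 1):
        assert prob_second_chosen(n, x) == F(x, n)
```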

> If no, could one design an algorithm that iterates over a dictionary
> only once, still offers the same probability to each key?

Already done by the above.  A more efficient way, but harder to work out the
details, is to exploit that there are comb(n, x) possible x-combinations of
the integers in range(n).  Establish a computable bijection between all
x-combinations of range(n) and the integers in range(comb(n, x)).  Pick one
integer from range(comb(n, x)) at random, and compute the corresponding
x-combination of range(n) under the bijection.  This can be used as a list
of indices to suck out x elements from d.items().
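One way to realize that bijection is the "combinatorial number system":
rank the x-combinations of range(n) lexicographically, and "unrank" a random
integer back into a combination.  A sketch (the function names are mine, and
math.comb needs Python 2.2's distant descendant, 3.8+):

```python
from math import comb
from random import randrange

def unrank_combination(r, n, x):
    """Map r in range(comb(n, x)) to the r'th x-combination of
    range(n), in lexicographic order."""
    result = []
    start = 0
    for slots in range(x, 0, -1):
        # Find the smallest element c >= start that heads the r'th block.
        for c in range(start, n):
            # Number of combinations whose next element is c.
            block = comb(n - c - 1, slots - 1)
            if r < block:
                result.append(c)
                start = c + 1
                break
            r -= block
    return result

def random_combination(n, x):
    "Pick one x-combination of range(n) uniformly at random."
    return unrank_combination(randrange(comb(n, x)), n, x)
```

The result of random_combination(len(d), x) can then be used as a sorted
list of indices into list(d.items()) to suck out x pairs in one pass.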

playing-the-odds-ly y'rs  - tim

More information about the Python-list mailing list