Random from Dictionary
Tim Peters
tim.one at home.com
Sun Oct 21 17:29:07 EDT 2001
[Tim]
> ...
> In 2.2 you could use a more memory-efficient scheme via iterating over
> d directly:
>
> def random_from_dict(d, x):
> "Return and remove x random (k, v) pairs from d."
> from random import random
> n = float(len(d)) # float() so x/n later doesn't truncate to 0
> if not 1 <= x <= n:
> raise ValueError("go away")
> result = []
> for k in d:
> # Choose this item with probability x/n.
> if random() <= x/n:
> result.append((k, d[k]))
> x -= 1
> if x == 0:
> break
> n -= 1
> for k, v in result:
> del d[k]
> return result
[Martin von Loewis]
> Does that give each key in the dictionary the same probability of
> being picked (assuming random() distributes uniformly)?
Yes.
> If yes, is there some easy way to "see" that this is the case?
Not an entirely convincing easy way that I know of. I view it like this:
The first element is obviously chosen with probability x/n, and that's
obviously correct. If it's not chosen, then we need to pick an
x-combination of the n-1 remaining elements at random, so decrementing n
alone "obviously" does the right thing to reduce the problem to this new
case. OTOH, if we do choose the first element, then what remains is to
choose a random x-1 combination of the remaining n-1 elements, so
decrementing both x and n is "obviously" correct to reduce the problem to
that new case. You can turn that into a formal proof without *much*
difficulty.
"Algorithm S" in Knuth Vol 2 Ed 3 Section 3.4.2 is a more confusingly (to my
mind) coded version of this. The text remarks that the algorithm appears
"to be incorrect", but leaves it to a series of detailed exercises to prove
it's unbiased. Here's a start: again, the first element is obviously
chosen with probability x/n. It is or it isn't picked. If it is, the 2nd
element is chosen with (conditional) probability x/n * (x-1)/(n-1); if it
isn't, with cond prob (n-x)/n * x/(n-1). Add those together to get the prob
that the 2nd element is chosen:
x*(x-1) (n-x)*x x*(x-1+n-x) x*(n-1) x
------- + ------- = ----------- = ------- = -
n*(n-1) n*(n-1) n*(n-1) n*(n-1) n
Now wave your hands a lot for the rest, or break down and do some actual
math <wink>.
> If no, could one design an algorithm that iterates over a dictionary
> only once, still offers the same probability to each key?
Already done by the above. A more efficient way, but harder to work out the
details, is to exploit that there are comb(n, x) possible x- combinations of
the integers in range(n). Establish a computable bijection beteen all
x-combinations of range(n) and the integers in range(comb(n, x)). Pick one
integer from range(comb(n, x)) at random, and compute the corresponding
x-combination of range(n) under the bijection. This can be used as a list
of indices to suck out x elements from d.items().
playing-the-odds-ly y'rs - tim
More information about the Python-list
mailing list