implement random selection in Python

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat Nov 17 04:13:43 CET 2007


On Fri, 16 Nov 2007 16:47:16 -0800, Bruza wrote:

> I think I need to explain on the probability part: the "prob" is a
> relative likelihood that the object will be included in the output list.
> So, in my example input of
> 
>   items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
> 
> So, for any size of N, 'Tom' (with prob of 45) will be more likely to be
> included in the output list of N distinct member than 'Mary' (prob of
> 30) and much more likely than that of 'John' (with prob of 10).


Since there are only four items, you can't select more than four distinct 
items, because the fifth has to be a duplicate of one of the others. And 
the probability/likelihood doesn't work either: if you insist on getting 
distinct items, then your results are are much limited:

N = 0, result: []

N = 1, four possible results:
['Mary'] ['John'] ['Tom'] or ['Jane']

N = 2, six possible results (assuming order doesn't matter):
['Mary', 'John'] ['Mary', 'Tom'] ['Mary', 'Jane'] ['John', 'Tom']
['John', 'Jane'] or ['Tom', 'Jane']

N = 3, four possible results:
['Mary', 'John', 'Tom'] ['Mary', 'John', 'Jane'] 
['Mary', 'Tom', 'Jane'] or ['John', 'Tom', 'Jane']

N = 4, one possible result:
['Mary', 'John', 'Tom', 'Jane']



> I know "prob" is not exactly the "probability" in the context of
> returning a multiple member list. But what I want is a way to "favor"
> some member in a selection process.

I don't think this is really well defined, but I'll take a stab in the 
dark at it.

Let's take the example above for N = 3:

items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
# use the lowest common factor for simplicity
items = [('Mary',6), ('John', 2), ('Tom', 9), ('Jane', 3)]

# N = 3 has four possible results:
R = ['Mary', 'John', 'Tom']
S = ['Mary', 'John', 'Jane']
T = ['Mary', 'Tom', 'Jane']
U = ['John', 'Tom', 'Jane']

You want to bias the four possible results above so that having Mary is 
three times more likely than having John. It sounds simple, but it isn't. 
To do so, you need to solve a whole series of simultaneous equations:

Pr(Mary) = Pr(R) + Pr(S) + Pr(T)
Pr(John) = Pr(R) + Pr(S) + Pr(U)
Pr(Mary) = 3*Pr(John)

And so on for all the other items.

This is a HARD problem to solve, and for most sets of likelihoods you 
might give, there won't be a solution. For example, you can't have a set 
of results where Mary is three times more likely than John, John is twice 
as likely as Tom, and Tom is four times more likely than Mary. It simply 
can't happen.

So we can't interpret the numbers as probabilities. We can't even 
interpret them more loosely as "likelihoods". Proof of that is to 
consider the case of N=4. There is only one possible result with four 
distinct items. So all of Mary, John, Tom and Jane are equally likely, in 
fact all of them are certain. The weightings you gave (30, 10, 45, 15) 
are meaningless.

So, given that what you are asking for is impossible, can you explain 
what you are actually trying to accomplish? Maybe there's a more feasible 
alternative.


-- 
Steven.



More information about the Python-list mailing list