[Patches] [ python-Patches-629637 ] Add a sample selection method to random.py

noreply@sourceforge.net noreply@sourceforge.net
Thu, 07 Nov 2002 22:36:03 -0800


Patches item #629637, was opened at 2002-10-27 21:03
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470

Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Raymond Hettinger (rhettinger)
Assigned to: Tim Peters (tim_one)
Summary: Add a sample selection method to random.py

Initial Comment:
random.randset(n, k) returns a k length list of unique 
integers in the range [0,n).

Improves on a Cookbook submission by using the 
parameters to select between a shuffle algorithm and 
a dictionary algorithm.

I want to add this to the library because it is a simple, 
robust solution to a general selection problem and 
because it isn't obvious that two different algorithms 
are needed to balance speed/space trade-offs.

If approved, will add docs and a news item.

----------------------------------------------------------------------

>Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-08 01:36

Message:
Logged In: YES 
user_id=80475

I'm re-learning to hate the patch process.  This was a 
straight-forward, thoroughy tested, useful patch.  Getting it 
accepted wasn't supposed to be hard.

What is the next step  --  Take it as is, convert the n 
argument to choice() style population list, or withdraw the 
patch?



----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2002-11-07 19:55

Message:
Logged In: YES 
user_id=6380

I'm not even looking at this, I'm delegating this to Tim. He
knows infinitely more about random and permutations than I
do, and he's actually used this stuff.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-07 18:30

Message:
Logged In: YES 
user_id=80475

Assigned to GvR for pronouncement on
a) whether he agrees that a sampling function is useful.
b) whether to implement it as is or with sequence 
arguments
c) whether to leave it in random or put in another module.

The current form returns a list of integers that can be used 
directly or as indices into a sequence.  The advantages are 
flexibility in use and the ability to pick a hundred elements 
out of ten million without building a long list first.  The 
approach is essentially a uniquified list of calls to 
randrange().  Tim prefers an approach that parallels 
random.choice() where the call looks like:
   random.sample([a,b,c,d,e], 2)  # picks 2 of the 5 objects

I think the function belongs in the random module since it 
is a primary use of random numbers (just like shuffle() 
and choose()).  Tim prefers to have a separate library 
module that has a whole grab bag of combinatorics.


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 16:38

Message:
Logged In: YES 
user_id=80475

Thanks for the quick follow-ups.

The switchover ratio of six came from counting pointers 
and longs.  Shuffling uses an n length list at one pointer 
for each element.  The dictionary approach has k 
elements with a hash code, a key pointer, and a value 
pointer for a total of three multiplied by 1.5 and rounding 
up to five (because dict loading is kept under 2/3) and one 
pointer for the 'inorder' return list for a total of six.  Also, I 
liked six to minimize resampling in the dictionary 
approach (keeping it under 20%).

As requested, I'll add the random argument to the 
documentation.

Originally, I was going to have sample() select from an 
arbitrary collection (like choose() does) but, in the end, 
preferred the current approach of choosing integers.  This 
approach allows sample(1000000,60) without building a 
giant list first.  Also, converting from indices to elements is 
trivial:   [colorlist[i] for i in random.sample(len
(colorlist),5)].  

I avoided the n/2 complement selection technique 
because of use case rarity and to allow the sample itself to 
be in random order (oxymoron?).  If you guys think it's 
necessary, I'll add a complement selection branch 
followed by a call to random.shuffle().  Still, as it stands, 
the code is robust, uses space no larger than a k sized 
dictionary, and runs with no more than 1.2*k calls to 
random().

I don't know why CombGen.py never made it to 
Tools/scripts. Even if it does, I think a random sampling 
function belongs in the random module where people can 
find it -- it is a very common use of random numbers.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2002-11-05 13:31

Message:
Logged In: YES 
user_id=31435

I agree this is useful, but would rather see Python grow 
libraries for combinatorial objects.  There are many things 
beyond this that are also useful,  For example, the examples 
you gave here were of selections from collections that aren't 
range(n), and it would be more useful to more people to have 
a way to choose k elements from an arbitrary n-element 
collection directly (like a collection of transactions, or a set of 
cards, whatever).

Note that I posted a module to Python-Dev not long ago that 
implements such stuff (CombGen.py), along with other useful 
functions on combinations.

Note that when k > n/2, "the usual trick" isn't to shuffle a list, 
but to generate a complement selection.  For example, if you 
want a random sample of 9999 out of 10000, it's a lot more 
efficient to pick the single element that's *not* in the result.  
See CombGen for code to do this.

----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2002-11-05 12:34

Message:
Logged In: YES 
user_id=21627

Thanks for the explanation. On to the implementation: How
did you arrive at the factor of 6 between a dictionary and a
list?

The documentation should mention the random optional argument.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 10:36

Message:
Logged In: YES 
user_id=80475

Like shuffle() and choose(), random sampling without 
replacement is one of the core principal use cases for 
random numbers.

Acceptance testing often requires a fixed number of non-
overlapping samples i.e.  Selecting 60 transactions out of 
a 1000 and finding zero errors yields a 95% confidence 
that the population has less than a 5% error rate.

Some simulations also need groups of non-overlapping 
samples i.e. a lottery result of six unique numbers 
selected from a range of 1 to 57.  An electronic raffle picks 
consecutive winners without allowing previous winners to 
be  reselected.

While sampling with replacement is trivial to implement 
with a list comprehension, sampling without replacement 
has a number of implementation nuances that makes it 
worthwhile to have a robust solution already implemented 
in the random library.




----------------------------------------------------------------------

Comment By: Martin v. L÷wis (loewis)
Date: 2002-11-05 03:27

Message:
Logged In: YES 
user_id=21627

Can you explain why this needs to be in the standard
library? I.e. what typical application would use it?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-11-05 01:33

Message:
Logged In: YES 
user_id=80475

Martin, do you have time to give this patch a second 
review?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-31 02:29

Message:
Logged In: YES 
user_id=80475

Added new version with local variable optimization and 
with the dictionary results returned in selection order.


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42

Message:
Logged In: YES 
user_id=80475

Renamed to random.sample(n,k) to show that it is used 
for sampling without replacement.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 23:42

Message:
Logged In: YES 
user_id=80475

Renamed to random.sample(n,k) to show that it is used 
for sampling without replacement.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-10-28 07:54

Message:
Logged In: YES 
user_id=80475

Added full patch with news item and docs.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=629637&group_id=5470