[Python-ideas] random.choice on non-sequence

Rob Cliffe rob.cliffe at btinternet.com
Wed Apr 13 00:09:30 EDT 2016


Thanks to everyone for all the feedback so far.

Interesting.  I had not heard of reservoir sampling, and it had not 
occurred to me that one could select a uniformly random element from a 
sequence of unknown length without copying it.
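For anyone else meeting the idea for the first time: the single-element case of reservoir sampling can be sketched in a few lines. (The function name is mine, not anything in the stdlib.)

```python
import random

_MISSING = object()  # sentinel, so that None is a valid element

def choice_from_iterable(iterable):
    """Pick one element uniformly at random from an iterable of unknown
    length, in a single pass and O(1) extra memory (reservoir sampling
    with a reservoir of size 1)."""
    pick = _MISSING
    for n, item in enumerate(iterable, start=1):
        # Keep the new item with probability 1/n; after n items, each
        # one has survived with probability exactly 1/n.
        if random.randrange(n) == 0:
            pick = item
    if pick is _MISSING:
        raise IndexError("cannot choose from an empty iterable")
    return pick
```

Note that this makes one call to the random number generator per element, which matters for the timings discussed below.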

I am with Ben in that I consider "set" the most important type that 
random.choice should accept but doesn't already.  If just that could be 
achieved, yes Ben, I think that would be a plus.
(The fact that I could write a version that handled generators 
(efficiently or not) just seemed to be a bonus.)
ISTM however that there is a problem accepting dict, because of the 
ambiguity - does it mean a random key, a random value, or a random 
(key, value) item?
(I would pick a random key (consistent with "for k in MyDict"), but ISTM 
"explicit is better than implicit".)
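Of course, the caller can always resolve the ambiguity explicitly today, at the cost of a copy:

```python
import random

d = {"a": 1, "b": 2, "c": 3}

key = random.choice(tuple(d))             # a random key
value = random.choice(tuple(d.values()))  # a random value
pair = random.choice(tuple(d.items()))    # a random (key, value) item
```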

There are problems with arbitrary iterables as Ben points out.  But I 
don't see these as an objection to making random.choice accept them, because
       (1) It is easy to produce examples of stdlib functions that can 
be crashed with infinite iterables, e.g.
                 >>> import itertools
                 >>> def forever():
                 ...     while 1: yield 1
                 ...
                 >>> itertools.product(forever(), forever())
                 Traceback (most recent call last):
                   File "<stdin>", line 1, in <module>
                 MemoryError
         (2) As I indicated in my original post, it is understood that 
generators are "throw-away", so may need to be copied.
'Consenting adults', no?
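And the consenting-adults workaround is a one-liner anyway, for sets and (finite) generators alike:

```python
import random

colours = {"red", "green", "blue"}
picked_colour = random.choice(tuple(colours))

squares = (x * x for x in range(10))
# Copying to a tuple consumes the generator, but that is the understood
# price of feeding a generator to anything that needs a sequence.
picked_square = random.choice(tuple(squares))
```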

One small point: Whenever I've tested it (quite a few times, for 
different projects), I've always found that it was faster to copy to a 
tuple than to a list.
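Anyone can re-check that claim on their own platform with something like the following (illustrative only; relative timings vary by platform and Python version):

```python
import timeit

# Build each container from a fresh generator on every repetition, so
# the generator cost is identical in both measurements.
t_tuple = timeit.timeit("tuple(x for x in range(100000))", number=50)
t_list = timeit.timeit("list(x for x in range(100000))", number=50)
print("tuple: %.3fs  list: %.3fs" % (t_tuple, t_list))
```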

I did a little experimenting (admittedly non-rigorous, on one platform 
only, using Python 2.7.10 rather than Python 3, and using code which 
could very possibly be improved) on selecting a random element from a 
generator, and found that:
     - for small or moderate generators, reservoir sampling was almost 
       always slower than generating a tuple;
     - as the generator length increased up to roughly 10,000, the ratio
             (time taken by reservoir) / (time taken by tuple)
       increased, reaching a maximum of over 4;
     - as the generator length increased further, the ratio started to 
       decrease, although for a length of 80 million (about as large as 
       I could test) it was still over 3.
(This would change if the reservoir approach could be recoded in a way 
that was amazingly faster than the way I did it, but, arrogance aside, I 
doubt that.  I will supply details of my code on request.)
I think this is a tribute to the efficiency of tuple building.
It's also enough to convince me that reservoir sampling, and Marco 
Cognetta's compromise approach, are non-starters.
More rigorous testing would be necessary to convince the rest of the world.
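A measurement along these lines (an illustrative sketch, not the code I actually ran) shows the shape of the comparison; the reservoir path pays for one random number per element, the tuple path for one allocation pass plus a single random index:

```python
import random
import timeit

def reservoir_choice(iterable):
    # One random number per element seen.
    pick = None
    for n, item in enumerate(iterable, start=1):
        if random.randrange(n) == 0:
            pick = item
    return pick

N = 10000
t_reservoir = timeit.timeit(
    lambda: reservoir_choice(x for x in range(N)), number=50)
t_tuple = timeit.timeit(
    lambda: random.choice(tuple(x for x in range(N))), number=50)
print("reservoir/tuple ratio: %.2f" % (t_reservoir / t_tuple))
```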

I am also pretty well convinced from actual tests (not from theoretical 
reasoning) that the "convert it to a tuple" recipe is not far off O(N), 
both for sets and for generators (it gets worse for generators that 
produce objects with large memory footprints), and that it is certainly 
fast enough to be useful: the point at which it gets unacceptably slow 
is generally not far from the point where you run out of memory.
I did try an approach similar to Chris Angelico's of iterating over sets 
up to the selection point (my bikeshed actually used enumerate rather 
than zip), but it was much slower than the "tuple-ise" algorithm.
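The enumerate-based version was along these lines (a reconstruction of the idea, not my exact code):

```python
import random

def choice_by_scan(container):
    """Pick a uniformly random element of a sized container (e.g. a set)
    by scanning up to a randomly chosen index: no copy is made, but the
    scan is still O(N) in the worst case."""
    target = random.randrange(len(container))  # len() is O(1) for sets
    for i, item in enumerate(container):
        if i == target:
            return item
```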

On 13/04/2016 00:04, Guido van Rossum wrote:
> This still seems wrong *unless* we add a protocol to select a random
> item in O(1) time. There currently isn't one for sets and mappings --
> only for sequences.
Hm, would this need a change to the structure of dicts and sets?
Is there by any chance a quick way of getting the Nth item added in an 
OrderedDict?
>   It would be pretty terrible if someone wrote code
> to get N items from a set of size N and their code ended up being
> O(N**2).
Er, I'm not quite sure what you're saying here.
random.sample allows you to select k items from a set of size N (it 
converts the set to a tuple).
If k=N, i.e. you're trying to select the whole set, that's a silly thing 
to do if you know you're doing it.
But even in this case, random.sample is approximately O(N); I tested it.
If someone writes their own inefficient version of random.sample, that's 
their problem.
Have I misunderstood?  Do you mean selecting N items with replacement?
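(For the record, doing the tuple conversion yourself keeps this working everywhere; in current Pythons it is in fact required, since random.sample stopped accepting sets directly in 3.11.)

```python
import random

s = set(range(1000))
# Explicit conversion to a sequence, then sample k elements without
# replacement.
picked = random.sample(tuple(s), 10)
```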

Hm, it occurs to me that random.sample itself could be extended to 
choose k unique random elements from a general iterable.
I have seen a reservoir sampling algorithm that claims to be O(N):
     https://en.wikipedia.org/wiki/Reservoir_sampling
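The basic algorithm from that article (Algorithm R) is short enough to sketch here; it generalises the size-1 reservoir to k elements in one pass:

```python
import random

def reservoir_sample(iterable, k):
    """Choose k distinct elements uniformly at random from an iterable
    of unknown length: one pass, O(N) time, O(k) extra memory."""
    reservoir = []
    for n, item in enumerate(iterable, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            # Replace a random slot with probability k/n; this keeps
            # every element seen so far in the reservoir with
            # probability exactly k/n.
            j = random.randrange(n)
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError("sample larger than population")
    return reservoir
```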

Rob

  On Tue, Apr 12, 2016 at 3:57 PM, Ben Finney 
<ben+python at benfinney.id.au> wrote:
>> Rob Cliffe <rob.cliffe at btinternet.com> writes:
>>
>>> It surprised me a bit the first time I realised that random.choice did
>>> not work on a set. (One expects "everything" to "just work" in Python!
>>> :-) )
>> I am +1 on the notion that an instance of ‘tuple’, ‘list’, ‘set’, and
>> even ‘dict’, should each be accepted as input for ‘random.choice’.
>>
>>> On 2011-06-23, Sven Marnach posted 'A few suggestions for the random
>>> module' on Python-Ideas, one of which was to allow random.choice to
>>> work on an arbitrary iterable.
>> I am −1 on the notion of an arbitrary iterable as ‘random.choice’ input.
>>
>> Arbitrary iterables may be consumed merely by being iterated. This will
>> force the caller to make a copy, which may as well be a normal sequence
>> type, defeating the purpose of accepting that “arbitrary iterable” type.
>>
>> Arbitrary iterables may never finish iterating. This means the call
>> to ‘random.choice’ would sometimes never return.
>>
>> The ‘random.choice’ function should IMO not be responsible for dealing
>> with those cases.
>>
>>
>> Instead, a more moderate proposal would be to have ‘random.choice’
>> accept an arbitrary container.
>>
>> If the object implements the container protocol (by which I think I mean
>> that it conforms to ‘collections.abc.Container’), it is safe to iterate
>> and to treat its items as a collection from which to choose a random
>> item.
>>
>>
>> Rob, does that proposal satisfy the requirements that motivated this?
>>
>> --
>>   \        “Some people, when confronted with a problem, think ‘I know, |
>>    `\       I'll use regular expressions’. Now they have two problems.” |
>> _o__)                           —Jamie Zawinski, in alt.religion.emacs |
>> Ben Finney
>>
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>
>


