[Python-ideas] random.choice on non-sequence

Tue Apr 12 18:25:05 EDT 2016

Ow, sorry! That's the algorithm I meant to reference (the link I
actually gave is for a different situation). See also
https://en.wikipedia.org/wiki/Reservoir_sampling.

It does indeed avoid storing a copy of the sequence -- at the cost of
N calls to random(). Try timing it -- I wouldn't be surprised if
copying a set of N elements into a tuple is a lot faster than N
random() calls.

So we're stuck with two alternatives, neither of which is always the
best: (1) just copy the sequence (like Rob's proposal) -- this loses
big time if the sequence is large and not already held in memory; (2)
use reservoir sampling, at the cost of many more random() calls.

Since Rob didn't link to the quoted conversation I don't have it handy
to guess what Raymond meant with "Bentley" either, but I'm guessing
Jon Bentley wrote about reservoir sampling (before it was named that).
I recall hearing about the algorithm for the first time in the late
'80s from a coworker with Bell Labs ties.

(Later, Ethan wrote)
> So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?

That's pretty much it. Were this 1995, I probably would have put some
toy in the stdlib. But these days we should be more careful than that.

--Guido

On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta
<cognetta.marco at gmail.com> wrote:
> Maybe I am misunderstanding but for a generator, couldn't you avoid
> storing it in memory and just using the following online algorithm to
> save space? http://stackoverflow.com/a/23352100
>
> (I hope this message goes through correctly, first time participating
> in a discussion on this mailing list).
>
> -Marco
>
> On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido at python.org> wrote:
>> The problem with your version is that copying the input is slow if it is large.
>>
>> Raymond was referencing the top answer here:
>> http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values.
>> It's also slow though (draws N random values if the input is of length
>> N).
>>
>> On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe at btinternet.com> wrote:
>>> It surprised me a bit the first time I realised that random.choice did not
>>> work on a set.  (One expects "everything" to "just work" in Python! :-) )
>>> There is a simple workaround (convert the set to a tuple or list before
>>> passing it) but ...
>>>
>>> On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module'
>>> on Python-Ideas, one of which was to allow random.choice to work on an
>>> arbitrary iterable.
>>> Raymond Hettinger commented
>>>     "It could be generalized to arbitrary iterables (Bentley provides an
>>> example of how to do this) but it is fragile (i.e. falls apart badly with
>>> weak random number generators) and doesn't correspond well with real use
>>> cases."
>>> Sven replied
>>>     "Again, there definitely are real world use cases ..."
>>>
>>> I would like to revive this suggestion.  I don't understand Raymond's
>>> comment.  It seems to me that a reasonable implementation is better and more
>>> friendly than none, particularly for newbies.
>>> This is the source code for random.choice in Python 3.2 (as far as I know it
>>> hasn't changed since):
>>>
>>>
>>>     def choice(self, seq):
>>>         """Choose a random element from a non-empty sequence."""
>>>         try:
>>>             i = self._randbelow(len(seq))
>>>         except ValueError:
>>>             raise IndexError('Cannot choose from an empty sequence')
>>>         return seq[i]
>>>
>>>
>>> And here is my proposed (tested) version:
>>>
>>>
>>>     def choice(self, it):
>>>         """Choose a random element from a non-empty iterable."""
>>>         try:
>>>             it[0]
>>>         except IndexError:
>>>             raise IndexError('Cannot choose from an empty sequence')
>>>         except TypeError:
>>>             it = tuple(it)
>>>         try:
>>>             i = self._randbelow(len(it))
>>>         except ValueError:
>>>             raise IndexError('Cannot choose from an empty iterable')
>>>         return it[i]
>>>
>>>
>>> This works on (e.g.) a list/tuple/string, a set, a dictionary view or a
>>> generator.
>>> Obviously the generator has to be 'fresh' (i.e. any previously consumed
>>> values will be excluded from the choice) and 'throw-away' (random.choice
>>> will exhaust it).
>>> But it means you can write code like
>>>     random.choice(x for x in xrange(10) if x%3)    # this feels quite
>>> "Pythonic" to me!
>>>
>>> (I experimented with versions that, when passed an object that supported
>>> len() but not indexing, viz. a set, iterated through the object as far as
>>> necessary instead of converting it to a list.  But I found that this was
>>> slower in practice as well as more complicated.  There might be a memory
>>> issue with huge iterables, but we are no worse off using existing
>>> workarounds for those.)
>>>
>>> Downsides of proposed version:
>>>     (1) Backward compatibility.  Could mask an error if a "wrong" object,
>>> but one that happens to be an iterable, is passed to random.choice.
>>>           There is also slightly different behaviour if a dictionary (D) is
>>> passed (an unusual thing to do):
>>>                 The current version will get a random integer in the range
>>> [0, len(D)), try to use that as a key of D,
>>>                    and raise KeyError if that fails.
>>>                 The proposed version behaves similarly except that it will
>>> always raise "KeyError: 0" if 0 is not a key of D.
>>>                 One could argue that this is an improvement, if it matters
>>> at all (error reproducibility).
>>>                 And code that deliberately exploits the existing behaviour,
>>> provided that it uses a dictionary whose keys are consecutive integers from
>>> 0 upwards, e.g.
>>>                     D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 :
>>> 'four' }
>>>                     d = random.choice(D)    # equivalent, in this example,
>>> to "d = random.choice(list(D.values()))"
>>>                     if d == 'zero':
>>>                         ...
>>>                 will continue to work (although one might argue that it
>>> deserves not to).
>>>     (2) Speed - the proposed version is almost certainly slightly slower
>>> when called with a sequence.
>>>          For what it's worth, I tried to measure the difference on my
>>> several-years-old Windows PC, but is was hard to measure accurately, even
>>> over millions of iterations.
>>>          All I can say is that it appeared to be a small fraction of a
>>> millisecond, possibly in the region of 50 nanoseconds, per call.
>>>
>>> Best wishes
>>> Rob Cliffe
>>> _______________________________________________
>>> Python-ideas mailing list
>>> Python-ideas at python.org
>>> https://mail.python.org/mailman/listinfo/python-ideas
>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>>
>>
>> --
>> --Guido van Rossum (python.org/~guido)
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/

-- 
--Guido van Rossum (python.org/~guido)