[Python-ideas] random.choice on non-sequence

Tue Apr 12 20:07:15 EDT 2016

At the risk of making this not very easily accessible, couldn't we use
both approaches (reservoir sampling and just loading the whole thing
into memory) and apply the ski-rental problem? We could come up with
some cost for calling random and use that to determine when we should
just give up and load it into memory if we do not know the length of
the sequence produced by the generator. If the user knows something
about the generator beforehand they could specify in some optional
parameter whether or not they want to use reservoir sampling or just
load it into memory at the start.

The theory side of me likes this but I admit its a little ugly to be
included in Python's stdlib.

-Marco

On Tue, Apr 12, 2016 at 7:05 PM, Mahmoud Hashemi <mahmoud at hatnote.com> wrote:
> Reservoir sampling is super handy. For a Pythonic introduction, along with
> links to implementation, see here:
> https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dipping_into_the_stream
>
> Even though I love the elegance enough to write and illustrate about it, I'd
> be very surprised to have found it built into Python at this low of a level.
>
> Mahmoud
>
> On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <guido at python.org> wrote:
>>
>> Ow, sorry! That's the algorithm I meant to reference (the link I
>> actually gave is for a different situation). See also
>> https://en.wikipedia.org/wiki/Reservoir_sampling.
>>
>> It does indeed avoid storing a copy of the sequence -- at the cost of
>> N calls to random(). Try timing it -- I wouldn't be surprised if
>> copying a set of N elements into a tuple is a lot faster than N
>> random() calls.
>>
>> So we're stuck with two alternatives, neither of which is always the
>> best: (1) just copy the sequence (like Rob's proposal) -- this loses
>> big time if the sequence is large and not already held in memory; (2)
>> use reservoir sampling, at the cost of many more random() calls.
>>
>> Since Rob didn't link to the quoted conversation I don't have it handy
>> to guess what Raymond meant with "Bentley" either, but I'm guessing
>> Jon Bentley wrote about reservoir sampling (before it was named that).
>> I recall hearing about the algorithm for the first time in the late
>> '80s from a coworker with Bell Labs ties.
>>
>> (Later, Ethan wrote)
>> > So the objection is that because performance can vary widely it is
>> > better for users to select their own algorithm rather than rely on a
>> > one-size-fits-all stdlib solution?
>>
>> That's pretty much it. Were this 1995, I probably would have put some
>> toy in the stdlib. But these days we should be more careful than that.
>>
>> --Guido
>>
>> On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta
>> <cognetta.marco at gmail.com> wrote:
>> > Maybe I am misunderstanding but for a generator, couldn't you avoid
>> > storing it in memory and just using the following online algorithm to
>> > save space? http://stackoverflow.com/a/23352100
>> >
>> > (I hope this message goes through correctly, first time participating
>> > in a discussion on this mailing list).
>> >
>> > -Marco
>> >
>> > On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido at python.org>
>> > wrote:
>> >> The problem with your version is that copying the input is slow if it
>> >> is large.
>> >>
>> >> Raymond was referencing the top answer here:
>> >>
>> >> http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values.
>> >> It's also slow though (draws N random values if the input is of length
>> >> N).
>> >>
>> >> On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe at btinternet.com>
>> >> wrote:
>> >>> It surprised me a bit the first time I realised that random.choice did
>> >>> not
>> >>> work on a set.  (One expects "everything" to "just work" in Python!
>> >>> :-) )
>> >>> There is a simple workaround (convert the set to a tuple or list
>> >>> before
>> >>> passing it) but ...
>> >>>
>> >>> On 2011-06-23, Sven Marnach posted 'A few suggestions for the random
>> >>> module'
>> >>> on Python-Ideas, one of which was to allow random.choice to work on an
>> >>> arbitrary iterable.
>> >>> Raymond Hettinger commented
>> >>>     "It could be generalized to arbitrary iterables (Bentley provides
>> >>> an
>> >>> example of how to do this) but it is fragile (i.e. falls apart badly
>> >>> with
>> >>> weak random number generators) and doesn't correspond well with real
>> >>> use
>> >>> cases."
>> >>> Sven replied
>> >>>     "Again, there definitely are real world use cases ..."
>> >>>
>> >>> I would like to revive this suggestion.  I don't understand Raymond's
>> >>> comment.  It seems to me that a reasonable implementation is better
>> >>> and more
>> >>> friendly than none, particularly for newbies.
>> >>> This is the source code for random.choice in Python 3.2 (as far as I
>> >>> know it
>> >>> hasn't changed since):
>> >>>
>> >>>
>> >>>     def choice(self, seq):
>> >>>         """Choose a random element from a non-empty sequence."""
>> >>>         try:
>> >>>             i = self._randbelow(len(seq))
>> >>>         except ValueError:
>> >>>             raise IndexError('Cannot choose from an empty sequence')
>> >>>         return seq[i]
>> >>>
>> >>>
>> >>> And here is my proposed (tested) version:
>> >>>
>> >>>
>> >>>     def choice(self, it):
>> >>>         """Choose a random element from a non-empty iterable."""
>> >>>         try:
>> >>>             it[0]
>> >>>         except IndexError:
>> >>>             raise IndexError('Cannot choose from an empty sequence')
>> >>>         except TypeError:
>> >>>             it = tuple(it)
>> >>>         try:
>> >>>             i = self._randbelow(len(it))
>> >>>         except ValueError:
>> >>>             raise IndexError('Cannot choose from an empty iterable')
>> >>>         return it[i]
>> >>>
>> >>>
>> >>> This works on (e.g.) a list/tuple/string, a set, a dictionary view or
>> >>> a
>> >>> generator.
>> >>> Obviously the generator has to be 'fresh' (i.e. any previously
>> >>> consumed
>> >>> values will be excluded from the choice) and 'throw-away'
>> >>> (random.choice
>> >>> will exhaust it).
>> >>> But it means you can write code like
>> >>>     random.choice(x for x in xrange(10) if x%3)    # this feels quite
>> >>> "Pythonic" to me!
>> >>>
>> >>> (I experimented with versions that, when passed an object that
>> >>> supported
>> >>> len() but not indexing, viz. a set, iterated through the object as far
>> >>> as
>> >>> necessary instead of converting it to a list.  But I found that this
>> >>> was
>> >>> slower in practice as well as more complicated.  There might be a
>> >>> memory
>> >>> issue with huge iterables, but we are no worse off using existing
>> >>> workarounds for those.)
>> >>>
>> >>> Downsides of proposed version:
>> >>>     (1) Backward compatibility.  Could mask an error if a "wrong"
>> >>> object,
>> >>> but one that happens to be an iterable, is passed to random.choice.
>> >>>           There is also slightly different behaviour if a dictionary
>> >>> (D) is
>> >>> passed (an unusual thing to do):
>> >>>                 The current version will get a random integer in the
>> >>> range
>> >>> [0, len(D)), try to use that as a key of D,
>> >>>                    and raise KeyError if that fails.
>> >>>                 The proposed version behaves similarly except that it
>> >>> will
>> >>> always raise "KeyError: 0" if 0 is not a key of D.
>> >>>                 One could argue that this is an improvement, if it
>> >>> matters
>> >>> at all (error reproducibility).
>> >>>                 And code that deliberately exploits the existing
>> >>> behaviour,
>> >>> provided that it uses a dictionary whose keys are consecutive integers
>> >>> from
>> >>> 0 upwards, e.g.
>> >>>                     D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 :
>> >>> 'three', 4 :
>> >>> 'four' }
>> >>>                     d = random.choice(D)    # equivalent, in this
>> >>> example,
>> >>> to "d = random.choice(list(D.values()))"
>> >>>                     if d == 'zero':
>> >>>                         ...
>> >>>                 will continue to work (although one might argue that
>> >>> it
>> >>> deserves not to).
>> >>>     (2) Speed - the proposed version is almost certainly slightly
>> >>> slower
>> >>> when called with a sequence.
>> >>>          For what it's worth, I tried to measure the difference on my
>> >>> several-years-old Windows PC, but is was hard to measure accurately,
>> >>> even
>> >>> over millions of iterations.
>> >>>          All I can say is that it appeared to be a small fraction of a
>> >>> millisecond, possibly in the region of 50 nanoseconds, per call.
>> >>>
>> >>> Best wishes
>> >>> Rob Cliffe
>> >>> _______________________________________________
>> >>> Python-ideas mailing list
>> >>> Python-ideas at python.org
>> >>> https://mail.python.org/mailman/listinfo/python-ideas
>> >>> Code of Conduct: http://python.org/psf/codeofconduct/
>> >>
>> >>
>> >>
>> >> --
>> >> --Guido van Rossum (python.org/~guido)
>> >> _______________________________________________
>> >> Python-ideas mailing list
>> >> Python-ideas at python.org
>> >> https://mail.python.org/mailman/listinfo/python-ideas
>> >> Code of Conduct: http://python.org/psf/codeofconduct/
>> > _______________________________________________
>> > Python-ideas mailing list
>> > Python-ideas at python.org
>> > https://mail.python.org/mailman/listinfo/python-ideas
>> > Code of Conduct: http://python.org/psf/codeofconduct/
>>
>>
>>
>> --
>> --Guido van Rossum (python.org/~guido)
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas at python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/
>
>