[Python-ideas] random.choice on non-sequence

Tue Apr 12 19:05:27 EDT 2016

Reservoir sampling is super handy. For a Pythonic introduction, along with
links to implementation, see here:
https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dipping_into_the_stream

Even though I love the elegance enough to write and illustrate about it,
I'd be very surprised to have found it built into Python at this low of a
level.

Mahmoud

On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <guido at python.org> wrote:

> Ow, sorry! That's the algorithm I meant to reference (the link I
> actually gave is for a different situation). See also
> https://en.wikipedia.org/wiki/Reservoir_sampling.
>
> It does indeed avoid storing a copy of the sequence -- at the cost of
> N calls to random(). Try timing it -- I wouldn't be surprised if
> copying a set of N elements into a tuple is a lot faster than N
> random() calls.
>
> So we're stuck with two alternatives, neither of which is always the
> best: (1) just copy the sequence (like Rob's proposal) -- this loses
> big time if the sequence is large and not already held in memory; (2)
> use reservoir sampling, at the cost of many more random() calls.
>
> Since Rob didn't link to the quoted conversation I don't have it handy
> to guess what Raymond meant with "Bentley" either, but I'm guessing
> Jon Bentley wrote about reservoir sampling (before it was named that).
> I recall hearing about the algorithm for the first time in the late
> '80s from a coworker with Bell Labs ties.
>
> (Later, Ethan wrote)
> > So the objection is that because performance can vary widely it is
> better for users to select their own algorithm rather than rely on a
> one-size-fits-all stdlib solution?
>
> That's pretty much it. Were this 1995, I probably would have put some
> toy in the stdlib. But these days we should be more careful than that.
>
> --Guido
>
> On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta
> <cognetta.marco at gmail.com> wrote:
> > Maybe I am misunderstanding but for a generator, couldn't you avoid
> > storing it in memory and just using the following online algorithm to
> > save space? http://stackoverflow.com/a/23352100
> >
> > (I hope this message goes through correctly, first time participating
> > in a discussion on this mailing list).
> >
> > -Marco
> >
> > On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido at python.org>
> wrote:
> >> The problem with your version is that copying the input is slow if it
> is large.
> >>
> >> Raymond was referencing the top answer here:
> >>
> http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values
> .
> >> It's also slow though (draws N random values if the input is of length
> >> N).
> >>
> >> On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe at btinternet.com>
> wrote:
> >>> It surprised me a bit the first time I realised that random.choice did
> not
> >>> work on a set.  (One expects "everything" to "just work" in Python!
> :-) )
> >>> There is a simple workaround (convert the set to a tuple or list before
> >>> passing it) but ...
> >>>
> >>> On 2011-06-23, Sven Marnach posted 'A few suggestions for the random
> module'
> >>> on Python-Ideas, one of which was to allow random.choice to work on an
> >>> arbitrary iterable.
> >>> Raymond Hettinger commented
> >>>     "It could be generalized to arbitrary iterables (Bentley provides
> an
> >>> example of how to do this) but it is fragile (i.e. falls apart badly
> with
> >>> weak random number generators) and doesn't correspond well with real
> use
> >>> cases."
> >>> Sven replied
> >>>     "Again, there definitely are real world use cases ..."
> >>>
> >>> I would like to revive this suggestion.  I don't understand Raymond's
> >>> comment.  It seems to me that a reasonable implementation is better
> and more
> >>> friendly than none, particularly for newbies.
> >>> This is the source code for random.choice in Python 3.2 (as far as I
> know it
> >>> hasn't changed since):
> >>>
> >>>
> >>>     def choice(self, seq):
> >>>         """Choose a random element from a non-empty sequence."""
> >>>         try:
> >>>             i = self._randbelow(len(seq))
> >>>         except ValueError:
> >>>             raise IndexError('Cannot choose from an empty sequence')
> >>>         return seq[i]
> >>>
> >>>
> >>> And here is my proposed (tested) version:
> >>>
> >>>
> >>>     def choice(self, it):
> >>>         """Choose a random element from a non-empty iterable."""
> >>>         try:
> >>>             it[0]
> >>>         except IndexError:
> >>>             raise IndexError('Cannot choose from an empty sequence')
> >>>         except TypeError:
> >>>             it = tuple(it)
> >>>         try:
> >>>             i = self._randbelow(len(it))
> >>>         except ValueError:
> >>>             raise IndexError('Cannot choose from an empty iterable')
> >>>         return it[i]
> >>>
> >>>
> >>> This works on (e.g.) a list/tuple/string, a set, a dictionary view or a
> >>> generator.
> >>> Obviously the generator has to be 'fresh' (i.e. any previously consumed
> >>> values will be excluded from the choice) and 'throw-away'
> (random.choice
> >>> will exhaust it).
> >>> But it means you can write code like
> >>>     random.choice(x for x in xrange(10) if x%3)    # this feels quite
> >>> "Pythonic" to me!
> >>>
> >>> (I experimented with versions that, when passed an object that
> supported
> >>> len() but not indexing, viz. a set, iterated through the object as far
> as
> >>> necessary instead of converting it to a list.  But I found that this
> was
> >>> slower in practice as well as more complicated.  There might be a
> memory
> >>> issue with huge iterables, but we are no worse off using existing
> >>> workarounds for those.)
> >>>
> >>> Downsides of proposed version:
> >>>     (1) Backward compatibility.  Could mask an error if a "wrong"
> object,
> >>> but one that happens to be an iterable, is passed to random.choice.
> >>>           There is also slightly different behaviour if a dictionary
> (D) is
> >>> passed (an unusual thing to do):
> >>>                 The current version will get a random integer in the
> range
> >>> [0, len(D)), try to use that as a key of D,
> >>>                    and raise KeyError if that fails.
> >>>                 The proposed version behaves similarly except that it
> will
> >>> always raise "KeyError: 0" if 0 is not a key of D.
> >>>                 One could argue that this is an improvement, if it
> matters
> >>> at all (error reproducibility).
> >>>                 And code that deliberately exploits the existing
> behaviour,
> >>> provided that it uses a dictionary whose keys are consecutive integers
> from
> >>> 0 upwards, e.g.
> >>>                     D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 :
> 'three', 4 :
> >>> 'four' }
> >>>                     d = random.choice(D)    # equivalent, in this
> example,
> >>> to "d = random.choice(list(D.values()))"
> >>>                     if d == 'zero':
> >>>                         ...
> >>>                 will continue to work (although one might argue that it
> >>> deserves not to).
> >>>     (2) Speed - the proposed version is almost certainly slightly
> slower
> >>> when called with a sequence.
> >>>          For what it's worth, I tried to measure the difference on my
> >>> several-years-old Windows PC, but is was hard to measure accurately,
> even
> >>> over millions of iterations.
> >>>          All I can say is that it appeared to be a small fraction of a
> >>> millisecond, possibly in the region of 50 nanoseconds, per call.
> >>>
> >>> Best wishes
> >>> Rob Cliffe
> >>> _______________________________________________
> >>> Python-ideas mailing list
> >>> Python-ideas at python.org
> >>> https://mail.python.org/mailman/listinfo/python-ideas
> >>> Code of Conduct: http://python.org/psf/codeofconduct/
> >>
> >>
> >>
> >> --
> >> --Guido van Rossum (python.org/~guido)
> >> _______________________________________________
> >> Python-ideas mailing list
> >> Python-ideas at python.org
> >> https://mail.python.org/mailman/listinfo/python-ideas
> >> Code of Conduct: http://python.org/psf/codeofconduct/
> > _______________________________________________
> > Python-ideas mailing list
> > Python-ideas at python.org
> > https://mail.python.org/mailman/listinfo/python-ideas
> > Code of Conduct: http://python.org/psf/codeofconduct/
>
>
>
> --
> --Guido van Rossum (python.org/~guido)
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20160412/921752de/attachment-0001.html>