Re: [Python-ideas] random.choice on non-sequence

12 Apr 2016

      Reservoir sampling is super handy. For a Pythonic introduction, along with
links to implementation, see here:
https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dippi...

Even though I love the elegance enough to write and illustrate about it,
I'd be very surprised to have found it built into Python at this low of a
level.

Mahmoud

On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <guido@python.org> wrote:
...
Ow, sorry! That's the algorithm I meant to reference (the link I
actually gave is for a different situation). See also
https://en.wikipedia.org/wiki/Reservoir_sampling.
It does indeed avoid storing a copy of the sequence -- at the cost of
N calls to random(). Try timing it -- I wouldn't be surprised if
copying a set of N elements into a tuple is a lot faster than N
random() calls.
So we're stuck with two alternatives, neither of which is always the
best: (1) just copy the sequence (like Rob's proposal) -- this loses
big time if the sequence is large and not already held in memory; (2)
use reservoir sampling, at the cost of many more random() calls.
Since Rob didn't link to the quoted conversation I don't have it handy
to guess what Raymond meant with "Bentley" either, but I'm guessing
Jon Bentley wrote about reservoir sampling (before it was named that).
I recall hearing about the algorithm for the first time in the late
'80s from a coworker with Bell Labs ties.
(Later, Ethan wrote)
...
So the objection is that because performance can vary widely it is
better for users to select their own algorithm rather than rely on a
one-size-fits-all stdlib solution?
That's pretty much it. Were this 1995, I probably would have put some
toy in the stdlib. But these days we should be more careful than that.
--Guido
On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta
<cognetta.marco@gmail.com> wrote:
...
Maybe I am misunderstanding but for a generator, couldn't you avoid
storing it in memory and just using the following online algorithm to
save space? http://stackoverflow.com/a/23352100
(I hope this message goes through correctly, first time participating
in a discussion on this mailing list).
-Marco
On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido@python.org>
wrote:
...
The problem with your version is that copying the input is slow if it
is large.
Raymond was referencing the top answer here:
http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand...
.
...
It's also slow though (draws N random values if the input is of length
N).
On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe@btinternet.com>
wrote:
...
It surprised me a bit the first time I realised that random.choice did
not
work on a set.  (One expects "everything" to "just work" in Python!
:-) )
There is a simple workaround (convert the set to a tuple or list before
passing it) but ...
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random
module'
on Python-Ideas, one of which was to allow random.choice to work on an
arbitrary iterable.
Raymond Hettinger commented
    "It could be generalized to arbitrary iterables (Bentley provides
an
example of how to do this) but it is fragile (i.e. falls apart badly
with
weak random number generators) and doesn't correspond well with real
use
cases."
Sven replied
    "Again, there definitely are real world use cases ..."
I would like to revive this suggestion.  I don't understand Raymond's
comment.  It seems to me that a reasonable implementation is better
and more
friendly than none, particularly for newbies.
This is the source code for random.choice in Python 3.2 (as far as I
know it
hasn't changed since):
def choice(self, seq):
        """Choose a random element from a non-empty sequence."""
        try:
            i = self._randbelow(len(seq))
        except ValueError:
            raise IndexError('Cannot choose from an empty sequence')
        return seq[i]
And here is my proposed (tested) version:
def choice(self, it):
        """Choose a random element from a non-empty iterable."""
        try:
            it[0]
        except IndexError:
            raise IndexError('Cannot choose from an empty sequence')
        except TypeError:
            it = tuple(it)
        try:
            i = self._randbelow(len(it))
        except ValueError:
            raise IndexError('Cannot choose from an empty iterable')
        return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a
generator.
Obviously the generator has to be 'fresh' (i.e. any previously consumed
values will be excluded from the choice) and 'throw-away'
(random.choice
will exhaust it).
But it means you can write code like
    random.choice(x for x in xrange(10) if x%3)    # this feels quite
"Pythonic" to me!
(I experimented with versions that, when passed an object that
supported
len() but not indexing, viz. a set, iterated through the object as far
as
necessary instead of converting it to a list.  But I found that this
was
slower in practice as well as more complicated.  There might be a
memory
issue with huge iterables, but we are no worse off using existing
workarounds for those.)
Downsides of proposed version:
    (1) Backward compatibility.  Could mask an error if a "wrong"
object,
but one that happens to be an iterable, is passed to random.choice.
          There is also slightly different behaviour if a dictionary
(D) is
passed (an unusual thing to do):
                The current version will get a random integer in the
range
[0, len(D)), try to use that as a key of D,
                   and raise KeyError if that fails.
                The proposed version behaves similarly except that it
will
always raise "KeyError: 0" if 0 is not a key of D.
                One could argue that this is an improvement, if it
matters
at all (error reproducibility).
                And code that deliberately exploits the existing
behaviour,
provided that it uses a dictionary whose keys are consecutive integers
from
0 upwards, e.g.
                    D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 :
'three', 4 :
'four' }
                    d = random.choice(D)    # equivalent, in this
example,
to "d = random.choice(list(D.values()))"
                    if d == 'zero':
                        ...
                will continue to work (although one might argue that it
deserves not to).
    (2) Speed - the proposed version is almost certainly slightly
slower
when called with a sequence.
         For what it's worth, I tried to measure the difference on my
several-years-old Windows PC, but is was hard to measure accurately,
even
over millions of iterations.
         All I can say is that it appeared to be a small fraction of a
millisecond, possibly in the region of 50 nanoseconds, per call.
Best wishes
Rob Cliffe
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
--
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
--
--Guido van Rossum (python.org/~guido)
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] random.choice on non-sequence

Mahmoud Hashemi