<div dir="ltr"><div><div>Reservoir sampling is super handy. For a Pythonic introduction, along with links to implementation, see here: <a href="https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dipping_into_the_stream">https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dipping_into_the_stream</a><br><br></div>Even though I love the elegance enough to write and illustrate about it, I'd be very surprised to have found it built into Python at this low of a level.<br><br></div>Mahmoud<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <span dir="ltr"><<a href="mailto:guido@python.org" target="_blank">guido@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Ow, sorry! That's the algorithm I meant to reference (the link I<br>

actually gave is for a different situation). See also<br>

<a href="https://en.wikipedia.org/wiki/Reservoir_sampling" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Reservoir_sampling</a>.<br>

<br>

It does indeed avoid storing a copy of the sequence -- at the cost of<br>

N calls to random(). Try timing it -- I wouldn't be surprised if<br>

copying a set of N elements into a tuple is a lot faster than N<br>

random() calls.<br>

<br>

So we're stuck with two alternatives, neither of which is always the<br>

best: (1) just copy the sequence (like Rob's proposal) -- this loses<br>

big time if the sequence is large and not already held in memory; (2)<br>

use reservoir sampling, at the cost of many more random() calls.<br>

<br>

Since Rob didn't link to the quoted conversation I don't have it handy<br>

to guess what Raymond meant with "Bentley" either, but I'm guessing<br>

Jon Bentley wrote about reservoir sampling (before it was named that).<br>

I recall hearing about the algorithm for the first time in the late<br>

'80s from a coworker with Bell Labs ties.<br>

<br>

(Later, Ethan wrote)<br>

<span class="">> So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?<br>

<br>

</span>That's pretty much it. Were this 1995, I probably would have put some<br>

toy in the stdlib. But these days we should be more careful than that.<br>

<span class="HOEnZb"><font color="#888888"><br>

--Guido<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta<br>

<<a href="mailto:cognetta.marco@gmail.com">cognetta.marco@gmail.com</a>> wrote:<br>

> Maybe I am misunderstanding but for a generator, couldn't you avoid<br>

> storing it in memory and just using the following online algorithm to<br>

> save space? <a href="http://stackoverflow.com/a/23352100" rel="noreferrer" target="_blank">http://stackoverflow.com/a/23352100</a><br>

><br>

> (I hope this message goes through correctly, first time participating<br>

> in a discussion on this mailing list).<br>

><br>

> -Marco<br>

><br>

> On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <<a href="mailto:guido@python.org">guido@python.org</a>> wrote:<br>

>> The problem with your version is that copying the input is slow if it is large.<br>

>><br>

>> Raymond was referencing the top answer here:<br>

>> <a href="http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values" rel="noreferrer" target="_blank">http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-random-combination-of-values</a>.<br>

>> It's also slow though (draws N random values if the input is of length<br>

>> N).<br>

>><br>

>> On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <<a href="mailto:rob.cliffe@btinternet.com">rob.cliffe@btinternet.com</a>> wrote:<br>

>>> It surprised me a bit the first time I realised that random.choice did not<br>

>>> work on a set.  (One expects "everything" to "just work" in Python! :-) )<br>

>>> There is a simple workaround (convert the set to a tuple or list before<br>

>>> passing it) but ...<br>

>>><br>

>>> On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module'<br>

>>> on Python-Ideas, one of which was to allow random.choice to work on an<br>

>>> arbitrary iterable.<br>

>>> Raymond Hettinger commented<br>

>>>     "It could be generalized to arbitrary iterables (Bentley provides an<br>

>>> example of how to do this) but it is fragile (i.e. falls apart badly with<br>

>>> weak random number generators) and doesn't correspond well with real use<br>

>>> cases."<br>

>>> Sven replied<br>

>>>     "Again, there definitely are real world use cases ..."<br>

>>><br>

>>> I would like to revive this suggestion.  I don't understand Raymond's<br>

>>> comment.  It seems to me that a reasonable implementation is better and more<br>

>>> friendly than none, particularly for newbies.<br>

>>> This is the source code for random.choice in Python 3.2 (as far as I know it<br>

>>> hasn't changed since):<br>

>>><br>

>>><br>

>>>     def choice(self, seq):<br>

>>>         """Choose a random element from a non-empty sequence."""<br>

>>>         try:<br>

>>>             i = self._randbelow(len(seq))<br>

>>>         except ValueError:<br>

>>>             raise IndexError('Cannot choose from an empty sequence')<br>

>>>         return seq[i]<br>

>>><br>

>>><br>

>>> And here is my proposed (tested) version:<br>

>>><br>

>>><br>

>>>     def choice(self, it):<br>

>>>         """Choose a random element from a non-empty iterable."""<br>

>>>         try:<br>

>>>             it[0]<br>

>>>         except IndexError:<br>

>>>             raise IndexError('Cannot choose from an empty sequence')<br>

>>>         except TypeError:<br>

>>>             it = tuple(it)<br>

>>>         try:<br>

>>>             i = self._randbelow(len(it))<br>

>>>         except ValueError:<br>

>>>             raise IndexError('Cannot choose from an empty iterable')<br>

>>>         return it[i]<br>

>>><br>

>>><br>

>>> This works on (e.g.) a list/tuple/string, a set, a dictionary view or a<br>

>>> generator.<br>

>>> Obviously the generator has to be 'fresh' (i.e. any previously consumed<br>

>>> values will be excluded from the choice) and 'throw-away' (random.choice<br>

>>> will exhaust it).<br>

>>> But it means you can write code like<br>

>>>     random.choice(x for x in xrange(10) if x%3)    # this feels quite<br>

>>> "Pythonic" to me!<br>

>>><br>

>>> (I experimented with versions that, when passed an object that supported<br>

>>> len() but not indexing, viz. a set, iterated through the object as far as<br>

>>> necessary instead of converting it to a list.  But I found that this was<br>

>>> slower in practice as well as more complicated.  There might be a memory<br>

>>> issue with huge iterables, but we are no worse off using existing<br>

>>> workarounds for those.)<br>

>>><br>

>>> Downsides of proposed version:<br>

>>>     (1) Backward compatibility.  Could mask an error if a "wrong" object,<br>

>>> but one that happens to be an iterable, is passed to random.choice.<br>

>>>           There is also slightly different behaviour if a dictionary (D) is<br>

>>> passed (an unusual thing to do):<br>

>>>                 The current version will get a random integer in the range<br>

>>> [0, len(D)), try to use that as a key of D,<br>

>>>                    and raise KeyError if that fails.<br>

>>>                 The proposed version behaves similarly except that it will<br>

>>> always raise "KeyError: 0" if 0 is not a key of D.<br>

>>>                 One could argue that this is an improvement, if it matters<br>

>>> at all (error reproducibility).<br>

>>>                 And code that deliberately exploits the existing behaviour,<br>

>>> provided that it uses a dictionary whose keys are consecutive integers from<br>

>>> 0 upwards, e.g.<br>

>>>                     D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 :<br>

>>> 'four' }<br>

>>>                     d = random.choice(D)    # equivalent, in this example,<br>

>>> to "d = random.choice(list(D.values()))"<br>

>>>                     if d == 'zero':<br>

>>>                         ...<br>

>>>                 will continue to work (although one might argue that it<br>

>>> deserves not to).<br>

>>>     (2) Speed - the proposed version is almost certainly slightly slower<br>

>>> when called with a sequence.<br>

>>>          For what it's worth, I tried to measure the difference on my<br>

>>> several-years-old Windows PC, but is was hard to measure accurately, even<br>

>>> over millions of iterations.<br>

>>>          All I can say is that it appeared to be a small fraction of a<br>

>>> millisecond, possibly in the region of 50 nanoseconds, per call.<br>

>>><br>

>>> Best wishes<br>

>>> Rob Cliffe<br>

>>> _______________________________________________<br>

>>> Python-ideas mailing list<br>

>>> <a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

>>> <a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>

>>> Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>

>><br>

>><br>

>><br>

>> --<br>

>> --Guido van Rossum (<a href="http://python.org/~guido" rel="noreferrer" target="_blank">python.org/~guido</a>)<br>

>> _______________________________________________<br>

>> Python-ideas mailing list<br>

>> <a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

>> <a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>

>> Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>

> _______________________________________________<br>

> Python-ideas mailing list<br>

> <a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

> <a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>

> Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>

<br>

<br>

<br>

--<br>

--Guido van Rossum (<a href="http://python.org/~guido" rel="noreferrer" target="_blank">python.org/~guido</a>)<br>

_______________________________________________<br>

Python-ideas mailing list<br>

<a href="mailto:Python-ideas@python.org">Python-ideas@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/python-ideas" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/python-ideas</a><br>

Code of Conduct: <a href="http://python.org/psf/codeofconduct/" rel="noreferrer" target="_blank">http://python.org/psf/codeofconduct/</a><br>

</div></div></blockquote></div><br></div>