Reservoir sampling is super handy. For a Pythonic introduction, along with links to implementation, see here: https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dippi... Even though I love the elegance enough to write and illustrate about it, I'd be very surprised to have found it built into Python at this low of a level. Mahmoud On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <guido@python.org> wrote:
Ow, sorry! That's the algorithm I meant to reference (the link I actually gave is for a different situation). See also https://en.wikipedia.org/wiki/Reservoir_sampling.
It does indeed avoid storing a copy of the sequence -- at the cost of N calls to random(). Try timing it -- I wouldn't be surprised if copying a set of N elements into a tuple is a lot faster than N random() calls.
So we're stuck with two alternatives, neither of which is always the best: (1) just copy the sequence (like Rob's proposal) -- this loses big time if the sequence is large and not already held in memory; (2) use reservoir sampling, at the cost of many more random() calls.
Since Rob didn't link to the quoted conversation I don't have it handy to guess what Raymond meant with "Bentley" either, but I'm guessing Jon Bentley wrote about reservoir sampling (before it was named that). I recall hearing about the algorithm for the first time in the late '80s from a coworker with Bell Labs ties.
(Later, Ethan wrote)
So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?
That's pretty much it. Were this 1995, I probably would have put some toy in the stdlib. But these days we should be more careful than that.
--Guido
On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta <cognetta.marco@gmail.com> wrote:
Maybe I am misunderstanding but for a generator, couldn't you avoid storing it in memory and just using the following online algorithm to save space? http://stackoverflow.com/a/23352100
(I hope this message goes through correctly, first time participating in a discussion on this mailing list).
-Marco
On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido@python.org> wrote:
The problem with your version is that copying the input is slow if it is large.
Raymond was referencing the top answer here:
http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand... .
It's also slow though (draws N random values if the input is of length N).
On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe@btinternet.com> wrote:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) ) There is a simple workaround (convert the set to a tuple or list before passing it) but ...
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable. Raymond Hettinger commented "It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases." Sven replied "Again, there definitely are real world use cases ..."
I would like to revive this suggestion. I don't understand Raymond's comment. It seems to me that a reasonable implementation is better and more friendly than none, particularly for newbies. This is the source code for random.choice in Python 3.2 (as far as I know it hasn't changed since):
def choice(self, seq): """Choose a random element from a non-empty sequence.""" try: i = self._randbelow(len(seq)) except ValueError: raise IndexError('Cannot choose from an empty sequence') return seq[i]
And here is my proposed (tested) version:
def choice(self, it): """Choose a random element from a non-empty iterable.""" try: it[0] except IndexError: raise IndexError('Cannot choose from an empty sequence') except TypeError: it = tuple(it) try: i = self._randbelow(len(it)) except ValueError: raise IndexError('Cannot choose from an empty iterable') return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like random.choice(x for x in xrange(10) if x%3) # this feels quite "Pythonic" to me!
(I experimented with versions that, when passed an object that supported len() but not indexing, viz. a set, iterated through the object as far as necessary instead of converting it to a list. But I found that this was slower in practice as well as more complicated. There might be a memory issue with huge iterables, but we are no worse off using existing workarounds for those.)
Downsides of proposed version: (1) Backward compatibility. Could mask an error if a "wrong" object, but one that happens to be an iterable, is passed to random.choice. There is also slightly different behaviour if a dictionary (D) is passed (an unusual thing to do): The current version will get a random integer in the range [0, len(D)), try to use that as a key of D, and raise KeyError if that fails. The proposed version behaves similarly except that it will always raise "KeyError: 0" if 0 is not a key of D. One could argue that this is an improvement, if it matters at all (error reproducibility). And code that deliberately exploits the existing behaviour, provided that it uses a dictionary whose keys are consecutive integers from 0 upwards, e.g. D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 : 'four' } d = random.choice(D) # equivalent, in this example, to "d = random.choice(list(D.values()))" if d == 'zero': ... will continue to work (although one might argue that it deserves not to). (2) Speed - the proposed version is almost certainly slightly slower when called with a sequence. For what it's worth, I tried to measure the difference on my several-years-old Windows PC, but is was hard to measure accurately, even over millions of iterations. All I can say is that it appeared to be a small fraction of a millisecond, possibly in the region of 50 nanoseconds, per call.
Best wishes Rob Cliffe _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
-- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/