
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) ) There is a simple workaround (convert the set to a tuple or list before passing it) but ...
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable. Raymond Hettinger commented "It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases." Sven replied "Again, there definitely are real world use cases ..."
I would like to revive this suggestion. I don't understand Raymond's comment. It seems to me that a reasonable implementation is better and more friendly than none, particularly for newbies. This is the source code for random.choice in Python 3.2 (as far as I know it hasn't changed since):
def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    try:
        i = self._randbelow(len(seq))
    except ValueError:
        raise IndexError('Cannot choose from an empty sequence')
    return seq[i]
And here is my proposed (tested) version:
def choice(self, it):
    """Choose a random element from a non-empty iterable."""
    try:
        it[0]
    except IndexError:
        raise IndexError('Cannot choose from an empty sequence')
    except TypeError:
        it = tuple(it)
    try:
        i = self._randbelow(len(it))
    except ValueError:
        raise IndexError('Cannot choose from an empty iterable')
    return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like

    random.choice(x for x in range(10) if x % 3)  # this feels quite "Pythonic" to me!
(I experimented with versions that, when passed an object that supported len() but not indexing, viz. a set, iterated through the object as far as necessary instead of converting it to a list. But I found that this was slower in practice as well as more complicated. There might be a memory issue with huge iterables, but we are no worse off using existing workarounds for those.)
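To make the proposal concrete, here is a standalone sketch of the same fallback logic as a module-level function (the name `choice_any` and the use of `random.randrange` in place of the private `_randbelow` are mine, purely for illustration):

```python
import random

def choice_any(it, _rng=random):
    """Sketch of the proposed behaviour: index directly when the
    argument supports it, otherwise materialise it into a tuple."""
    try:
        it[0]
    except IndexError:
        raise IndexError('Cannot choose from an empty sequence')
    except TypeError:
        it = tuple(it)          # set, generator, dict view, ...
    if not it:
        raise IndexError('Cannot choose from an empty iterable')
    return it[_rng.randrange(len(it))]

print(choice_any([10, 20, 30]))                    # sequence: direct indexing
print(choice_any({'a', 'b', 'c'}))                 # set: converted to a tuple
print(choice_any(x for x in range(10) if x % 3))   # generator: consumed once
```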
Downsides of proposed version:

(1) Backward compatibility. Could mask an error if a "wrong" object, but one that happens to be an iterable, is passed to random.choice.
There is also slightly different behaviour if a dictionary (D) is passed (an unusual thing to do): the current version will get a random integer in the range [0, len(D)), try to use that as a key of D, and raise KeyError if that fails. The proposed version behaves similarly except that it will always raise "KeyError: 0" if 0 is not a key of D. One could argue that this is an improvement, if it matters at all (error reproducibility).
And code that deliberately exploits the existing behaviour, provided that it uses a dictionary whose keys are consecutive integers from 0 upwards, e.g.

    D = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four'}
    d = random.choice(D)  # equivalent, in this example, to "d = random.choice(list(D.values()))"
    if d == 'zero':
        ...

will continue to work (although one might argue that it deserves not to).

(2) Speed - the proposed version is almost certainly slightly slower when called with a sequence. For what it's worth, I tried to measure the difference on my several-years-old Windows PC, but it was hard to measure accurately, even over millions of iterations. All I can say is that it appeared to be a small fraction of a millisecond, possibly in the region of 50 nanoseconds, per call.
Best wishes Rob Cliffe

The problem with your version is that copying the input is slow if it is large.
Raymond was referencing the top answer here: http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand.... It's also slow though (draws N random values if the input is of length N).
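The algorithm Guido alludes to (keep the i-th element seen with probability 1/i, one random draw per element) can be sketched as follows; the function name is mine, and this is only an illustration of the cost he describes:

```python
import random

def choice_single_pass(iterable, _rng=random):
    """Uniformly random element in one pass, without storing the input.
    Keeps the n-th element with probability 1/n, so it costs one random
    draw per element -- N draws for an input of length N."""
    chosen = None
    n = 0
    for item in iterable:
        n += 1
        if _rng.randrange(n) == 0:   # true with probability 1/n
            chosen = item
    if n == 0:
        raise IndexError('Cannot choose from an empty iterable')
    return chosen

print(choice_single_pass(x for x in range(10) if x % 3))
```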
On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe rob.cliffe@btinternet.com wrote:

Maybe I am misunderstanding but for a generator, couldn't you avoid storing it in memory and just using the following online algorithm to save space? http://stackoverflow.com/a/23352100
(I hope this message goes through correctly, first time participating in a discussion on this mailing list).
-Marco
On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum guido@python.org wrote:

Ow, sorry! That's the algorithm I meant to reference (the link I actually gave is for a different situation). See also https://en.wikipedia.org/wiki/Reservoir_sampling.
It does indeed avoid storing a copy of the sequence -- at the cost of N calls to random(). Try timing it -- I wouldn't be surprised if copying a set of N elements into a tuple is a lot faster than N random() calls.
So we're stuck with two alternatives, neither of which is always the best: (1) just copy the sequence (like Rob's proposal) -- this loses big time if the sequence is large and not already held in memory; (2) use reservoir sampling, at the cost of many more random() calls.
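Guido's hypothesis (copying into a tuple beats N random() calls) is easy to check roughly. A quick sketch, not a rigorous benchmark; the sizes, repeat counts, and function names are mine:

```python
import random
import timeit

N = 50_000
s = set(range(N))

def by_copy():
    """Alternative (1): materialise the set, then one random draw."""
    t = tuple(s)
    return t[random.randrange(len(t))]

def by_reservoir():
    """Alternative (2): reservoir sampling, one random draw per element."""
    chosen = None
    n = 0
    for item in s:
        n += 1
        if random.randrange(n) == 0:
            chosen = item
    return chosen

print('copy:     ', timeit.timeit(by_copy, number=10))
print('reservoir:', timeit.timeit(by_reservoir, number=10))
```

On typical CPython builds the copy-based version tends to win by a wide margin, since tuple(s) is a single C-level loop while the reservoir makes N Python-level random calls; but the copy needs O(N) extra memory.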
Since Rob didn't link to the quoted conversation I don't have it handy to guess what Raymond meant with "Bentley" either, but I'm guessing Jon Bentley wrote about reservoir sampling (before it was named that). I recall hearing about the algorithm for the first time in the late '80s from a coworker with Bell Labs ties.
(Later, Ethan wrote)
So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?
That's pretty much it. Were this 1995, I probably would have put some toy in the stdlib. But these days we should be more careful than that.
--Guido
On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta cognetta.marco@gmail.com wrote:

Reservoir sampling is super handy. For a Pythonic introduction, along with links to implementation, see here: https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dippi...
Even though I love the elegance enough to write and illustrate about it, I'd be very surprised to have found it built into Python at this low of a level.
Mahmoud
On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum guido@python.org wrote:

At the risk of making this not very easily accessible, couldn't we use both approaches (reservoir sampling and just loading the whole thing into memory) and apply the ski-rental problem? We could come up with some cost for calling random and use that to determine when we should just give up and load it into memory if we do not know the length of the sequence produced by the generator. If the user knows something about the generator beforehand they could specify in some optional parameter whether or not they want to use reservoir sampling or just load it into memory at the start.
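One way the hybrid could look (my own sketch of the idea, with an arbitrary cutoff; not a concrete stdlib proposal): buffer up to a cutoff, and if the input turns out to be larger, continue with reservoir sampling over the remainder, which keeps the selection uniform:

```python
import random

def choice_hybrid(iterable, cutoff=10_000, _rng=random):
    """Buffer up to `cutoff` items. Small inputs cost a single random
    draw; larger ones fall back to reservoir sampling for the rest.
    The result is uniform either way: the buffered pick is uniform over
    the first len(buf) items, and each later item n replaces it with
    probability 1/n, the standard reservoir invariant."""
    it = iter(iterable)
    buf = []
    for item in it:
        buf.append(item)
        if len(buf) >= cutoff:
            break
    if not buf:
        raise IndexError('Cannot choose from an empty iterable')
    chosen = buf[_rng.randrange(len(buf))]  # one draw for the buffered prefix
    n = len(buf)
    for item in it:                         # reservoir continuation, if any
        n += 1
        if _rng.randrange(n) == 0:
            chosen = item
    return chosen

print(choice_hybrid(range(100)))                     # stays in the buffer
print(choice_hybrid(x for x in range(50_000)))       # spills into the reservoir
```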
The theory side of me likes this, but I admit it's a little ugly to be included in Python's stdlib.
-Marco
On Tue, Apr 12, 2016 at 7:05 PM, Mahmoud Hashemi mahmoud@hatnote.com wrote:
Reservoir sampling is super handy. For a Pythonic introduction, along with links to implementation, see here: https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dippi...
Even though I love the elegance enough to write and illustrate about it, I'd be very surprised to have found it built into Python at this low of a level.
Mahmoud
On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum guido@python.org wrote:
Ow, sorry! That's the algorithm I meant to reference (the link I actually gave is for a different situation). See also https://en.wikipedia.org/wiki/Reservoir_sampling.
It does indeed avoid storing a copy of the sequence -- at the cost of N calls to random(). Try timing it -- I wouldn't be surprised if copying a set of N elements into a tuple is a lot faster than N random() calls.
So we're stuck with two alternatives, neither of which is always the best: (1) just copy the sequence (like Rob's proposal) -- this loses big time if the sequence is large and not already held in memory; (2) use reservoir sampling, at the cost of many more random() calls.
Since Rob didn't link to the quoted conversation I don't have it handy to guess what Raymond meant with "Bentley" either, but I'm guessing Jon Bentley wrote about reservoir sampling (before it was named that). I recall hearing about the algorithm for the first time in the late '80s from a coworker with Bell Labs ties.
(Later, Ethan wrote)
So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?
That's pretty much it. Were this 1995, I probably would have put some toy in the stdlib. But these days we should be more careful than that.
--Guido
On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta cognetta.marco@gmail.com wrote:
Maybe I am misunderstanding but for a generator, couldn't you avoid storing it in memory and just using the following online algorithm to save space? http://stackoverflow.com/a/23352100
(I hope this message goes through correctly, first time participating in a discussion on this mailing list).
-Marco
On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum guido@python.org wrote:
The problem with your version is that copying the input is slow if it is large.
Raymond was referencing the top answer here:
http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand.... It's also slow though (draws N random values if the input is of length N).
On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe rob.cliffe@btinternet.com wrote:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) ) There is a simple workaround (convert the set to a tuple or list before passing it) but ...
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable. Raymond Hettinger commented "It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases." Sven replied "Again, there definitely are real world use cases ..."
I would like to revive this suggestion. I don't understand Raymond's comment. It seems to me that a reasonable implementation is better and more friendly than none, particularly for newbies. This is the source code for random.choice in Python 3.2 (as far as I know it hasn't changed since):
def choice(self, seq): """Choose a random element from a non-empty sequence.""" try: i = self._randbelow(len(seq)) except ValueError: raise IndexError('Cannot choose from an empty sequence') return seq[i]
And here is my proposed (tested) version:
def choice(self, it): """Choose a random element from a non-empty iterable.""" try: it[0] except IndexError: raise IndexError('Cannot choose from an empty sequence') except TypeError: it = tuple(it) try: i = self._randbelow(len(it)) except ValueError: raise IndexError('Cannot choose from an empty iterable') return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like random.choice(x for x in xrange(10) if x%3) # this feels quite "Pythonic" to me!
(I experimented with versions that, when passed an object that supported len() but not indexing, viz. a set, iterated through the object as far as necessary instead of converting it to a list. But I found that this was slower in practice as well as more complicated. There might be a memory issue with huge iterables, but we are no worse off using existing workarounds for those.)
Downsides of proposed version:
(1) Backward compatibility. It could mask an error if a "wrong" object, but one that happens to be an iterable, is passed to random.choice. There is also slightly different behaviour if a dictionary (D) is passed (an unusual thing to do): the current version will get a random integer in the range [0, len(D)), try to use that as a key of D, and raise KeyError if that fails. The proposed version behaves similarly except that it will always raise "KeyError: 0" if 0 is not a key of D. One could argue that this is an improvement, if it matters at all (error reproducibility). And code that deliberately exploits the existing behaviour, provided that it uses a dictionary whose keys are consecutive integers from 0 upwards, e.g.
    D = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four'}
    d = random.choice(D)  # equivalent, in this example, to "d = random.choice(list(D.values()))"
    if d == 'zero':
        ...
will continue to work (although one might argue that it deserves not to).
(2) Speed - the proposed version is almost certainly slightly slower when called with a sequence. For what it's worth, I tried to measure the difference on my several-years-old Windows PC, but it was hard to measure accurately, even over millions of iterations. All I can say is that it appeared to be a small fraction of a millisecond, possibly in the region of 50 nanoseconds, per call.
Best wishes
Rob Cliffe
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

On 04/12/2016 02:53 PM, Guido van Rossum wrote:
The problem with your version is that copying the input is slow if it is large.
Raymond was referencing the top answer here: http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand.... It's also slow though (draws N random values if the input is of length N).
So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?
-- ~Ethan~

Guido van Rossum wrote:
The problem with your version is that copying the input is slow if it is large.
Maybe sets should have a method that returns an indexable view? The order could be defined as equivalent to iteration order, and it would allow things like random.choice to work efficiently on sets.

On Tue, Apr 12, 2016 at 6:42 PM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
Maybe sets should have a method that returns an indexable view? The order could be defined as equivalent to iteration order, and it would allow things like random.choice to work efficiently on sets.
How would you implement it though? There are gaps in the hash table.

[Greg Ewing]
Maybe sets should have a method that returns an indexable view? The order could be defined as equivalent to iteration order, and it would allow things like random.choice to work efficiently on sets.
Well, sets and dicts are implemented very similarly in CPython. Note that a dict view (whether of keys, values or items) doesn't support indexing! It's unclear how that could be added efficiently, and the same applies to sets. Iteration works pretty efficiently, because a hidden "search finger" is maintained internally, skipping over the gaps in the hash table as needed (so iteration can, internally, make a number of probes equal to the number of hash slots, and no more than that) - but that's no help for indexing by a random integer.
I'll just add that I've never had a real need for a random selection from a set. For all the many set algorithms I've coded, an "arbitrary" element worked just as well as a "random" element, and set.pop() works fine for that.

I also haven't had a use case for random items from a set, but have had a use case for randomly ( not arbitrarily ) choosing from a dict (see trigrams: http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/)
As dicts and sets share implementation, why not both?
Anyway, the hash table has gaps, but wouldn't it be sufficiently random to pick a random index, and if it's a gap, pick another one? I suppose in theory, this could be an infinite process, but in practice, it would be O(1) with a constant of two or three... Better than iterating through on average half of the keys.
( how many gaps are there? I have no idea )
I know nothing about the implementation, and have not thought carefully about the statistics, but it seems do-able. I'll just shut up if I'm way off base.
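The retry idea above can be modeled in pure Python (a toy sketch only; `table` stands in for the internal hash table, with None marking empty slots, which CPython does not actually expose):

```python
import random

def pick_from_sparse(table, rng=random):
    # Toy model of rejection sampling over a hash table with gaps:
    # pick a uniformly random slot; if it is empty (None), pick again.
    # Expected number of probes = len(table) / number_of_filled_slots.
    while True:
        item = table[rng.randrange(len(table))]
        if item is not None:
            return item
```

At the 2/3 maximum load factor that is at most 1.5 expected probes; with a nearly empty table it can be arbitrarily many.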
BTW, isn't it impossible to randomly select from an infinite iterable anyway?
-CHB

[Chris Barker - NOAA Federal chris.barker@noaa.gov]
... Anyway, the hash table has gaps, but wouldn't it be sufficiently random to pick a random index, and if it's a gap, pick another one?
That would be a correct implementation, but ...
I suppose in theory, this could be in infinite process, but in practice, it would be O(1) with a constant of two or three... Better than iterating through on average half of the keys.
There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
... BTW, isn't it impossible to randomly select from an infinite iterable anyway?
Of course, but it is possible to do uniform random selection, in one pass using constant space and in linear time, from an iterable whose length isn't known in advance (simple case of "reservoir sampling").
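The simple case mentioned here can be sketched as follows (a hypothetical helper name; each new item replaces the current pick with probability 1/i, which leaves every item equally likely once the input is exhausted):

```python
import random

def reservoir_choice(iterable, rng=random):
    # One-pass uniform selection from an iterable of unknown length
    # (reservoir sampling with a reservoir of size 1): constant space,
    # linear time, one rng draw per item.
    pick = None
    n = 0
    for item in iterable:
        n += 1
        if rng.randrange(n) == 0:  # true with probability 1/n
            pick = item
    if n == 0:
        raise IndexError('Cannot choose from an empty iterable')
    return pick
```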

There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
Easy, yes. Common? I wonder. If it were common then wouldn't there be good reason to resize the hash table when that occurred? Aside from being able to select random items, of course...
-CHB

On Wed, Apr 13, 2016 at 8:31 AM, Chris Barker - NOAA Federal chris.barker@noaa.gov wrote:
There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
Easy, yes. Common? I wonder. If it were common then wouldn't there be good reason to resize the hash table when that occurred? Aside from being able to select random items, of course...
So it's becoming clear that if we wanted to do this right for sets (and dicts) the choice() implementation should be part of the set/dict implementation. It can then take care of compacting the set if its density is too low.
But honestly I don't see the use case come up enough to warrant all that effort.

[Tim]
There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
[Chris Barker - NOAA Federal chris.barker@noaa.gov]
Easy, yes. Common? I wonder.
Depends on the app. I doubt it's common across all apps.
If it were common then wouldn't there be good reason to resize the hash table when that occurred? Aside from being able to select random items, of course...
Shrinking the table would have no primary effect on speed of subsequent access or deletion - it's O(1) expected regardless. Shrinking the table is expensive when it's done (the entire object has to be traversed, and the items reinserted one by one, into a smaller dict/set).
Regardless, I'd be loath to add a conditional branch on every deletion to check. Nothing is free, and the check would be a waste of cycles every time for apps that don't care about O(1) uniform random dict/set selection. Note that I don't care about the "wasted" memory, though. But then I don't care about O(1) uniform random dict/set selection either ;-)

On 13.04.2016 07:08, Tim Peters wrote:
[Chris Barker - NOAA Federal chris.barker@noaa.gov]
... Anyway, the hash table has gaps, but wouldn't it be sufficiently random to pick a random index, and if it's a gap, pick another one?
That would be a correct implementation, but ...
I suppose in theory, this could be in infinite process, but in practice, it would be O(1) with a constant of two or three... Better than iterating through on average half of the keys.
There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
Converting a set to a list is O(N) (with N being the number of slots allocated for the set, with N > n, the number of used keys), so any gap skipping logic won't be better in performance than doing:
import random

my_set = {1, 2, 3, 4}
l = list(my_set)
selection = [random.choice(l) for x in l]
print(selection)
You'd only have a memory advantage, AFAICT.

On Wed, Apr 13, 2016 at 8:36 PM, Terry Reedy tjreedy@udel.edu wrote:
On 4/13/2016 12:52 AM, Chris Barker - NOAA Federal wrote:
BTW, isn't it impossible to randomly select from an infinite iterable anyway?
With equal probability, yes, impossible. With skewed probabilities, no, possible.
What, you mean like this?
def choice(it):
    it = iter(it)
    value = next(it)
    try:
        while random.randrange(2):
            value = next(it)
    except StopIteration:
        pass
    return value
I'm not sure how useful it is, but it does accept potentially infinite iterables, and does return values selected at random...
ChrisA
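For what it's worth, the weighting of that sketch is easy to see empirically: item i is returned with probability 2**-(i+1) (assuming a fair randrange), so roughly half the picks are the first item:

```python
import random
from collections import Counter
from itertools import count

def choice(it):
    # Chris's sketch: keep advancing with probability 1/2 per step.
    it = iter(it)
    value = next(it)
    try:
        while random.randrange(2):
            value = next(it)
    except StopIteration:
        pass
    return value

random.seed(2016)
counts = Counter(choice(count()) for _ in range(10000))
# item 0 gets about 1/2 of the picks, item 1 about 1/4, item 2 about 1/8, ...
```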

Chris Angelico wrote:
On Wed, Apr 13, 2016 at 8:36 PM, Terry Reedy tjreedy@udel.edu wrote:
On 4/13/2016 12:52 AM, Chris Barker - NOAA Federal wrote:
BTW, isn't it impossible to randomly select from an infinite iterable anyway?
With equal probability, yes, impossible.
def choice(it):
    it = iter(it)
    value = next(it)
    try:
        while random.randrange(2):
            value = next(it)
    except StopIteration:
        pass
    return value
I think Terry meant that you can't pick just one item that's equally likely to be any of the infinitely many items returned by the iterator.
You can prove that by considering that the probability of a given item being returned would have to be 1/infinity, which is zero -- so you can't return anything!

On Thu, Apr 14, 2016 at 09:47:37AM +1200, Greg Ewing wrote:
Chris Angelico wrote:
On Wed, Apr 13, 2016 at 8:36 PM, Terry Reedy tjreedy@udel.edu wrote:
On 4/13/2016 12:52 AM, Chris Barker - NOAA Federal wrote:
BTW, isn't it impossible to randomly select from an infinite iterable anyway?
With equal probability, yes, impossible.
[...]
I think Terry meant that you can't pick just one item that's equally likely to be any of the infinitely many items returned by the iterator.
Correct. That's equivalent to choosing a positive integer with uniform probability distribution and no upper bound.
You can prove that by considering that the probability of a given item being returned would have to be 1/infinity, which is zero -- so you can't return anything!
That's not how probability works :-)
Consider a dart which is thrown at a dartboard. The probability of it landing on any specific point is zero, since the area of a single point is zero. Nevertheless, the dart does hit somewhere!
A formal and precise treatment would have to involve calculus and limits as the probability approaches zero, rather than a flat out "the probability is zero, therefore it's impossible".
Slightly less formally, we can say (only horrifying mathematicians a little bit) that the probability of any specific number is an infinitesimal number.
https://en.wikipedia.org/wiki/Infinitesimal
While it is *mathematically* meaningful to talk about selecting a random positive integer uniformly, it's hard to do much more than that. The mean (average) is undefined[1]. A typical value chosen would have a vast number of digits, far larger than anything that could be stored in computer memory. Indeed Almost All[2] of the values we generate would be so large that we have no notation for writing it down (and not enough space in the universe to write it even if we did). So it is impossible in practice to select a random integer with uniform distribution and no upper bound.
Non-uniform distributions, though, are easy :-)
[1] While weird, this is not all that weird. For example, selecting numbers from a Cauchy distribution also has an undefined mean. What this means in practice is that the *sample mean* will not converge as you take more and more samples: the more samples you take, the more wildly the average will jump all over the place.
https://en.wikipedia.org/wiki/Cauchy_distribution#Estimation_of_parameters
[2] https://en.wikipedia.org/wiki/Almost_all
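Steven's remark that non-uniform distributions are easy can be sketched concretely: a geometric distribution puts positive probability on every positive integer while still summing to 1 (a sketch, with P(n) = 2**-n):

```python
import random

def random_positive_int(rng=random):
    # Flip a fair coin until the first tail: P(n) = 2**-n for n = 1, 2, 3, ...
    # Every positive integer can occur, but small values dominate (mean = 2).
    n = 1
    while rng.randrange(2):
        n += 1
    return n
```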

Steven D'Aprano wrote:
A formal and precise treatment would have to involve calculus and limits as the probability approaches zero, rather than a flat out "the probability is zero, therefore it's impossible".
The limit of p = 1/n as n goes to infinity is zero. Events with zero probability can't happen. I don't know how it can be made more rigorous than that.
I think what this means is that if you somehow wrote a program to draw a number from an infinite uniform distribution, it would never terminate.
< A typical value chosen would have a vast
number of digits, far larger than anything that could be stored in computer memory.
I'm not sure it even makes sense to talk about a typical number, because for any number you pick, there are infinitely many numbers with more digits, but only finitely many with the same or fewer digits. So the probability of getting only that many digits is zero, too!
Infinity: Just say no.

On Thu, Apr 14, 2016 at 12:03 AM, Greg Ewing greg.ewing@canterbury.ac.nz wrote:
Steven D'Aprano wrote:
A formal and precise treatment would have to involve calculus and limits as the probability approaches zero, rather than a flat out "the probability is zero, therefore it's impossible".
The limit of p = 1/n as n goes to infinity is zero. Events with zero probability can't happen. I don't know how it can be made more rigorous than that.
This only holds true if the sample space is finite. The mathematical term for an event with zero probability that nonetheless can happen is "almost never".
See https://en.wikipedia.org/wiki/Almost_surely
As usual, infinity is weird.

Greg Ewing writes:
The limit of p = 1/n as n goes to infinity is zero. Events with zero probability can't happen. I don't know how it can be made more rigorous than that.
Sorry, as Steven pointed out, events *that are modeled as* having zero probability happen all the time. What you should not do with such events is give them positive (ie, atomic) weight in statistical calculations (= theory), and therefore you should not *predict* their occurrence as individuals (unless you have no fear of becoming a laughingstock = practice).
Infinity: Just say no.
I thought your name was Ewing. Is it really Brouwer?<wink/>
(Don't bother, I know. The reference to Brouwer is close enough for tsukkomi[1].)
Footnotes: [1] https://en.wikipedia.org/wiki/Glossary_of_owarai_terms
N.B. "Straight man" is not an accurate translation.

For any positive integer you select (including those with more digits than there are particles in the universe), ALMOST ALL integers are larger than your selection. I.e. the measure of those smaller remains zero.

On 4/13/2016 6:46 AM, Chris Angelico wrote:
On Wed, Apr 13, 2016 at 8:36 PM, Terry Reedy tjreedy@udel.edu wrote:
On 4/13/2016 12:52 AM, Chris Barker - NOAA Federal wrote:
BTW, isn't it impossible to randomly select from an infinite iterable anyway?
With equal probability, yes, impossible.
I have seen too many mathematical or statistical writers who ought to know better write "Let N be a random integer..." with no indication of a finite bound or other than a uniform distribution. No wonder students and readers sometimes get confused.
With skewed probabilities, no, possible.
What, you mean like this?
def choice(it):
    it = iter(it)
    value = next(it)
    try:
        while random.randrange(2):
            value = next(it)
    except StopIteration:
        pass
    return value
With a perfect source of random numbers, this will halt with probability one. With current pseudorandom generators, this will definitely halt. And yes, both theoretically and practically, this is an example of skewed probabilities -- a waiting time distribution.
I'm not sure how useful it is, but it does accept potentially infinite iterables, and does return values selected at random...
People often want variates selected from a non-uniform distribution. Often, a uniform variate can be transformed. Sometimes multiple uniform variates are needed.
If 'it' above is itertools.count, the above models the waiting time to get 'heads' (or 'tails') with a fair coin. Waiting times might be obtained more efficiently, perhaps, with, say, randrange(2**16), with another draw for as long as the value is 0 plus some arithmetic that uses int.bit_length. (Details to be verified.)
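The bit_length trick sketched above might look like this (my reconstruction of the details left "to be verified", so treat it as an assumption): a uniform draw from [0, 2**16) has bit_length 16 with probability 1/2, bit_length 15 with probability 1/4, and so on, so one draw usually yields the whole waiting time.

```python
import random

def waiting_time(rng=random):
    # Geometric(1/2) waiting time (number of flips up to and including
    # the first head), using one 16-bit draw instead of flip-by-flip.
    # bit_length k occurs with probability 2**(k-17) for k = 1..16, so
    # 17 - bit_length gives the waiting time. A draw of 0 means 16 tails
    # in a row: add 16 to the count and draw again.
    offset = 0
    r = rng.randrange(2 ** 16)
    while r == 0:
        offset += 16
        r = rng.randrange(2 ** 16)
    return offset + 17 - r.bit_length()
```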

Rob Cliffe rob.cliffe@btinternet.com writes:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) )
I am +1 on the notion that an instance of ‘tuple’, ‘list’, ‘set’, and even ‘dict’, should each be accepted as input for ‘random.choice’.
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable.
I am −1 on the notion of an arbitrary iterable as ‘random.choice’ input.
Arbitrary iterables may be consumed merely by being iterated. This will force the caller to make a copy, which may as well be a normal sequence type, defeating the purpose of accepting that “arbitrary iterable” type.
Arbitrary iterables may never finish iterating. This means the call to ‘random.choice’ would sometimes never return.
The ‘random.choice’ function should IMO not be responsible for dealing with those cases.
Instead, a more moderate proposal would be to have ‘random.choice’ accept an arbitrary container.
If the object implements the container protocol (by which I think I mean that it conforms to ‘collections.abc.Container’), it is safe to iterate and to treat its items as a collection from which to choose a random item.
Rob, does that proposal satisfy the requirements that motivated this?

This still seems wrong *unless* we add a protocol to select a random item in O(1) time. There currently isn't one for sets and mappings -- only for sequences. It would be pretty terrible if someone wrote code to get N items from a set of size N and their code ended up being O(N**2).

Thanks to everyone for all the feedback so far.
Interesting. I had not heard of reservoir sampling, and it had not occurred to me that one could select a uniformly random element from a sequence of unknown length without copying it.
I am with Ben in that I consider the most important type for random.choice to accept, that it doesn't already, to be "set". If just that could be achieved, yes Ben, I think that would be a plus. (The fact that I could write a version that handled generators (efficiently or not) just seemed to be a bonus.) ISTM however that there is a problem accepting dict, because of the ambiguity - does it mean a random key, a random value, or a random (key, value) item? (I would pick a random key (consistent with "for k in MyDict"), but ISTM "explicit is better than implicit".)
There are problems with arbitrary iterables as Ben points out. But I don't see these as an objection to making random.choice accept them, because
(1) It is easy to produce examples of stdlib functions that can be crashed with infinite iterables, e.g.
    >>> def forever():
    ...     while 1:
    ...         yield 1
    ...
    >>> itertools.product(forever(), forever())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError
(2) As I indicated in my original post, it is understood that generators are "throw-away", so may need to be copied. 'Consenting adults', no?
One small point: Whenever I've tested it (quite a few times, for different projects), I've always found that it was faster to copy to a tuple than to a list.
I did a little experimenting (admittedly non-rigorous, on one platform only, and using Python 2.7.10, not Python 3, and using code which could very possibly be improved) on selecting a random element from a generator, and found that:
- for small or moderate generators, reservoir sampling was almost always slower than generating a tuple
- as the generator length increased up to roughly 10,000, the ratio (time taken by reservoir) / (time taken by tuple) increased, reaching a maximum of over 4
- as the generator length increased further, the ratio started to decrease, although for a length of 80 million (about as large as I could test) it was still over 3.
(This would change if the reservoir approach could be recoded in a way that was amazingly faster than the way I did it, but, arrogance aside, I doubt that. I will supply details of my code on request.) I think this is a tribute to the efficiency of tuple building. It's also enough to convince me that reservoir sampling, or Marco Cognetta's compromise approach, are non-starters. More rigorous testing would be necessary to convince the rest of the world.
I am also pretty well convinced from actual tests (not from theoretical reasoning) that: the "convert it to a tuple" recipe is not far off O(N), both for sets and generators (it gets worse for generators that produce objects with large memory footprints), and is certainly fast enough to be useful (the point at which it gets unacceptably slow is generally not far from the point where you run out of memory). I did try an approach similar to Chris Angelico's of iterating over sets up to the selection point (my bikeshed actually used enumerate rather than zip), but it was much slower than the "tuple-ise" algorithm.
On 13/04/2016 00:04, Guido van Rossum wrote:
This still seems wrong *unless* we add a protocol to select a random item in O(1) time. There currently isn't one for sets and mappings -- only for sequences.
Hm, would this need a change to the structure of dicts and sets? Is there by any chance a quick way of getting the Nth item added in an OrderedDict?
It would be pretty terrible if someone wrote code to get N items from a set of size N and their code ended up being O(N**2).
Er, I'm not quite sure what you're saying here. random.sample allows you to select k items from a set of size N (it converts the set to a tuple). If k=N, i.e. you're trying to select the whole set, that's a silly thing to do if you know you're doing it. But even in this case, random.sample is approximately O(N); I tested it. If someone writes their own inefficient version of random.sample, that's their problem. Have I misunderstood? Do you mean selecting N items with replacement?
Hm, it occurs to me that random.sample itself could be extended to choose k unique random elements from a general iterable. I have seen a reservoir sampling algorithm that claims to be O(N): https://en.wikipedia.org/wiki/Reservoir_sampling
Rob
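Extending random.sample to arbitrary iterables along the lines suggested above could look like this (a sketch of the classic "Algorithm R" reservoir sampler, not a drop-in replacement for the stdlib function):

```python
import random

def sample_iter(iterable, k, rng=random):
    # Reservoir sampling (Algorithm R): k uniform picks without
    # replacement from an iterable of unknown length.
    # One pass, O(n) time, O(k) extra space.
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i + 1)  # uniform in [0, i]
            if j < k:
                reservoir[j] = item
    if len(reservoir) < k:
        raise ValueError('Sample larger than population')
    return reservoir
```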

On 04/12/2016 09:09 PM, Rob Cliffe wrote:
I did a little experimenting (admittedly non-rigorous, on one platform only, and using Python 2.7.10, not Python 3, and using code which could very possibly be improved) on selecting a random element from a generator, and found that:
- for small or moderate generators, reservoir sampling was almost always slower than generating a tuple
- as the generator length increased up to roughly 10,000, the ratio (time taken by reservoir) / (time taken by tuple) increased, reaching a maximum of over 4
- as the generator length increased further, the ratio started to decrease, although for a length of 80 million (about as large as I could test) it was still over 3.
I suspect this proves the point -- reservoir sampling is good not because it is fast, but because it won't drain your memory keeping items you will not return.
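For reference, the reservoir-sampling variant under discussion can be sketched as follows (a minimal version, not Ethan's actual test code):

```python
import random

def reservoir_choice(iterable, rng=random):
    # Size-1 reservoir sampling: the k-th element (counting from 1)
    # replaces the current pick with probability 1/k, which yields a
    # uniform choice over the whole stream in O(1) memory.
    pick = sentinel = object()
    for k, item in enumerate(iterable, start=1):
        if rng.randrange(k) == 0:  # true with probability 1/k
            pick = item
    if pick is sentinel:
        raise IndexError('Cannot choose from an empty iterable')
    return pick
```

The first element is always accepted (randrange(1) is always 0), so the sentinel only survives when the iterable is empty.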
-- ~Ethan~

On Wed, Apr 13, 2016 at 7:40 AM, Rob Cliffe rob.cliffe@btinternet.com wrote:
And here is my proposed (tested) version:
def choice(self, it):
    """Choose a random element from a non-empty iterable."""
    try:
        it[0]
    except IndexError:
        raise IndexError('Cannot choose from an empty sequence')
    except TypeError:
        it = tuple(it)
    try:
        i = self._randbelow(len(it))
    except ValueError:
        raise IndexError('Cannot choose from an empty iterable')
    return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like random.choice(x for x in xrange(10) if x%3) # this feels quite "Pythonic" to me!
Small point of order: Pretty much everything discussed here on python-ideas is about Python 3. It's best to make sure your code works with the latest CPython (currently 3.5), as a change like this would be landing in 3.6 at the earliest. So what I'd be looking at is this:
random.choice(x for x in range(10) if x%3)
AFAIK this doesn't change your point at all, but it is worth being careful of; Python 3's range object isn't quite the same as Python 2's xrange, and it's possible something might not "just work". (For the inverse case, "if x%3 == 0", you can simply use random.choice(range(0, 10, 3)) to do what you want.)
I don't like the proposed acceptance of arbitrary iterables. In the extremely rare case where you actually do want that, you can always manually wrap it in list() or tuple(). But your original use-case does have merit:
It surprised me a bit the first time I realised that random.choice did not work on a set.
A set has a length, but it can't be indexed. It should be perfectly reasonable to ask for a random element out of a set! So here's my counter-proposal: Make random.choice accept any iterable with a __len__.
def choice(self, coll):
    """Choose a random element from a non-empty collection."""
    try:
        i = self._randbelow(len(coll))
    except ValueError:
        raise IndexError('Cannot choose from an empty collection')
    try:
        return coll[i]
    except TypeError:
        for _, value in zip(range(i+1), coll):
            pass
        return value
Feel free to bikeshed the method of iterating part way into a collection (enumerate() and a comparison? call iter, then call next that many times, then return next(it)?), but the basic concept is that you have to have a length and then you iterate into it that far.
It's still not arbitrary iterables, but it'll handle sets and dict views. Handling dicts directly may be a little tricky; since they support subscripting, they'll either raise KeyError or silently return a value, where the iteration-based return value would be a key. Breaking this out into its own function would be reliable there (take out the try/except and just go straight into iteration).
ChrisA

This is no good. Who wants a choice() that is O(1) on sequences but degrades to O(N) if the argument is a set?
I see the current behavior as a feature: it works when the argument is indexable, and when it isn't, it fails cleanly, prompting the user to use their brain and decide what's right. Maybe you're programming a toy example. Then you can just call random.choice(list(x)) and move on. Or maybe you're trying to do something serious -- then it behooves you to copy the set into a list variable once and then repeatedly choose from that list, or maybe you should have put the data in a list in the first place.
But putting the toy solution in the stdlib is a bad idea, and so is putting a bad (O(N)) algorithm there.
So the status quo wins for a reason!
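The copy-once pattern recommended above can be sketched as:

```python
import random

# Convert the set to a list once, then choose from that list
# repeatedly; each individual choice is then O(1) instead of
# paying an O(N) conversion on every call.
population = set(range(1000))
pool = list(population)  # one O(N) copy
picks = [random.choice(pool) for _ in range(100)]
```

This keeps the cost of the set-to-sequence conversion explicit and amortises it across all the choices.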
--Guido
On Tue, Apr 12, 2016 at 7:02 PM, Chris Angelico rosuav@gmail.com wrote:
On Wed, Apr 13, 2016 at 7:40 AM, Rob Cliffe rob.cliffe@btinternet.com wrote:
And here is my proposed (tested) version:
def choice(self, it):
    """Choose a random element from a non-empty iterable."""
    try:
        it[0]
    except IndexError:
        raise IndexError('Cannot choose from an empty sequence')
    except TypeError:
        it = tuple(it)
    try:
        i = self._randbelow(len(it))
    except ValueError:
        raise IndexError('Cannot choose from an empty iterable')
    return it[i]
This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like random.choice(x for x in xrange(10) if x%3) # this feels quite "Pythonic" to me!
Small point of order: Pretty much everything discussed here on python-ideas is about Python 3. It's best to make sure your code works with the latest CPython (currently 3.5), as a change like this would be landing in 3.6 at the earliest. So what I'd be looking at is this:
random.choice(x for x in range(10) if x%3)
AFAIK this doesn't change your point at all, but it is worth being careful of; Python 3's range object isn't quite the same as Python 2's xrange, and it's possible something might not "just work". (For the inverse case, "if x%3 == 0", you can simply use random.choice(range(0, 10, 3)) to do what you want.)
I don't like the proposed acceptance of arbitrary iterables. In the extremely rare case where you actually do want that, you can always manually wrap it in list() or tuple(). But your original use-case does have merit:
It surprised me a bit the first time I realised that random.choice did not work on a set.
A set has a length, but it can't be indexed. It should be perfectly reasonable to ask for a random element out of a set! So here's my counter-proposal: Make random.choice accept any iterable with a __len__.
def choice(self, coll):
    """Choose a random element from a non-empty collection."""
    try:
        i = self._randbelow(len(coll))
    except ValueError:
        raise IndexError('Cannot choose from an empty collection')
    try:
        return coll[i]
    except TypeError:
        for _, value in zip(range(i+1), coll):
            pass
        return value
Feel free to bikeshed the method of iterating part way into a collection (enumerate() and a comparison? call iter, then call next that many times, then return next(it)?), but the basic concept is that you have to have a length and then you iterate into it that far.
It's still not arbitrary iterables, but it'll handle sets and dict views. Handling dicts directly may be a little tricky; since they support subscripting, they'll either raise KeyError or silently return a value, where the iteration-based return value would be a key. Breaking this out into its own function would be reliable there (take out the try/except and just go straight into iteration).
ChrisA

On Wed, Apr 13, 2016 at 12:11 PM, Guido van Rossum guido@python.org wrote:
This is no good. Who wants a choice() that is O(1) on sequences but degrades to O(N) if the argument is a set?
Fair point. Suggestion withdrawn.
If you want something that does the iteration-till-it-finds-something, that should be a separate function. (And then it'd work automatically on dicts, too.)
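A sketch of such a separate function (the name is illustrative; this assumes the argument still needs a length, as in the withdrawn proposal):

```python
import random
from itertools import islice

def choice_iter(coll, rng=random):
    # Hypothetical stand-alone helper: choose by iterating, never by
    # subscripting, so it works uniformly on sets, dict views, and
    # dicts alike; for a dict it yields a random *key*.
    n = len(coll)
    if n == 0:
        raise IndexError('Cannot choose from an empty collection')
    return next(islice(iter(coll), rng.randrange(n), None))
```

Because it never tries `coll[i]`, the dict ambiguity described earlier (KeyError vs. silently returning a value) disappears.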
ChrisA

On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe rob.cliffe@btinternet.com wrote:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) )
As you point out in your proposed solution, the problem is `random.choice` expects an index-able value. You can easily create an index-able set using the sortedcontainers.SortedSet interface even if the elements aren't comparable. Simply use:
```python
from sortedcontainers import SortedSet

def zero(value):
    return 0

class IndexableSet(SortedSet):
    def __init__(self, *args, **kwargs):
        super(IndexableSet, self).__init__(*args, key=zero, **kwargs)

values = IndexableSet(range(100))

import random

random.choice(values)
```
`__contains__`, `__iter__`, `len`, `add`, `__getitem__` and `__delitem__` will all be fast. But `discard` will take potentially linear time. So:
```
index = random.randrange(len(values))
value = values[index]
del values[index]
```
will be much faster than:
```
value = random.choice(values)
values.discard(value)
```
Otherwise `IndexableSet` will work as a drop-in replacement for your `set` needs. Read more about the SortedContainers project at http://www.grantjenks.com/docs/sortedcontainers/
Grant

Maybe, instead of all of this (since, as others have put it, there would hardly be a "one size fits all"), the nice thing to have would be a "choice" or "randompop" method for sets.
Sets are the one natural kind of object from which it would seem normal to be able to pick a random element (or pop one), since they have no order to start with, and the one for which it is frustrating when one realizes "random.choice" fails.
So either "random.set_choice" and "random.set_pop" functions in the random module, or "choice" and "pop_random" methods on the "set" interface itself, could be more practical, in terms of real-world usage, than speculating about a way for choice to work on iterables of unknown size, among other objects.
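The names `set_choice` and `set_pop` come from the proposal above; the implementations below are an assumed sketch, not anything specified in the thread. Both are O(N) per call, since a set offers no cheaper route:

```python
import random

def set_choice(s, rng=random):
    # Hypothetical sketch: uniformly random element of a set, by
    # iterating to a randomly chosen position. O(N) per call.
    if not s:
        raise KeyError('set_choice from an empty set')
    i = rng.randrange(len(s))
    for j, value in enumerate(s):
        if j == i:
            return value

def set_pop(s, rng=random):
    # Like set.pop(), but removes a uniformly random element
    # instead of an arbitrary one.
    value = set_choice(s, rng)
    s.discard(value)
    return value
```

Note that the built-in `set.pop()` removes an arbitrary element, not a random one, which is why a separate `set_pop` is being floated at all.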
On 13 April 2016 at 02:20, Grant Jenks grant.jenks@gmail.com wrote:
On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe rob.cliffe@btinternet.com wrote:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) )
As you point out in your proposed solution, the problem is `random.choice` expects an index-able value. You can easily create an index-able set using the sortedcontainers.SortedSet interface even if the elements aren't comparable. Simply use:
from sortedcontainers import SortedSet

def zero(value):
    return 0

class IndexableSet(SortedSet):
    def __init__(self, *args, **kwargs):
        super(IndexableSet, self).__init__(*args, key=zero, **kwargs)

values = IndexableSet(range(100))

import random

random.choice(values)
`__contains__`, `__iter__`, `len`, `add`, `__getitem__` and `__delitem__` will all be fast. But `discard` will take potentially linear time. So:
index = random.randrange(len(values))
value = values[index]
del values[index]
will be much faster than:
value = random.choice(values)
values.discard(value)
Otherwise `IndexableSet` will work as a drop-in replacement for your `set` needs. Read more about the SortedContainers project at http://www.grantjenks.com/docs/sortedcontainers/
Grant

On 13 April 2016 at 06:29, Joao S. O. Bueno jsbueno@python.org.br wrote:
Maybe, instead of all of this (since, as others have put it, there would hardly be a "one size fits all"), the nice thing to have would be a "choice" or "randompop" method for sets.
Given that your choice method for sets would be little more than
value = random.choice(tuple(s))
it seems like it's probably overkill to build it into Python - particularly as doing so restricts the implementation to Python 3.6+ whereas writing it out by hand works for any version. It also makes the complexity of the operation explicit, which is a good thing.
Paul

On 13/04/2016 09:32, Paul Moore wrote:
On 13 April 2016 at 06:29, Joao S. O. Bueno jsbueno@python.org.br wrote:
Maybe, instead of all of this (since, as others have put it, there would hardly be a "one size fits all"), the nice thing to have would be a "choice" or "randompop" method for sets.
Given that your choice method for sets would be little more than
value = random.choice(tuple(s))
it seems like it's probably overkill to build it into Python - particularly as doing so restricts the implementation to Python 3.6+ whereas writing it out by hand works for any version. It also makes the complexity of the operation explicit, which is a good thing.
Paul
Isn't there an inconsistency that random.sample caters to a set by converting it to a tuple, but random.choice doesn't? Surely random.sample(someSet, k) will always be slower than a "tuple-ising" random.choice(someSet) implementation? And it wouldn't prevent you from converting your set (or other iterable) to a sequence explicitly. Rob
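The inconsistency can be demonstrated directly. A small sketch (note that sample's acceptance of sets was later deprecated, in Python 3.9, so the sketch passes a list for portability across versions):

```python
import random

s = {'a', 'b', 'c'}

# random.choice indexes its argument, so a plain set raises TypeError:
try:
    random.choice(s)
    pick = None
except TypeError:
    pick = random.choice(tuple(s))  # the usual workaround

# random.sample on the 3.x of this thread copied a set internally;
# passing a list works on every version and is equivalent:
sample_pick = random.sample(list(s), 1)[0]
```

So `sample(s, 1)[0]` historically "worked" where `choice(s)` did not, even though both ultimately pay the same tuple-ising cost.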

On Wed, Apr 13, 2016 at 2:47 AM, Rob Cliffe rob.cliffe@btinternet.com wrote:
Isn't there an inconsistency that random.sample caters to a set by converting it to a tuple, but random.choice doesn't?
Perhaps because the use cases are different? Over the years I've learned that inconsistencies aren't always signs of sloppy thinking -- they may actually point to deep issues that aren't apparent on the surface.
I imagine the typical use case for sample() to be something that samples the population once and then does something to the sample; the next time sample() is called the population is probably different (e.g. the next lottery has a different set of players).
But I imagine a fairly common use case for choice() to be choosing from the same population over and over, and that's exactly the case where the copying implementation you're proposing would be a small disaster.

On Wed, Apr 13, 2016 at 12:45 PM Guido van Rossum guido@python.org wrote:
On Wed, Apr 13, 2016 at 2:47 AM, Rob Cliffe rob.cliffe@btinternet.com wrote:
Isn't there an inconsistency that random.sample caters to a set by converting it to a tuple, but random.choice doesn't?
Perhaps because the use cases are different? Over the years I've learned that inconsistencies aren't always signs of sloppy thinking -- they may actually point to deep issues that aren't apparent on the surface.
I imagine the typical use case for sample() to be something that samples the population once and then does something to the sample; the next time sample() is called the population is probably different (e.g. the next lottery has a different set of players).
But I imagine a fairly common use case for choice() to be choosing from the same population over and over, and that's exactly the case where the copying implementation you're proposing would be a small disaster.
Repeated random sampling from the same population is a common use case ("bootstrapping"). Perhaps the oversight was *allowing* sets as input into random.sample.
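A minimal sketch of that bootstrap use case, where choice() draws from the same list-backed population over and over (illustrative data, not from the thread):

```python
import random

# Bootstrapping: resample the *same* population many times, with
# replacement, computing the statistic of interest on each resample.
data = [2.0, 3.5, 1.2, 4.8, 3.3]
n = len(data)

means = []
for _ in range(1000):
    resample = [random.choice(data) for _ in range(n)]
    means.append(sum(resample) / n)
# The spread of `means` estimates the sampling error of the mean.
```

This is exactly the repeated-choice pattern that would suffer if choice() copied its argument on every call.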
participants (20)
- Ben Finney
- Chris Angelico
- Chris Barker - NOAA Federal
- David Mertz
- Ethan Furman
- Grant Jenks
- Greg Ewing
- Guido van Rossum
- Ian Kelly
- Joao S. O. Bueno
- M.-A. Lemburg
- Mahmoud Hashemi
- Marco Cognetta
- Michael Selik
- Paul Moore
- Rob Cliffe
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Peters