
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) ) There is a simple workaround (convert the set to a tuple or list before passing it) but ... On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable. Raymond Hettinger commented "It could be generalized to arbitrary iterables (Bentley provides an example of how to do this) but it is fragile (i.e. falls apart badly with weak random number generators) and doesn't correspond well with real use cases." Sven replied "Again, there definitely are real world use cases ..."
I would like to revive this suggestion. I don't understand Raymond's comment. It seems to me that a reasonable implementation is better and more friendly than none, particularly for newbies.
This is the source code for random.choice in Python 3.2 (as far as I know it hasn't changed since):

    def choice(self, seq):
        """Choose a random element from a non-empty sequence."""
        try:
            i = self._randbelow(len(seq))
        except ValueError:
            raise IndexError('Cannot choose from an empty sequence')
        return seq[i]

And here is my proposed (tested) version:

    def choice(self, it):
        """Choose a random element from a non-empty iterable."""
        try:
            it[0]
        except IndexError:
            raise IndexError('Cannot choose from an empty sequence')
        except TypeError:
            it = tuple(it)
        try:
            i = self._randbelow(len(it))
        except ValueError:
            raise IndexError('Cannot choose from an empty iterable')
        return it[i]

This works on (e.g.) a list/tuple/string, a set, a dictionary view or a generator. Obviously the generator has to be 'fresh' (i.e. any previously consumed values will be excluded from the choice) and 'throw-away' (random.choice will exhaust it). But it means you can write code like

    random.choice(x for x in xrange(10) if x%3)   # this feels quite "Pythonic" to me!

(I experimented with versions that, when passed an object that supported len() but not indexing, viz. a set, iterated through the object as far as necessary instead of converting it to a list. But I found that this was slower in practice as well as more complicated. There might be a memory issue with huge iterables, but we are no worse off using existing workarounds for those.)
Downsides of the proposed version:
(1) Backward compatibility. It could mask an error if a "wrong" object, but one that happens to be an iterable, is passed to random.choice. There is also slightly different behaviour if a dictionary (D) is passed (an unusual thing to do): the current version will get a random integer in the range [0, len(D)), try to use that as a key of D, and raise KeyError if that fails. The proposed version behaves similarly except that it will always raise "KeyError: 0" if 0 is not a key of D. One could argue that this is an improvement, if it matters at all (error reproducibility). And code that deliberately exploits the existing behaviour, provided that it uses a dictionary whose keys are consecutive integers from 0 upwards, e.g.

    D = { 0 : 'zero', 1 : 'one', 2 : 'two', 3 : 'three', 4 : 'four' }
    d = random.choice(D)   # equivalent, in this example, to "d = random.choice(list(D.values()))"
    if d == 'zero':
        ...

will continue to work (although one might argue that it deserves not to).
(2) Speed - the proposed version is almost certainly slightly slower when called with a sequence. For what it's worth, I tried to measure the difference on my several-years-old Windows PC, but it was hard to measure accurately, even over millions of iterations. All I can say is that it appeared to be a small fraction of a millisecond, possibly in the region of 50 nanoseconds, per call.
Best wishes
Rob Cliffe
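For anyone who wants to reproduce that measurement, a minimal timing harness along these lines (my own sketch, not Rob's actual benchmark) reports the per-call cost directly:

    import random
    import timeit

    seq = list(range(1000))
    calls = 10 ** 6
    total = timeit.timeit(lambda: random.choice(seq), number=calls)
    print('%.1f ns per call' % (total / calls * 1e9))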

The problem with your version is that copying the input is slow if it is large. Raymond was referencing the top answer here: http://stackoverflow.com/questions/2394246/algorithm-to-select-a-single-rand.... It's also slow though (draws N random values if the input is of length N). On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe@btinternet.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Maybe I am misunderstanding, but for a generator, couldn't you avoid storing it in memory and just use the following online algorithm to save space? http://stackoverflow.com/a/23352100 (I hope this message goes through correctly, first time participating in a discussion on this mailing list). -Marco On Tue, Apr 12, 2016 at 5:53 PM, Guido van Rossum <guido@python.org> wrote:

Ow, sorry! That's the algorithm I meant to reference (the link I actually gave is for a different situation). See also https://en.wikipedia.org/wiki/Reservoir_sampling. It does indeed avoid storing a copy of the sequence -- at the cost of N calls to random(). Try timing it -- I wouldn't be surprised if copying a set of N elements into a tuple is a lot faster than N random() calls. So we're stuck with two alternatives, neither of which is always the best: (1) just copy the sequence (like Rob's proposal) -- this loses big time if the sequence is large and not already held in memory; (2) use reservoir sampling, at the cost of many more random() calls. Since Rob didn't link to the quoted conversation I don't have it handy to guess what Raymond meant with "Bentley" either, but I'm guessing Jon Bentley wrote about reservoir sampling (before it was named that). I recall hearing about the algorithm for the first time in the late '80s from a coworker with Bell Labs ties. (Later, Ethan wrote)
So the objection is that because performance can vary widely it is better for users to select their own algorithm rather than rely on a one-size-fits-all stdlib solution?
That's pretty much it. Were this 1995, I probably would have put some toy in the stdlib. But these days we should be more careful than that. --Guido On Tue, Apr 12, 2016 at 3:09 PM, Marco Cognetta <cognetta.marco@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)
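For reference, a minimal sketch of the single-element reservoir technique Guido describes above (illustrative code, not anything posted in the thread): the k-th element read replaces the current pick with probability 1/k, so every element ends up with probability 1/n while only one item is ever held in memory.

    import random

    def choice_from_iterable(iterable):
        """Uniformly pick one element from an iterable of unknown length."""
        it = iter(iterable)
        try:
            pick = next(it)
        except StopIteration:
            raise IndexError('Cannot choose from an empty iterable')
        for k, value in enumerate(it, start=2):
            # The k-th element replaces the pick with probability 1/k, which
            # leaves every element seen so far equally likely.
            if random.randrange(k) == 0:
                pick = value
        return pick

This is exactly the trade-off Guido points out: one pass and constant memory, but one randrange() call per element.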

Reservoir sampling is super handy. For a Pythonic introduction, along with links to implementation, see here: https://www.paypal-engineering.com/2016/04/11/statistics-for-software/#dippi... Even though I love the elegance enough to write and illustrate about it, I'd be very surprised to have found it built into Python at this low of a level. Mahmoud On Tue, Apr 12, 2016 at 3:25 PM, Guido van Rossum <guido@python.org> wrote:

At the risk of making this not very easily accessible, couldn't we use both approaches (reservoir sampling and just loading the whole thing into memory) and apply the ski-rental problem? We could come up with some cost for calling random and use that to determine when we should just give up and load it into memory if we do not know the length of the sequence produced by the generator. If the user knows something about the generator beforehand they could specify in some optional parameter whether or not they want to use reservoir sampling or just load it into memory at the start. The theory side of me likes this but I admit it's a little ugly to be included in Python's stdlib. -Marco On Tue, Apr 12, 2016 at 7:05 PM, Mahmoud Hashemi <mahmoud@hatnote.com> wrote:
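A rough sketch of that hybrid (the function name and the 10,000-item cut-off are illustrative choices of mine, not a worked-out cost model): buffer the input up to a threshold, and only fall back to reservoir-style replacement if the iterator outlives the buffer.

    import itertools
    import random

    _IN_BUFFER = object()   # sentinel: "the pick is still somewhere in the buffer"

    def hybrid_choice(iterable, threshold=10000):
        """Uniform choice: copy small inputs, reservoir-sample large ones."""
        it = iter(iterable)
        buffer = list(itertools.islice(it, threshold))
        if not buffer:
            raise IndexError('Cannot choose from an empty iterable')
        pick = _IN_BUFFER
        for k, value in enumerate(it, start=len(buffer) + 1):
            # Past the threshold: element k becomes the pick with probability 1/k.
            if random.randrange(k) == 0:
                pick = value
        # The pick stayed inside the buffer with probability len(buffer)/n, so a
        # uniform draw from the buffer gives every element probability 1/n overall.
        return random.choice(buffer) if pick is _IN_BUFFER else pick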

Guido van Rossum wrote:
The problem with your version is that copying the input is slow if it is large.
Maybe sets should have a method that returns an indexable view? The order could be defined as equivalent to iteration order, and it would allow things like random.choice to work efficiently on sets. -- Greg

[Greg Ewing]
Well, sets and dicts are implemented very similarly in CPython. Note that a dict view (whether of keys, values or items) doesn't support indexing! It's unclear how that could be added efficiently, and the same applies to sets. Iteration works pretty efficiently, because a hidden "search finger" is maintained internally, skipping over the gaps in the hash table as needed (so iteration can, internally, make a number of probes equal to the number of hash slots, and no more than that) - but that's no help for indexing by a random integer. I'll just add that I've never had a real need for a random selection from a set. For all the many set algorithms I've coded, an "arbitrary" element worked just as well as a "random" element, and set.pop() works fine for that.

I also haven't had a use case for random items from a set, but have had a use case for randomly (not arbitrarily) choosing from a dict (see trigrams: http://codekata.com/kata/kata14-tom-swift-under-the-milkwood/) As dicts and sets share implementation, why not both? Anyway, the hash table has gaps, but wouldn't it be sufficiently random to pick a random index, and if it's a gap, pick another one? I suppose in theory this could be an infinite process, but in practice it would be O(1) with a constant of two or three... Better than iterating through on average half of the keys. (How many gaps are there? I have no idea.) I know nothing about the implementation, and have not thought carefully about the statistics, but it seems do-able. I'll just shut up if I'm way off base. BTW, isn't it impossible to randomly select from an infinite iterable anyway? -CHB
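As a toy model of what Chris describes (CPython does not expose its hash-table slots to Python code, so the slot list below is purely hypothetical), rejection sampling over a reasonably dense table needs only a couple of probes on average:

    import random

    # Hypothetical slot table: occupied entries interspersed with gaps (None).
    slots = ['a', None, None, 'b', 'c', None, 'd', None, 'e']

    def choice_by_slot(slots):
        """Probe random slots until one is occupied; the expected number of
        probes is len(slots)/occupied, i.e. O(1) while the table stays dense."""
        while True:
            value = slots[random.randrange(len(slots))]
            if value is not None:
                return value

Tim's reply below is the catch: there is no lower bound on density, so a nearly empty table can make the expected number of probes arbitrarily large.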

[Chris Barker - NOAA Federal <chris.barker@noaa.gov>]
That would be a correct implementation, but ...
There's an upper limit on how dense a CPython dict or set can become (the load factor doesn't exceed 2/3), but no lower limit. For example, it's easy to end up with a dict holding a single entry hiding among millions of empty slots (dicts are never resized on key deletion, only on key insertion).
... BTW, isn't it impossible to randomly select from an infinite iterable anyway?
Of course, but it is possible to do uniform random selection, in one pass using constant space and in linear time, from an iterable whose length isn't known in advance (simple case of "reservoir sampling").

On Wed, Apr 13, 2016 at 8:31 AM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
So it's becoming clear that if we wanted to do this right for sets (and dicts) the choice() implementation should be part of the set/dict implementation. It can then take care of compacting the set if its density is too low. But honestly I don't see the use case come up enough to warrant all that effort. -- --Guido van Rossum (python.org/~guido)

[Tim]
[Chris Barker - NOAA Federal <chris.barker@noaa.gov>]
Easy, yes. Common? I wonder.
Depends on the app. I doubt it's common across all apps.
Shrinking the table would have no primary effect on speed of subsequent access or deletion - it's O(1) expected regardless. Shrinking the table is expensive when it's done (the entire object has to be traversed, and the items reinserted one by one, into a smaller dict/set). Regardless, I'd be loath to add a conditional branch on every deletion to check. Nothing is free, and the check would be a waste of cycles every time for apps that don't care about O(1) uniform random dict/set selection. Note that I don't care about the "wasted" memory, though. But then I don't care about O(1) uniform random dict/set selection either ;-)

On 13.04.2016 07:08, Tim Peters wrote:
Converting a set to a list is O(N) (with N being the number of slots allocated for the set, with N > n, the number of used keys), so any gap-skipping logic won't be better in performance than doing:

    import random
    my_set = {1, 2, 3, 4}
    l = list(my_set)
    selection = [random.choice(l) for x in l]
    print(selection)

You'd only have a memory advantage, AFAICT.
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Apr 13 2016)

On Wed, Apr 13, 2016 at 8:36 PM, Terry Reedy <tjreedy@udel.edu> wrote:
What, you mean like this?

    def choice(it):
        it = iter(it)
        value = next(it)
        try:
            while random.randrange(2):
                value = next(it)
        except StopIteration:
            pass
        return value

I'm not sure how useful it is, but it does accept potentially infinite iterables, and does return values selected at random... ChrisA

Chris Angelico wrote:
I think Terry meant that you can't pick just one item that's equally likely to be any of the infinitely many items returned by the iterator. You can prove that by considering that the probability of a given item being returned would have to be 1/infinity, which is zero -- so you can't return anything! -- Greg

On Thu, Apr 14, 2016 at 09:47:37AM +1200, Greg Ewing wrote:
Correct. That's equivalent to choosing a positive integer with uniform probability distribution and no upper bound.
That's not how probability works :-) Consider a dart which is thrown at a dartboard. The probability of it landing on any specific point is zero, since the area of a single point is zero. Nevertheless, the dart does hit somewhere!

A formal and precise treatment would have to involve calculus and limits as the probability approaches zero, rather than a flat out "the probability is zero, therefore it's impossible". Slightly less formally, we can say (only horrifying mathematicians a little bit) that the probability of any specific number is an infinitesimal number. https://en.wikipedia.org/wiki/Infinitesimal

While it is *mathematically* meaningful to talk about selecting a random positive integer uniformly, it's hard to do much more than that. The mean (average) is undefined[1]. A typical value chosen would have a vast number of digits, far larger than anything that could be stored in computer memory. Indeed Almost All[2] of the values we generate would be so large that we have no notation for writing them down (and not enough space in the universe to write them even if we did). So it is impossible in practice to select a random integer with uniform distribution and no upper bound. Non-uniform distributions, though, are easy :-)

[1] While weird, this is not all that weird. For example, selecting numbers from a Cauchy distribution also has an undefined mean. What this means in practice is that the *sample mean* will not converge as you take more and more samples: the more samples you take, the more wildly the average will jump all over the place. https://en.wikipedia.org/wiki/Cauchy_distribution#Estimation_of_parameters

[2] https://en.wikipedia.org/wiki/Almost_all

-- Steve
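To put a concrete example behind that last sentence (the snippet is mine, not from Steven's post): a geometric draw returns an unbounded positive integer k with probability 2**-k, a perfectly usable non-uniform distribution.

    import random

    def geometric():
        """Return k >= 1 with probability 2**-k (unbounded support, mean 2)."""
        k = 1
        while random.randrange(2):
            k += 1
        return k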

Steven D'Aprano wrote:
The limit of p = 1/n as n goes to infinity is zero. Events with zero probability can't happen. I don't know how it can be made more rigorous than that. I think what this means is that if you somehow wrote a program to draw a number from an infinite uniform distribution, it would never terminate.
A typical value chosen would have a vast number of digits, far larger than anything that could be stored in computer memory.
I'm not sure it even makes sense to talk about a typical number, because for any number you pick, there are infinitely many numbers with more digits, but only finitely many with the same or fewer digits. So the probability of getting only that many digits is zero, too! Infinity: Just say no. -- Greg

On Thu, Apr 14, 2016 at 12:03 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
This only holds true if the sample space is finite. The mathematical term for an event with zero probability that nonetheless can happen is "almost never". See https://en.wikipedia.org/wiki/Almost_surely As usual, infinity is weird.

Greg Ewing writes:
Sorry, as Steven pointed out, events *that are modeled as* having zero probability happen all the time. What you should not do with such events is give them positive (i.e., atomic) weight in statistical calculations (= theory), and therefore you should not *predict* their occurrence as individuals (unless you have no fear of becoming a laughingstock = practice).
Infinity: Just say no.
I thought your name was Ewing. Is it really Brouwer?<wink/> (Don't bother, I know. The reference to Brouwer is close enough for tsukkomi[1].) Footnotes: [1] https://en.wikipedia.org/wiki/Glossary_of_owarai_terms N.B. "Straight man" is not an accurate translation.

For any positive integer you select (including those with more digits than there are particles in the universe), ALMOST ALL integers are larger than your selection. I.e. the measure of those smaller remains zero. On Apr 13, 2016 6:31 PM, "Steven D'Aprano" <steve@pearwood.info> wrote:

On 4/13/2016 6:46 AM, Chris Angelico wrote:
I have seen too many mathematical or statistical writers who ought to know better write "Let N be a random integer..." with no indication of a finite bound or of a distribution other than uniform. No wonder students and readers sometimes get confused.
With a perfect source of random numbers, this will halt with probability one. With current pseudorandom generators, this will definitely halt. And yes, both theoretically and practically, this is an example of skewed probabilities -- a waiting time distribution.
I'm not sure how useful it is, but it does accept potentially infinite iterables, and does return values selected at random...
People often want variates selected from a non-uniform distribution. Often, a uniform variate can be transformed. Sometimes multiple uniform variates are needed. If 'it' above is itertools.count, the above models the waiting time to get 'heads' (or 'tails') with a fair coin. Waiting times might be obtained more efficiently, perhaps, with, say, randrange(2**16), with another draw for as long as the value is 0, plus some arithmetic that uses int.bit_length. (Details to be verified.) -- Terry Jan Reedy
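One way to read that randrange(2**16) / bit_length suggestion (the bookkeeping below is my own guess at the details Terry left "to be verified"): draw the coin flips sixteen at a time, and let the leading zero bits of each block count the tails before the first heads.

    import random

    def waiting_time(block_bits=16):
        """Flips of a fair coin up to and including the first heads, drawn
        block_bits at a time instead of one randrange(2) call per flip."""
        flips = 0
        r = random.randrange(2 ** block_bits)
        while r == 0:                      # the whole block came up tails
            flips += block_bits
            r = random.randrange(2 ** block_bits)
        # Leading zero bits of r are the remaining tails before the first heads.
        return flips + (block_bits - r.bit_length()) + 1

Each call then has the same distribution as the one-flip-at-a-time loop above, P(k) = 2**-k, but typically costs a single randrange() call.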

Rob Cliffe <rob.cliffe@btinternet.com> writes:
I am +1 on the notion that an instance of ‘tuple’, ‘list’, ‘set’, and even ‘dict’, should each be accepted as input for ‘random.choice’.
I am −1 on the notion of an arbitrary iterable as ‘random.choice’ input. Arbitrary iterables may be consumed merely by being iterated. This will force the caller to make a copy, which may as well be a normal sequence type, defeating the purpose of accepting that “arbitrary iterable” type. Arbitrary iterables may never finish iterating. This means the call to ‘random.choice’ would sometimes never return. The ‘random.choice’ function should IMO not be responsible for dealing with those cases. Instead, a more moderate proposal would be to have ‘random.choice’ accept an arbitrary container. If the object implements the container protocol (by which I think I mean that it conforms to ‘collections.abc.Container’), it is safe to iterate and to treat its items as a collection from which to choose a random item. Rob, does that proposal satisfy the requirements that motivated this? -- \ “Some people, when confronted with a problem, think ‘I know, | `\ I'll use regular expressions’. Now they have two problems.” | _o__) —Jamie Zawinski, in alt.religion.emacs | Ben Finney
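A sketch of what Ben's "container, not arbitrary iterable" rule might look like in code (my interpretation only; the ABC checks and the tuple() fallback are illustrative, not part of his proposal):

    import random
    from collections.abc import Iterable, Sequence, Sized

    def choice_from_collection(coll):
        """Choose a random element from a finite, sized collection."""
        if not (isinstance(coll, Sized) and isinstance(coll, Iterable)):
            raise TypeError('argument must be a sized, iterable collection')
        if len(coll) == 0:
            raise IndexError('Cannot choose from an empty collection')
        if isinstance(coll, Sequence):
            return coll[random.randrange(len(coll))]   # O(1) for real sequences
        # Sets, frozensets, dict views, dicts (random key): materialise once.
        return random.choice(tuple(coll))

An infinite generator fails the Sized check up front instead of hanging, which is the property Ben is after.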

This still seems wrong *unless* we add a protocol to select a random item in O(1) time. There currently isn't one for sets and mappings -- only for sequences. It would be pretty terrible if someone wrote code to get N items from a set of size N and their code ended up being O(N**2). On Tue, Apr 12, 2016 at 3:57 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
-- --Guido van Rossum (python.org/~guido)

Thanks to everyone for all the feedback so far. Interesting. I had not heard of reservoir sampling, and it had not occurred to me that one could select a uniformly random element from a sequence of unknown length without copying it.

I am with Ben in considering "set" to be the most important type for random.choice to accept that it doesn't already. If just that could be achieved, yes Ben, I think that would be a plus. (The fact that I could write a version that handled generators (efficiently or not) just seemed to be a bonus.) ISTM however that there is a problem accepting dict, because of the ambiguity - does it mean a random key, a random value, or a random (key, value) item? (I would pick a random key (consistent with "for k in MyDict"), but ISTM "explicit is better than implicit".)

There are problems with arbitrary iterables, as Ben points out. But I don't see these as an objection to making random.choice accept them, because
(1) It is easy to produce examples of stdlib functions that can be crashed with infinite iterables, e.g.

    >>> def forever():
    ...     while 1: yield 1
    ...
    >>> itertools.product(forever(), forever())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError

(2) As I indicated in my original post, it is understood that generators are "throw-away", so may need to be copied. 'Consenting adults', no?

One small point: whenever I've tested it (quite a few times, for different projects), I've always found that it was faster to copy to a tuple than to a list.

I did a little experimenting (admittedly non-rigorous, on one platform only, and using Python 2.7.10, not Python 3, and using code which could very possibly be improved) on selecting a random element from a generator, and found that
  - for small or moderate generators, reservoir sampling was almost always slower than generating a tuple;
  - as the generator length increased up to roughly 10,000, the ratio (time taken by reservoir) / (time taken by tuple) increased, reaching a maximum of over 4;
  - as the generator length increased further, the ratio started to decrease, although for a length of 80 million (about as large as I could test) it was still over 3.
(This would change if the reservoir approach could be recoded in a way that was amazingly faster than the way I did it, but, arrogance aside, I doubt that. I will supply details of my code on request.) I think this is a tribute to the efficiency of tuple building. It's also enough to convince me that reservoir sampling, or Marco Cognetta's compromise approach, are non-starters. More rigorous testing would be necessary to convince the rest of the world.

I am also pretty well convinced from actual tests (not from theoretical reasoning) that the "convert it to a tuple" recipe is not far off O(N), both for sets and generators (it gets worse for generators that produce objects with large memory footprints), and is certainly fast enough to be useful (the point at which it gets unacceptably slow is generally not far from the point where you run out of memory). I did try an approach similar to Chris Angelico's of iterating over sets up to the selection point (my bikeshed actually used enumerate rather than zip), but it was much slower than the "tuple-ise" algorithm.

On 13/04/2016 00:04, Guido van Rossum wrote:
... their problem.
Have I misunderstood? Do you mean selecting N items with replacement? Hm, it occurs to me that random.sample itself could be extended to choose k unique random elements from a general iterable. I have seen a reservoir sampling algorithm that claims to be O(N): https://en.wikipedia.org/wiki/Reservoir_sampling
Rob
On Tue, Apr 12, 2016 at 3:57 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
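For reference, the O(N) algorithm on that Wikipedia page is the classic "Algorithm R"; a minimal k-element sketch (illustrative only, not the stdlib's random.sample) looks like this:

    import random

    def sample_from_iterable(iterable, k):
        """k uniformly chosen items from an iterable of unknown length,
        in one pass and O(k) extra memory (reservoir sampling)."""
        reservoir = []
        for i, value in enumerate(iterable):
            if i < k:
                reservoir.append(value)
            else:
                # Item i+1 displaces a random reservoir slot with probability k/(i+1).
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = value
        if len(reservoir) < k:
            raise ValueError('Sample larger than population')
        return reservoir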

On Wed, Apr 13, 2016 at 7:40 AM, Rob Cliffe <rob.cliffe@btinternet.com> wrote:
Small point of order: Pretty much everything discussed here on python-ideas is about Python 3. It's best to make sure your code works with the latest CPython (currently 3.5), as a change like this would be landing in 3.6 at the earliest. So what I'd be looking at is this: random.choice(x for x in range(10) if x%3) AFAIK this doesn't change your point at all, but it is worth being careful of; Python 3's range object isn't quite the same as Python 2's xrange, and it's possible something might not "just work". (For the inverse case, "if x%3 == 0", you can simply use random.choice(range(0, 10, 3)) to do what you want.) I don't like the proposed acceptance of arbitrary iterables. In the extremely rare case where you actually do want that, you can always manually wrap it in list() or tuple(). But your original use-case does have merit:
It surprised me a bit the first time I realised that random.choice did not work on a set.
A set has a length, but it can't be indexed. It should be perfectly reasonable to ask for a random element out of a set! So here's my counter-proposal: Make random.choice accept any iterable with a __len__.

    def choice(self, coll):
        """Choose a random element from a non-empty collection."""
        try:
            i = self._randbelow(len(coll))
        except ValueError:
            raise IndexError('Cannot choose from an empty collection')
        try:
            return coll[i]
        except TypeError:
            for _, value in zip(range(i+1), coll):
                pass
            return value

Feel free to bikeshed the method of iterating part way into a collection (enumerate() and a comparison? call iter, then call next that many times, then return next(it)?), but the basic concept is that you have to have a length and then you iterate into it that far. It's still not arbitrary iterables, but it'll handle sets and dict views. Handling dicts directly may be a little tricky; since they support subscripting, they'll either raise KeyError or silently return a value, where the iteration-based return value would be a key. Breaking this out into its own function would be reliable there (take out the try/except and just go straight into iteration). ChrisA

This is no good. Who wants a choice() that is O(1) on sequences but degrades to O(N) if the argument is a set? I see the current behavior as a feature: it works when the argument is indexable, and when it isn't, it fails cleanly, prompting the user to use their brain and decide what's right. Maybe you're programming a toy example. Then you can just call random.choice(list(x)) and move on. Or maybe you're trying to do something serious -- then it behooves you to copy the set into a list variable once and then repeatedly choose from that list, or maybe you should have put the data in a list in the first place. But putting the toy solution in the stdlib is a bad idea, and so is putting a bad (O(N)) algorithm there. So the status quo wins for a reason! --Guido On Tue, Apr 12, 2016 at 7:02 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Wed, Apr 13, 2016 at 12:11 PM, Guido van Rossum <guido@python.org> wrote:
This is no good. Who wants a choice() that is O(1) on sequences but degrades to O(N) if the argument is a set?
Fair point. Suggestion withdrawn. If you want something that does the iteration-till-it-finds-something, that should be a separate function. (And then it'd work automatically on dicts, too.) ChrisA

On Tue, Apr 12, 2016 at 2:40 PM, Rob Cliffe <rob.cliffe@btinternet.com> wrote:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) )
As you point out in your proposed solution, the problem is `random.choice` expects an index-able value. You can easily create an index-able set using the sortedcontainers.SortedSet interface even if the elements aren't comparable. Simply use:

```python
from sortedcontainers import SortedSet

def zero(value):
    return 0

class IndexableSet(SortedSet):
    def __init__(self, *args, **kwargs):
        super(IndexableSet, self).__init__(*args, key=zero, **kwargs)

values = IndexableSet(range(100))

import random
random.choice(values)
```

`__contains__`, `__iter__`, `len`, `add`, `__getitem__` and `__delitem__` will all be fast. But `discard` will take potentially linear time. So:

```
index = random.randrange(len(values))
value = values[index]
del values[index]
```

will be much faster than:

```
value = random.choice(values)
values.discard(value)
```

Otherwise `IndexableSet` will work as a drop-in replacement for your `set` needs. Read more about the SortedContainers project at http://www.grantjenks.com/docs/sortedcontainers/ Grant

Maybe, instead of all of this - since, as others have put it, there would hardly be a "one size fits all" - the nice thing to have would be a "choice" or "randompop" method for sets. Sets are the one natural kind of object from which it would seem normal to be able to pick a random element (or pop one), since they have no order to start with, and the one for which it is frustrating when one realizes "random.choice" fails. So either "random.set_choice" and "random.set_pop" functions in random, or "choice" and "pop_random" on the "set" interface itself, could be more practical - in terms of real-world usage - than speculating on a way for choice to work on iterables of unknown size, among other objects. On 13 April 2016 at 02:20, Grant Jenks <grant.jenks@gmail.com> wrote:

On 13 April 2016 at 06:29, Joao S. O. Bueno <jsbueno@python.org.br> wrote:
Given that your choice method for sets would be little more than value = random.choice(tuple(s)) it seems like it's probably overkill to build it into Python - particularly as doing so restricts the implementation to Python 3.6+ whereas writing it out by hand works for any version. It also makes the complexity of the operation explicit, which is a good thing. Paul

On 13/04/2016 09:32, Paul Moore wrote:
Isn't there an inconsistency that random.sample caters to a set by converting it to a tuple, but random.choice doesn't? Surely random.sample(someSet, k) will always be slower than a "tuple-ising" random.choice(someSet) implementation? And it wouldn't prevent you from converting your set (or other iterable) to a sequence explicitly. Rob

On Wed, Apr 13, 2016 at 2:47 AM, Rob Cliffe <rob.cliffe@btinternet.com> wrote:
Isn't there an inconsistency that random.sample caters to a set by converting it to a tuple, but random.choice doesn't?
Perhaps because the use cases are different? Over the years I've learned that inconsistencies aren't always signs of sloppy thinking -- they may actually point to deep issues that aren't apparent on the surface. I imagine the typical use case for sample() to be something that samples the population once and then does something to the sample; the next time sample() is called the population is probably different (e.g. the next lottery has a different set of players). But I imagine a fairly common use case for choice() to be choosing from the same population over and over, and that's exactly the case where the copying implementation you're proposing would be a small disaster. -- --Guido van Rossum (python.org/~guido)

participants (20)
- Ben Finney
- Chris Angelico
- Chris Barker - NOAA Federal
- David Mertz
- Ethan Furman
- Grant Jenks
- Greg Ewing
- Guido van Rossum
- Ian Kelly
- Joao S. O. Bueno
- M.-A. Lemburg
- Mahmoud Hashemi
- Marco Cognetta
- Michael Selik
- Paul Moore
- Rob Cliffe
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Tim Peters