This still seems wrong *unless* we add a protocol to select a random item in O(1) time. There currently isn't one for sets and mappings -- only for sequences. Hm, would this need a change to the structure of dicts and sets? Is there by any chance a quick way of getting the Nth item added in an OrderedDict? It would be pretty terrible if someone wrote code to get N items from a set of size N and their code ended up being O(N**2). Er, I'm not quite sure what you're saying here. random.sample allows you to select k items from a set of size N (it converts the set to a tuple). If k=N, i.e. you're trying to select the whole set, that's a silly thing to do if you know you're doing it. But even in this case, random.sample is approximately O(N); I tested it. If someone writes their own inefficient version of random.sample, that's
Thanks to everyone for all the feedback so far. Interesting. I had not heard of reservoir sampling, and it had not occurred to me that one could select a uniformly random element from a sequence of unknown length without copying it. I am with Ben in that I consider the most important type for random.choice to accept, that it doesn't already, is "set". If just that could be achieved, yes Ben, I think that would be a plus. (The fact that I could write a version that handled generators (efficiently or not) just seemed to be a bonus.) ISTM however that there is a problem accepting dict, because of the ambiguity - does it mean a random key, a random value, or a random (key,value) item? (I would pick a random key (consistent with "for k in MyDict"), but ISTM "explicit is better than implict".) There are problems with arbitrary iterables as Ben points out. But I don't see these as an objection to making random.choice accept them, because (1) It is easy to produce examples of stdlib functions that can be crashed with infinite iterables, e.g. >>> def forever(): ... while 1: yield 1 ... >>> itertools.product(forever(), forever()) Traceback (most recent call last): File "<stdin>", line 1, in <module> MemoryError (2) As I indicated in my original post, it is understood that generators are "throw-away", so may need to be copied. 'Consenting adults', no? One small point: Whenever I've tested it (quite a few times, for different projects), I've always found that it was faster to copy to a tuple than to a list. I did a little experimenting (admittedly non-rigorous, on one platform only, and using Python 2.7.10, not Python 3, and using code which could very possibly be improved) on selecting a random element from a generator, and found that for small or moderate generators, reservoir sampling was almost always slower than generating a tuple as the generator length increased up to roughly 10,000, the ratio (time taken by reservoir) / (time taken by tuple) increased, reaching a maximum of over 4 as the generator length increased further, the ratio started to decrease, although for a length of 80 million (about as large as I could test) it was still over 3. (This would change if the reservoir approach could be recoded in a way that was amazingly faster than the way I did it, but, arrogance aside, I doubt that. I will supply details of my code on request.) I think this is a tribute to the efficiency of tuple building. It's also enough to convince me that reservoir sampling, or Marco Cognetta's compromise approach, are non-starters. More rigorous testing would be necessary to convince the rest of the world. I am also pretty well convinced from actual tests (not from theoretical reasoning) that: the "convert it to a tuple" recipe is not far off O(N), both for sets and generators (it gets worse for generators that produce objects with large memory footprints), and is certainly fast enough to be useful (the point at which it gets unacceptably slow is generally not far from the point where you run out of memory). I did try an approach similar to Chris Angelico's of iterating over sets up to the selection point (my bikeshed actually used enumerate rather than zip), but it was much slower than the "tuple-ise" algorithm. On 13/04/2016 00:04, Guido van Rossum wrote: their problem. Have I misunderstood? Do you mean selecting N items with replacement? Hm, it occurs to me that random.sample itself could be extended to choose k unique random elements from a general iterable. I have seen a reservoir sampling algorithm that claims to be O(N): https://en.wikipedia.org/wiki/Reservoir_sampling Rob On Tue, Apr 12, 2016 at 3:57 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
Rob Cliffe <rob.cliffe@btinternet.com> writes:
It surprised me a bit the first time I realised that random.choice did not work on a set. (One expects "everything" to "just work" in Python! :-) ) I am +1 on the notion that an instance of ‘tuple’, ‘list’, ‘set’, and even ‘dict’, should each be accepted as input for ‘random.choice’.
On 2011-06-23, Sven Marnach posted 'A few suggestions for the random module' on Python-Ideas, one of which was to allow random.choice to work on an arbitrary iterable. I am −1 on the notion of an arbitrary iterable as ‘random.choice’ input.
Arbitrary iterables may be consumed merely by being iterated. This will force the caller to make a copy, which may as well be a normal sequence type, defeating the purpose of accepting that “arbitrary iterable” type.
Arbitrary iterables may never finish iterating. This means the call to ‘random.choice’ would sometimes never return.
The ‘random.choice’ function should IMO not be responsible for dealing with those cases.
Instead, a more moderate proposal would be to have ‘random.choice’ accept an arbitrary container.
If the object implements the container protocol (by which I think I mean that it conforms to ‘collections.abc.Container’), it is safe to iterate and to treat its items as a collection from which to choose a random item.
Rob, does that proposal satisfy the requirements that motivated this?
-- \ “Some people, when confronted with a problem, think ‘I know, | `\ I'll use regular expressions’. Now they have two problems.” | _o__) —Jamie Zawinski, in alt.religion.emacs | Ben Finney
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/