[Python-Dev] RE: [Python-checkins] python/nondist/peps pep-0323.txt, NONE, 1.1 pep-0000.txt, 1.254, 1.255

Excellent PEP! Consider adding your bookmarking example; I found it to be a compelling use case. Also note that there are many variations on the bookmarking theme (undo utilities, macro recording, parser lookahead functions, backtracking, etc.).

Under drawbacks and issues there are a couple of thoughts:

* Not all iterators will be copyable. Knowing which is which creates a bit of a usability issue (i.e. the question of whether a particular iterator is copyable will come up every time) and a substitution issue (i.e. code which depends on copyability precludes substituting other iterators that don't have it).

* In addition to knowing whether a given iterator is copyable, a user should also know whether the copy is lightweight (just an index or some such) or heavy (storing all of the data for future use). They should know whether it is active (intercepting every call to iter()) or inert.

* For heavy copies, there is a performance trap when the stored data stream gets too long. At some point, just using list() would be better.

Consider adding a section with pure Python sample implementations for listiter.__copy__, dictiter.__copy__, etc.

Also, I have a question about the semantic specification of what a copy is supposed to do. Does it guarantee that the same data stream will be reproduced? For instance, would a generator of random words expect its copy to generate the same word sequence? Or would a copy of a dictionary iterator change its output if the underlying dictionary got updated (i.e. should the dict be frozen to changes while a copy exists, or should it mutate)?

Raymond Hettinger
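(A taste of what such a sample implementation might look like -- a hypothetical pure-Python ListIter, whose copy is "lightweight" in the sense above: it shares the underlying list and duplicates only an index:)

class ListIter(object):
    """Sketch of a pure-Python list iterator with a cheap __copy__."""
    def __init__(self, seq, index=0):
        self.seq = seq        # shared, never duplicated
        self.index = index    # the only per-iterator state
    def __iter__(self):
        return self
    def next(self):           # __next__ in Python 3
        try:
            result = self.seq[self.index]
        except IndexError:
            raise StopIteration
        self.index += 1
        return result
    def __copy__(self):
        # Lightweight copy: O(1) memory, just a second index.
        return ListIter(self.seq, self.index)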

Every attempt should be made for the two copies to return exactly the same stream of values. This is the pure tee() semantics. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Tuesday 28 October 2003 01:54 am, Guido van Rossum wrote:
Yes, but iterators that run on underlying containers don't guarantee, in general, what happens if the container is mutated while the iteration is going on -- arbitrary items may end up being skipped, repeated, etc. So, "every attempt" is, I feel, too strong here. deepcopy exists for those cases where one is ready to pay a hefty price for guarantees of "decoupling", after all. Alex

Maybe. I agree that for list and dict iterators, if the underlying container is mutated, this warranty shall be void. But I strongly believe that cloning a random iterator should produce two identical streams of numbers, not two different random streams. If you want two random streams you should create two independent iterators. Most random number generators have a sufficiently small amount of state that making a copy isn't a big deal. If it is hooked up to an external source (e.g. /dev/random) then I'd say you'd have to treat it as a file, and introduce explicit buffering.
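(A minimal sketch of how cheap such a copy can be -- a hypothetical RandomIter whose entire state is one random.Random instance, cloned via getstate()/setstate(); original and copy then step through the very same stream:)

import random

class RandomIter(object):
    """Sketch: a copyable pseudo-random iterator."""
    def __init__(self, rng=None):
        if rng is None:
            rng = random.Random()
        self.rng = rng
    def __iter__(self):
        return self
    def next(self):               # __next__ in Python 3
        return self.rng.random()
    def __copy__(self):
        # The whole state fits in getstate(); clone it and the two
        # iterators yield identical streams -- pure tee() semantics.
        clone = random.Random()
        clone.setstate(self.rng.getstate())
        return RandomIter(clone)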
deepcopy exists for those cases where one is ready to pay a hefty price for guarantees of "decoupling", after all.
But I don't propose that iterators support __deepcopy__. The use case is very different. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Tuesday 28 October 2003 04:15 pm, Guido van Rossum wrote: ...
OK, noticed -- and I'll clarify the PEP about this, thanks.
I really truly REALLY like this. I _was_ after all the one who lobbied Tim to add getstate and setstate to random.py, back in the not-so-long-ago time when I was a total Python newbie -- exactly because, being a NOT-newbie consumer of pseudo-random streams, I loathed and detested the pseudo-random generators that didn't allow me to reproduce experiments in this way. So, I entirely agree that if pseudo-random numbers are being consumed through a "pseudo-random iterator" the copy should indeed step through just the same numbers. Again, this will get in the PEP -- *thanks*!

Btw, random.py doesn't seem to supply pseudo-random iterators -- easy to make e.g. with iter(random.random, None) [assuming you want a nonterminating one], but that wouldn't be copyable. Should something be done about that...?

And as for NON-pseudo random numbers, such as those supplied by /dev/random and other external sources, yes, of course, they should NOT be copyable -- best to let tee() work on them by making its buffer, or else wrap them in a buffer-to-file way if one needs to "snapshot" the sequence then re-"generate" a lot of it later for reproducibility purposes. I.e., absolute agreement.
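(For reference, the getstate/setstate dance in question -- snapshotting a random.Random and rewinding it to replay the same experiment:)

import random

rng = random.Random(42)
saved = rng.getstate()                    # snapshot the full generator state
first = [rng.random() for i in range(3)]

rng.setstate(saved)                       # rewind
again = [rng.random() for i in range(3)]
assert first == again                     # the streams are identical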
Yes, the use case of __deepcopy__ is indeed quite different (and to be honest it doesn't appear in my actual experience -- I can "imagine" some as well as the next man, but they'd be made out of whole cloth:-). But I was under the impression that you wanted them in PEP 323 too? Maybe I misunderstood your words. Should I take them out of PEP 323? In that case somebody else can later PEP that if they want, and I can basically wash my hands of them -- what do you think? Alex

[Alex]
Probably a bit faster with: starmap(random.random, repeat(()))
but that wouldn't be copyable. Should something be done about that...?
No.

1) The use case is not typical. Other than random.random() and time.ctime(), it is rare to see functions of zero arguments that usefully return an infinite sequence of distinct values.

2) If you need a copy, run it through tee().

Raymond
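(A minimal sketch of Raymond's advice with itertools.tee -- the two iterators replay one underlying stream, buffering only the values one side has seen and the other has not:)

import itertools

def naturals():
    n = 0
    while True:
        yield n
        n += 1

a, b = itertools.tee(naturals())
assert a.next() == 0 and a.next() == 1    # a runs ahead (next(a) in Python 3)
assert b.next() == 0                      # b replays the same stream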

On Tuesday 28 October 2003 06:09 pm, Raymond Hettinger wrote:
Yep, saving the useless compare does help:

[alex@lancelot bo]$ timeit.py -c -s'import random' -s'import itertools as it' \
    -s'xx=it.starmap(random.random, it.repeat(()))' 'xx.next()'
1000000 loops, best of 3: 1.37 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import random' -s'import itertools as it' \
    -s'xx=iter(random.random, None)' 'xx.next()'
1000000 loops, best of 3: 1.62 usec per loop

Renewed compliments for itertools' speed!-)
Sorry, I must have been unclear -- I had no "zero arguments" limitation in mind. Rather, I thought of something like:

it = random.iterator(random.Random().sample, range(8), 3)

now each call to it.next() would return a random sample of 3 numbers from range(8) w/o repetition. I.e., the usual <makemeacallbackobject>(callable, *args) idiom (as in Tkinter widgets' .after method, etc, etc).

What I'm saying is that there is no reason type(it) shouldn't support a __copy__ method -- as long as the underlying callable sports an im_self which also exposes a __getstate__ method, at least.

Come to think of this, there may be other use cases for this general approach than "random iterators". Do you think that an iterator on a callable *and args for it* would live well in itertools? That module IS, after all, your baby...
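(A sketch of the hypothetical random.iterator(callable, *args) -- the class name and the im_self trick are assumptions from the paragraph above, not an existing API:)

import random

class CallIter(object):
    """Sketch: iterate by re-invoking a callable on saved args."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args
    def __iter__(self):
        return self
    def next(self):                   # __next__ in Python 3
        return self.func(*self.args)
    def __copy__(self):
        # Works only when func is a bound method of an object whose
        # state we can clone -- here, a random.Random via getstate().
        rng = random.Random()
        rng.setstate(self.func.im_self.getstate())
        return CallIter(getattr(rng, self.func.__name__), *self.args)

it = CallIter(random.Random().sample, range(8), 3)
# each it.next() -> a random 3-item sample from range(8), w/o repetition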
2) If you need a copy, run it through tee().
That's exactly what I plan to do -- but I would NOT want tee() to consume O(N) memory [where N is how far out of step the two iterators may get] in those cases where the iterator argument DOES have a __copy__ method that can presumably produce a usable copy with O(1) memory expenditure. Thus, I'd like itertools.tee to start by checking if its argument iterator "is properly copyable".

Guido has pointed out that it would not be safe to just try copy.copy(it), because that MIGHT produce a copy that does not satisfy "iterator copying" semantics requirements. As an example, he has repeatedly mentioned "an iterator on a tree which keeps ``a stack of indices''". Here, I think, is an indication of the kind of thing he fears (code untested beyond running it on that one example):

import copy

class TreeIter(object):
    def __init__(self, tree):
        self.tree = [tree]
        self.indx = [-1]
    def __iter__(self):
        return self
    def next(self):
        if not self.indx:
            raise StopIteration
        self.indx[-1] += 1
        try:
            result = self.tree[-1][self.indx[-1]]
        except IndexError:
            self.tree.pop()
            self.indx.pop()
            if not self.indx:
                raise StopIteration
            return self.next()
        if type(result) is not list:
            return result
        self.tree.append(result)
        self.indx.append(-1)
        return self.next()

x = [ [1,2,3], [4, 5, [6, 7, 8], 9], 10, 11, [12] ]

print 'all items, one:',
for i in TreeIter(x): print i,
print

print 'all items, two:',
it = TreeIter(x)
for i in it:
    print i,
    if i==6: cop = copy.copy(it)
print

print '>=6 items, one:',
for i in cop: print i,
print

print '>=6 items, two:',
it = TreeIter(x)
for i in it:
    if i==6: cop = copy.deepcopy(it)
for i in cop: print i,
print

Output is:

[alex@lancelot bo]$ python treit.py
all items, one: 1 2 3 4 5 6 7 8 9 10 11 12
all items, two: 1 2 3 4 5 6 7 8 9 10 11 12
>=6 items, one:
>=6 items, two: 7 8 9 10 11 12
i.e., the "iterator copy" returned by copy.copy does NOT satisfy requirements! (I've added the last tidbit to show that the one returned by copy.deepcopy WOULD satisfy them, but it's clearly WAY too memory-costly to consider, far worse than tee()!!!!)

So, "safely copying an iterator" means ensuring the iterator's author HAS thought specifically about allowing a copy -- in which case, we can (well, we _must_:-) trust that they have implemented things correctly. Just using copy.copy(it) MIGHT fall afoul of a default shallow copy not being sufficient. Perhaps we can get by with checking for, and using if found, a __copy__ method only. Is there a specific need to support __setstate__ etc, here? I hope not -- still, these, too, are things I must make sure are mentioned in the PEP.

Alex
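(In this example, the author-supplied __copy__ that "thinking specifically about allowing a copy" amounts to is tiny -- a sketch: duplicate the two stacks that the default shallow copy would share, O(depth) memory instead of deepcopy's O(tree):)

class CopyableTreeIter(TreeIter):
    def __copy__(self):
        new = self.__class__.__new__(self.__class__)
        new.tree = self.tree[:]    # snapshot the stack of sub-lists
        new.indx = self.indx[:]    # snapshot the stack of indices
        return new

With this, copy.copy(it) taken at item 6 keeps yielding 7 8 9 10 11 12 no matter what the original iterator does afterwards.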

I think it would be better if PEP 323 only did __copy__, so you can remove all traces of __deepcopy__. I don't recall what I said, maybe I wasn't clear. --Guido van Rossum (home page: http://www.python.org/~guido/)

On Monday 27 October 2003 11:45 pm, Raymond Hettinger wrote:
I will -- thanks!
Yes, I'll have to mention that (that the royal road for user code to access "iterator copying" functionality is via tee() when feasible).
Heavy copies should be left to 'tee' more often than not.
* For heavy copies, there is a performance trap when the stored data stream gets too long. At some point, just using list() would be better.
Or saving to disk, beyond a further threshold.
Consider adding a section with pure python sample implementations for listiter.__copy__, dictiter.__copy__, etc.
OK, but some of it's gonna be very-pseudo code (how do you mimic dictiter's real behaviour in pure Python...?).
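(Something like this very-pseudo sketch, perhaps -- the real dictiter walks the hash table's slots directly in C, so a key list plus a size check is only an approximation:)

class DictIter(object):
    """Sketch: an approximate pure-Python dict iterator."""
    def __init__(self, d, keys=None, index=0):
        if keys is None:
            keys = list(d.keys())
        self.d = d
        self.keys = keys
        self.size = len(d)
        self.index = index
    def __iter__(self):
        return self
    def next(self):                   # __next__ in Python 3
        if len(self.d) != self.size:
            raise RuntimeError("dictionary changed size during iteration")
        if self.index >= len(self.keys):
            raise StopIteration
        key = self.keys[self.index]
        self.index += 1
        return key
    def __copy__(self):
        # Lightweight: share the key list, duplicate only the index.
        return DictIter(self.d, self.keys, self.index)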
I'll have to clarify this as per the followup discussion on this thread -- pseudo-random iterators (I'll give an example) should be copyable and ensure the same stream from original and copy; real-random iterators (e.g. from /dev/random) should not be; iterators on e.g. lists and dicts should not freeze the underlying container when copied any more than they do when first generated (in general, if you mutate a dict or list you're iterating on, Python doesn't guarantee "sensible" behavior...). Thanks, Alex

participants (3):
- Alex Martelli
- Guido van Rossum
- Raymond Hettinger