Hi.

In writing my previous email, I noticed something about zip() that I'd not seen before (but is obvious, I guess) - when it reaches the end of the shortest sequence and terminates, any iterators already processed in that pass will have generated one more value than the others. Those additional values are discarded.

For example:

    h = iter("Hello")
    w = iter("World")
    s = iter("Spam")
    e = iter("Eggs")

    for i in zip(h, w, s, e):
        print(i)

    for i in (h, w, s, e):
        print(list(i))

---> All iterators are exhausted.

    h = iter("Hello")
    w = iter("World")
    s = iter("Spam")
    e = iter("Eggs")

    for i in zip(h, s, e, w):
        print(i)

    for i in (h, w, s, e):
        print(list(i))

---> "w" still has the trailing 'd' character.

So, if you're using zip() in preference to itertools.zip_longest(), then you have to be careful about the order of your arguments and try to put the probably-shortest one first if this would otherwise cause problems.

The reason I'm posting to 'ideas' is: what should/could be done about it?

1) A simple warning in the docstring for zip()?

2) Something to prevent it (for example, a keyword argument to zip() to switch on some behaviour where the iterators are first queried that they have more items to generate before the values start being consumed)?

3) Nothing. There are bigger things to worry about ;)

WRT (2), I thought that perhaps __len__ was part of the iterator protocol, but it's not (just __iter__ and __next__), hence:
    >>> len(range(5, 40))
    35
    >>> len(iter(range(5, 40)))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: object of type 'range_iterator' has no len()
    >>> len(iter("FooBar"))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: object of type 'str_iterator' has no len()
... though would that also be something to consider (I guess all iterators would have to keep some state regarding the number of values previously generated and then apply that offset to the result of len() on the underlying object)? Perhaps that would just be too heavyweight for what is a relatively minor wart.

E.
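The two runs above can be condensed into one small function. This is only a sketch of the behaviour as observed in CPython; the helper name `leftovers` is hypothetical and not part of any library.

```python
def leftovers(order):
    """Zip four fresh string iterators in the given order; report what each has left."""
    words = {"h": "Hello", "w": "World", "s": "Spam", "e": "Eggs"}
    iterators = {name: iter(text) for name, text in words.items()}

    # Consume the zip completely; it stops at the first exhausted iterator.
    for _ in zip(*(iterators[name] for name in order)):
        pass

    # Whatever each iterator can still produce after zip() has finished.
    return {name: "".join(it) for name, it in iterators.items()}


print(leftovers("hwse"))  # {'h': '', 'w': '', 's': '', 'e': ''}
print(leftovers("hsew"))  # {'h': '', 'w': 'd', 's': '', 'e': ''}
```

In the second call, "h" gives up its final 'o' before "s" is found to be exhausted, while "w" is never asked for its fifth value.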
On Fri, Feb 12, 2016 at 6:39 PM Erik <python@lucidity.plus.com> wrote:
> Hi.
>
> In writing my previous email, I noticed something about zip() that I'd not seen before (but is obvious, I guess) - when it reaches the end of the shortest sequence and terminates, any iterators already processed in that pass will have generated one more value than the others. Those additional values are discarded.
>
> For example:
>
>     h = iter("Hello")
>     w = iter("World")
>     s = iter("Spam")
>     e = iter("Eggs")
>
>     for i in zip(h, w, s, e):
>         print(i)
>
>     for i in (h, w, s, e):
>         print(list(i))
>
> ---> All iterators are exhausted.
>
>     h = iter("Hello")
>     w = iter("World")
>     s = iter("Spam")
>     e = iter("Eggs")
>
>     for i in zip(h, s, e, w):
>         print(i)
>
>     for i in (h, w, s, e):
>         print(list(i))
>
> ---> "w" still has the trailing 'd' character.
>
> So, if you're using zip() in preference to itertools.zip_longest(), then you have to be careful about the order of your arguments and try to put the probably-shortest one first if this would otherwise cause problems.
>
> The reason I'm posting to 'ideas' is: what should/could be done about it?
>
> 1) A simple warning in the docstring for zip()?

I wouldn't want to clutter the docstring, but a note in the long-form documentation could be useful.

> 2) Something to prevent it (for example, a keyword argument to zip() to switch on some behaviour where the iterators are first queried that they have more items to generate before the values start being consumed)?

How can you query whether an iterator has another value without consuming that value?

> 3) Nothing. There are bigger things to worry about ;)
>
> WRT (2), I thought that perhaps __len__ was part of the iterator protocol, but it's not (just __iter__ and __next__), hence:
>
>     >>> len(range(5, 40))
>     35
>     >>> len(iter(range(5, 40)))
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     TypeError: object of type 'range_iterator' has no len()
>     >>> len(iter("FooBar"))
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     TypeError: object of type 'str_iterator' has no len()
>
> ... though would that also be something to consider (I guess all iterators would have to keep some state regarding the number of values previously generated and then apply that offset to the result of len() on the underlying object)? Perhaps that would just be too heavyweight for what is a relatively minor wart.
How would you handle the length of an infinite iterator? Or one that *might* be infinite, depending on the current state of the program?

A more realistic example: if I'm looking up N records from a distributed database, I might do that in parallel and get the results back unordered, as an iterator. If M of the queries time out, I might choose to ignore those records and exclude them from the resulting iterator. So, when I kick off the queries, the length of that iterator might be N. When the timeouts are finished, the length is N-M. Further, if I've consumed 2 records, is the length still N-M or N-M-2?
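On the question of querying an iterator without consuming from it: the iterator protocol offers no such operation, but a wrapper can pull one value ahead and hold on to it. The sketch below is purely illustrative; `Peekable`, `has_next` and `careful_zip` are hypothetical names, not existing builtins, and the remaining values are only reachable through the wrappers, not through the original iterators.

```python
class Peekable:
    """Wrap an iterator and buffer at most one value so callers can look ahead."""

    _EMPTY = object()  # marks "nothing buffered"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buffer = self._EMPTY

    def has_next(self):
        # Pull one value from the underlying iterator into the buffer if needed.
        if self._buffer is self._EMPTY:
            try:
                self._buffer = next(self._it)
            except StopIteration:
                return False
        return True

    def __iter__(self):
        return self

    def __next__(self):
        if not self.has_next():
            raise StopIteration
        value, self._buffer = self._buffer, self._EMPTY
        return value


def careful_zip(*peekables):
    """Yield tuples only while every Peekable still has a value; discard nothing."""
    while peekables and all(p.has_next() for p in peekables):
        yield tuple(next(p) for p in peekables)


h, s, e, w = (Peekable(text) for text in ("Hello", "Spam", "Eggs", "World"))
for _ in careful_zip(h, s, e, w):
    pass
print(list(h), list(w))  # ['o'] ['d']
```

Here the 'o' that plain zip() would have discarded is still available from the "h" wrapper. This does nothing for the infinite or changing-length cases raised above; it only looks one value ahead.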
BTW, from the documentation (https://docs.python.org/3/library/functions.html#zip):

"zip() should only be used with unequal length inputs when you don’t care about trailing, unmatched values from the longer iterables. If those values are important, use itertools.zip_longest() <https://docs.python.org/3/library/itertools.html#itertools.zip_longest> instead."
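For comparison, a minimal example of the itertools.zip_longest() alternative that the documentation points to. The two input strings are reused from the example earlier in the thread; `fillvalue=None` is the default and is spelled out only for clarity.

```python
from itertools import zip_longest

h = iter("Hello")
s = iter("Spam")

# zip_longest() pads the shorter input with fillvalue instead of stopping early,
# so no value from the longer input is silently dropped.
pairs = list(zip_longest(h, s, fillvalue=None))
print(pairs[-1])  # ('o', None)
```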
On Feb 12, 2016, at 15:51, Michael Selik <mike@selik.org> wrote:
> BTW, from the documentation (https://docs.python.org/3/library/functions.html#zip):
>
> "zip() should only be used with unequal length inputs when you don’t care about trailing, unmatched values from the longer iterables. If those values are important, use itertools.zip_longest() instead."
I think what's missing (from his point of view) is some statement that if you call it with iterators, it's not just the trailing values, but also the iterators' states that you shouldn't care about. I always took that as read without it needing to be stated. But maybe it does need stating?
On 13/02/16 00:09, Andrew Barnert wrote:
> On Feb 12, 2016, at 15:51, Michael Selik <mike@selik.org> wrote:
> > BTW, from the documentation (https://docs.python.org/3/library/functions.html#zip):
> > [snip]
>
> I think what's missing (from his point of view) is some statement that if you call it with iterators, it's not just the trailing values, but also the iterators' states that you shouldn't care about.
Yes. And also, as someone else pointed out to me privately, the docstring can be interpreted as already covering this:

"""
Return a zip object whose .__next__() method returns a tuple where the i-th element comes from the i-th iterable argument. The .__next__() method continues until the shortest iterable in the argument sequence is exhausted and then it raises StopIteration.
"""

It just depends on what "The .__next__() method continues" is supposed to mean. I can see that for the obvious implementation it means what actually happens (because the method is implemented in the obvious way ;)), but it _could_ be interpreted as meaning that when __next__ is called when the shortest iterable is exhausted then it does not do anything at all.

This is in the docstring of a function that will be called by casual and newbie users. Are they expected to read up on what __next__ means and mentally imagine the mechanics of the loop that is implementing this function for them so they understand all the side-effects?
> I always took that as read without it needing to be stated. But maybe it does need stating?
There is also the issue that the CPython implementation of this is not necessarily the _only_ way of implementing this. Another implementation might construct the tuple in reverse order, for example, so that a different set of iterators has the extra value consumed.

I don't think it's unreasonable to state clearly that _any_ iterator longer than the shortest may or may not have at least one extra value extracted from it, which will then be discarded.

Anyway, I'm over it now. I just thought I'd mention it. I've obviously never run into a real problem with it in the wild so perhaps it's really not an issue.

E.
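The point about implementation order can be illustrated with two pure-Python generators. Both are sketches under stated assumptions: `zip_forward` pulls values left to right, which matches the behaviour observed in CPython in this thread, while `zip_backward` is a purely hypothetical variant that no known implementation is claimed to use. The helper `remaining_after` is likewise made up for the demonstration.

```python
def zip_forward(*iterables):
    """Pure-Python sketch of zip(): pull from the iterators left to right."""
    iterators = [iter(it) for it in iterables]
    sentinel = object()
    while iterators:
        row = []
        for it in iterators:
            value = next(it, sentinel)
            if value is sentinel:
                return  # values already collected in `row` are discarded here
            row.append(value)
        yield tuple(row)


def zip_backward(*iterables):
    """Hypothetical variant that pulls right to left but yields the same tuples."""
    iterators = [iter(it) for it in iterables]
    sentinel = object()
    while iterators:
        row = []
        for it in reversed(iterators):
            value = next(it, sentinel)
            if value is sentinel:
                return
            row.append(value)
        yield tuple(reversed(row))


def remaining_after(zip_impl):
    """Run zip_impl over two fresh iterators and report what each has left."""
    h, s = iter("Hello"), iter("Spam")
    tuples = list(zip_impl(h, s))
    return tuples, list(h), list(s)


print(remaining_after(zip_forward)[1:])   # ([], [])
print(remaining_after(zip_backward)[1:])  # (['o'], [])
```

Both variants yield identical tuples; they differ only in which input loses a value when the shortest one runs out.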
participants (3)
- Andrew Barnert
- Erik
- Michael Selik