On Fri, Feb 12, 2016 at 6:39 PM Erik <python@lucidity.plus.com> wrote:
Hi.

In writing my previous email, I noticed something about zip() that I'd
not seen before (but is obvious, I guess) - when it reaches the shortest
sequence and terminates, any iterators already processed in that pass
will have generated one more value than the others. Those additional
values are discarded.

For example:

h = iter("Hello")
w = iter("World")
s = iter("Spam")
e = iter("Eggs")

for i in zip(h, w, s, e):
    print(i)

for i in (h, w, s, e):
    print(list(i))

---> All iterators are exhausted.

h = iter("Hello")
w = iter("World")
s = iter("Spam")
e = iter("Eggs")

for i in zip(h, s, e, w):
    print(i)

for i in (h, w, s, e):
    print(list(i))


---> "w" still has the trailing 'd' character.


So, if you're using zip() rather than itertools.zip_longest(), you have to
be careful about the order of your arguments and try to put the
probably-shortest one first if this would otherwise cause problems.
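
To make the ordering effect easy to check, here is a small sketch that zips the same four strings in different orders and reports what each iterator still holds afterwards. The helper name leftovers() is made up for this illustration.

```python
def leftovers(order):
    """Zip fresh iterators in the given order; return what each has left.

    `order` is a string of the names h, w, s, e (hypothetical helper).
    """
    words = {"h": "Hello", "w": "World", "s": "Spam", "e": "Eggs"}
    its = {name: iter(word) for name, word in words.items()}
    for _ in zip(*(its[name] for name in order)):
        pass  # consume the zip; only the side effects matter here
    return {name: "".join(it) for name, it in its.items()}


print(leftovers("hwse"))  # {'h': '', 'w': '', 's': '', 'e': ''}
print(leftovers("hsew"))  # {'h': '', 'w': 'd', 's': '', 'e': ''}
print(leftovers("sehw"))  # {'h': 'o', 'w': 'd', 's': '', 'e': ''}
```

With a shortest iterator first, zip() stops before touching the others in the final pass, so nothing is lost.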


The reason I'm posting to 'ideas' is: what should/could be done about it?

1) A simple warning in the docstring for zip()?

I wouldn't want to clutter the docstring, but a note in the long-form documentation could be useful.
 
2) Something to prevent it (for example a keyword argument to zip() to
switch on some behaviour where the iterators are first asked whether they
have more items to generate before any values are consumed)?

How can you query whether an iterator has another value without consuming that value?
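
With a bare iterator you can't. The usual workaround is to wrap it and buffer the looked-ahead value, so it is consumed from the underlying iterator but not lost. A minimal sketch, assuming the caller is willing to wrap the inputs itself; Peekable and zip_checked are hypothetical names, not anything in the standard library:

```python
_MISSING = object()  # sentinel meaning "nothing buffered"


class Peekable:
    """Hypothetical wrapper that buffers one value so it can look ahead."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buffer = _MISSING

    def has_next(self):
        # Pull one value from the underlying iterator and keep it for later.
        if self._buffer is _MISSING:
            try:
                self._buffer = next(self._it)
            except StopIteration:
                return False
        return True

    def __iter__(self):
        return self

    def __next__(self):
        if not self.has_next():
            raise StopIteration
        value, self._buffer = self._buffer, _MISSING
        return value


def zip_checked(*peekables):
    """Like zip(), but yields only once every input has a value ready."""
    while peekables and all(p.has_next() for p in peekables):
        yield tuple(next(p) for p in peekables)


h, w, s, e = (Peekable(x) for x in ("Hello", "World", "Spam", "Eggs"))
pairs = list(zip_checked(h, s, e, w))
rest_h, rest_w = list(h), list(w)
print(rest_h, rest_w)  # ['o'] ['d']
```

The 'o' was still pulled out of the underlying string iterator, but it stays available through the wrapper. Anything holding the original iterator directly would still miss it.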
 
3) Nothing. There are bigger things to worry about ;)

WRT (2), I thought that perhaps __len__ was part of the iterator
protocol, but it's not (just __iter__ and __next__), hence:

>>> len(range(5, 40))
35
>>> len(iter(range(5, 40)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'range_iterator' has no len()
>>> len(iter("FooBar"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'str_iterator' has no len()

... though would that also be something to consider (I guess all
iterators would have to keep some state regarding the number of values
previously generated and then apply that offset to the result of len()
on the underlying object)? Perhaps that would just be too heavyweight
for what is a relatively minor wart.

How would you handle the length of an infinite iterator? Or one that *might* be infinite, depending on the current state of the program?
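
For what it's worth, the closest existing thing is operator.length_hint() (PEP 424, Python 3.4+), which is explicitly only an estimate. A quick sketch of how it behaves; the generator naturals() is just an example I made up:

```python
import operator

# range iterators report how many items remain, tracking consumption.
it = iter(range(5, 40))
hint_before = operator.length_hint(it)
next(it)
next(it)
hint_after = operator.length_hint(it)
print(hint_before, hint_after)  # 35 33


def naturals():
    """An infinite generator; it cannot know or report a length."""
    n = 0
    while True:
        yield n
        n += 1


# Generators provide no hint, so the default (0) comes back.
hint_infinite = operator.length_hint(naturals())
print(hint_infinite)  # 0
```

So a hint of 0 can mean "empty" or "no idea", which is not something zip() could rely on to decide whether to consume a value.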

A more realistic example: if I'm looking up N records from a distributed database, I might do that in parallel and get the results back unordered, as an iterator. If M of the queries time out, I might choose to ignore those records and exclude them from the resulting iterator. So, when I kick off the queries, the length of that iterator might be N. When the timeouts are finished, the length is N-M. Further, if I've consumed 2 records, is the length still N-M or N-M-2?
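
A toy version of that scenario, with the database and the timeouts simulated deterministically. Here fetch_records() and its timed_out parameter are invented for the sketch, and the results come back in key order only to keep the example reproducible:

```python
def fetch_records(keys, timed_out):
    """Hypothetical stand-in for parallel lookups.

    Yields (key, record) pairs, skipping any key whose query "timed out".
    """
    for key in keys:
        if key in timed_out:
            continue  # this query timed out, so its record is dropped
        yield key, "record-{}".format(key)


# N = 10 queries, M = 2 of them time out.
results = fetch_records(range(10), timed_out={3, 7})

# Consume 2 records, then count whatever is left.
first_two = [next(results), next(results)]
remaining = sum(1 for _ in results)
print(len(first_two), remaining)  # 2 6
```

The total of 8 is only known after the iterator has been run to exhaustion. Before that, N, N-M and N-M-2 are all defensible answers.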