[Python-ideas] zip() problem.

Fri Feb 12 18:51:34 EST 2016

BTW, from the documentation (
https://docs.python.org/3/library/functions.html#zip):

"zip() <https://docs.python.org/3/library/functions.html#zip> should only
be used with unequal length inputs when you don’t care about trailing,
unmatched values from the longer iterables. If those values are important,
use itertools.zip_longest()
<https://docs.python.org/3/library/itertools.html#itertools.zip_longest>
instead."
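
To make the difference concrete, here is a minimal sketch (the sample strings are my own, not from the docs) comparing the two calls on inputs of unequal length:

```python
from itertools import zip_longest

long_word = "Hello"
short_word = "Spam"

# zip() stops at the shortest input, so the trailing 'o' is dropped.
truncated = list(zip(long_word, short_word))

# zip_longest() keeps going and pads the shorter input with fillvalue.
padded = list(zip_longest(long_word, short_word, fillvalue=None))

print(truncated)
print(padded)
```

The last tuple of padded is ('o', None), which is the value zip() leaves out.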

On Fri, Feb 12, 2016 at 6:50 PM Michael Selik <mike at selik.org> wrote:

> On Fri, Feb 12, 2016 at 6:39 PM Erik <python at lucidity.plus.com> wrote:
>
>> Hi.
>>
>> In writing my previous email, I noticed something about zip() that I'd
>> not seen before (but is obvious, I guess) - when it reaches the shortest
>> sequence and terminates, any iterators already processed in that pass
>> will have generated one more value than the others. Those additional
>> values are discarded.
>>
>> For example:
>>
>> h = iter("Hello")
>> w = iter("World")
>> s = iter("Spam")
>> e = iter("Eggs")
>>
>> for i in zip(h, w, s, e):
>>    print(i)
>>
>> for i in (h, w, s, e):
>>    print(list(i))
>>
>> ---> All iterators are exhausted.
>>
>> h = iter("Hello")
>> w = iter("World")
>> s = iter("Spam")
>> e = iter("Eggs")
>>
>> for i in zip(h, s, e, w):
>>    print(i)
>>
>> for i in (h, w, s, e):
>>    print(list(i))
>>
>>
>> ---> "w" still has the trailing 'd' character.
>>
>>
>> So, if you're using zip() instead of itertools.zip_longest() then you have to
>> be careful of the order of your arguments and try to put the
>> probably-shortest one first if this would otherwise cause problems.
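
The order dependence can be checked directly. This is only a sketch: the helper name remaining_after_zip and the words dict are made up for illustration, and the behaviour shown is that of CPython's zip(), which pulls from its arguments left to right.

```python
words = {"h": "Hello", "w": "World", "s": "Spam", "e": "Eggs"}

def remaining_after_zip(order):
    """Zip fresh iterators in the given order; return what each one still holds."""
    iterators = {name: iter(words[name]) for name in order}
    for _ in zip(*(iterators[name] for name in order)):
        pass
    return {name: "".join(it) for name, it in iterators.items()}

# Longer iterators first: 'o' and 'd' are pulled and then thrown away.
longest_first = remaining_after_zip("hwse")

# Shortest iterators first: zip() stops before touching h and w again.
shortest_first = remaining_after_zip("sehw")

print(longest_first)
print(shortest_first)
```

With the shortest inputs first, h still yields 'o' and w still yields 'd' afterwards; with the longer ones first, nothing is left in any of them.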
>>
>>
>> The reason I'm posting to 'ideas' is: what should/could be done about it?
>>
>> 1) A simple warning in the docstring for zip()?
>>
>
> I wouldn't want to clutter the docstring, but a note in the long-form
> documentation could be useful.
>
>
>> 2) Something to prevent it (for example, a keyword argument to zip() that
>> switches on a mode in which each iterator is first asked whether it has
>> more items before any values are consumed)?
>>
>
> How can you query whether an iterator has another value without consuming
> that value?
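
With a plain iterator you can't, but a wrapper can hide the consumption by buffering the value it looked at. The following is a rough sketch under that assumption; Peekable, has_next and careful_zip are hypothetical names, not anything in the standard library, and the caller has to keep using the wrappers rather than the original iterators.

```python
class Peekable:
    """Hypothetical wrapper that looks one item ahead and buffers it."""

    _EMPTY = object()  # sentinel meaning "nothing buffered"

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._buffer = self._EMPTY

    def __iter__(self):
        return self

    def __next__(self):
        # Hand back the buffered item first, if there is one.
        if self._buffer is not self._EMPTY:
            value, self._buffer = self._buffer, self._EMPTY
            return value
        return next(self._it)

    def has_next(self):
        # Pull one item from the underlying iterator and keep it for later.
        if self._buffer is self._EMPTY:
            try:
                self._buffer = next(self._it)
            except StopIteration:
                return False
        return True


def careful_zip(*peekables):
    """Yield tuples only while every Peekable input has another item."""
    while all(p.has_next() for p in peekables):
        yield tuple(next(p) for p in peekables)


h, w, s = Peekable("Hello"), Peekable("World"), Peekable("Spam")
pairs = list(careful_zip(h, w, s))
left_h, left_w, left_s = list(h), list(w), list(s)
print(pairs)
print(left_h, left_w, left_s)
```

The underlying iterators are still advanced by one, so this only helps if everything downstream reads through the wrappers.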
>
>
>> 3) Nothing. There are bigger things to worry about ;)
>>
>> WRT (2), I thought that perhaps __len__ was part of the iterator
>> protocol, but it's not (just __iter__ and __next__), hence:
>>
>>  >>> len(range(5, 40))
>> 35
>>  >>> len(iter(range(5, 40)))
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>> TypeError: object of type 'range_iterator' has no len()
>>  >>> len(iter("FooBar"))
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>> TypeError: object of type 'str_iterator' has no len()
>>
>> ... though would that also be something to consider (I guess all
>> iterators would have to keep some state regarding the number of values
>> previously generated and then apply that offset to the result of len()
>> on the underlying object)? Perhaps that would just be too heavyweight
>> for what is a relatively minor wart.
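
For what it's worth, some built-in iterators already track this through the optional __length_hint__ method (PEP 424), which operator.length_hint() exposes. It is only an estimate and is not part of the iterator protocol, so treat this as a sketch of what is available, not a general answer:

```python
from operator import length_hint

it = iter(range(5, 40))
hint_before = length_hint(it)   # remaining items before consuming anything

next(it)
next(it)
hint_after = length_hint(it)    # the hint shrinks as items are consumed

# A generator provides no hint, so the supplied default comes back instead.
gen = (c for c in "FooBar")
hint_unknown = length_hint(gen, -1)

print(hint_before, hint_after, hint_unknown)
```

The generator case shows the limitation: an arbitrary iterator may have no idea how many items remain.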
>>
>
> How would you handle the length of an infinite iterator? Or one that
> *might* be infinite, depending on current state of the program?
>
> A more realistic example: if I'm looking up N records from a distributed
> database, I might do that in parallel and get the results back unordered,
> as an iterator. If M of the queries time out, I might choose to ignore those
> records and exclude them from the resulting iterator. So, when I kick off
> the queries, the length of that iterator might be N. When the timeouts are
> finished, the length is N-M. Further, if I've consumed 2 records, is the
> length still N-M or N-M-2?
>