[Python-ideas] Re: Warn when iterating over an already exhausted generator

June 13, 2023

...
...
If I wanted sorted numbers, then ValueError wouldn’t help, because I do not get sorted numbers.
I do want sorted numbers, but what can Python do in the face of broken code? There's a reason it raises errors for 1/0, str.invalid, and len(None). It's not "helpful" to the program, but it stops execution from continuing with a bad state.
What I meant was:
numbers = (i for i in range(10))
assert 3 in numbers
next(numbers)    # 4!

This wouldn’t raise an error with your fix, but still would be a bug.
...
Here's a small selection of the StackOverflow questions from people who encountered this exact issue:
...
https://stackoverflow.com/questions/25336726/why-cant-i-iterate-twice-over-t...
https://stackoverflow.com/questions/10255273/iterating-on-a-file-doesnt-work-the-second-time?noredirect=1&lq=1
https://stackoverflow.com/questions/3906137/why-cant-i-call-read-twice-on-an...
https://stackoverflow.com/questions/17777219/zip-variable-empty-after-first-...
https://stackoverflow.com/questions/42246819/loop-over-results-from-path-glo...
https://stackoverflow.com/questions/21715268/list-returned-by-map-function-d...
https://stackoverflow.com/questions/14637154/performing-len-on-list-of-a-zip...
https://stackoverflow.com/questions/44420135/filter-object-becomes-empty-aft...
This raises a question: how many people have made the same mistake more than once? I.e., are all of those just Google searches on a first encounter with an iterator, or is it genuinely difficult to remember never to use an iterator twice?

Regardless, Python has adopted iterators in many places so that memory consumption is reduced. This has helped me numerous times when I did not have enough memory to hold the full list. It comes at the expense of having to be aware of how iterators behave.

The full issue, as I see it, is “warn/error if `__next__` has been called at least once”.
The solution you are proposing covers only half of that issue: “warn/error if the iterator has been FULLY consumed”.
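
To make the two conditions concrete, here is a minimal sketch of a wrapper that tracks them separately. `TrackedIterator` and its `started`/`exhausted` attributes are hypothetical names invented for illustration; they are not part of Python or of the proposal under discussion.

```python
class TrackedIterator:
    """Hypothetical wrapper (not a Python API) that records iterator state."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.started = False    # True once __next__ has been called
        self.exhausted = False  # True once StopIteration has been raised

    def __iter__(self):
        return self

    def __next__(self):
        self.started = True
        try:
            return next(self._it)
        except StopIteration:
            self.exhausted = True
            raise


numbers = TrackedIterator(i for i in range(10))
assert 3 in numbers  # the membership test consumes 0, 1, 2 and 3
print(numbers.started, numbers.exhausted)  # True False
print(next(numbers))  # 4
```

A check based only on `exhausted` would stay silent here, because the iterator is partially consumed but not finished; a check based on `started` would flag it.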

So, in my opinion, a solution for preventing this bug should, given the situation, be more comprehensive and well thought out, so that:
1. It breaks a minimal amount of existing code.
2. It has enough positive impact for it to be worthwhile.

Your current proposal, as I see it, is lacking on both points. 

Some thoughts:
1. A general `dummy` global Python option that raises an error or issues a warning if the `__iter__` method is called a second time, so that those who want it can set it to true. But this would cause cross-library compatibility issues and would generally prevent adopting the coding style that iterators were intended for.
2. Using another paradigm, e.g. piping via functional programming. Something like:
```
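# Hypothetical API sketch: f.multiplex, PIPE and the chained filter/max/len
# calls do not exist in Python; this only illustrates the idea.
import re
error_regex = re.compile('^ERROR: ')

with open('logs.txt') as f:
    n_lines, longest_error, n_unique_errors = f.multiplex(
        len,
        filter(error_regex.match).max(key=len, default='').multiplex(
            PIPE,
            len().set()
        )
    )
print(f'{n_lines=}\n{longest_error=}\n{n_unique_errors=}')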
```
It actually would be nice to have a standard library module with a comprehensive list of functional programming tools. I think iterators are powerful, and joining them with a flexible multi-input/multi-output framework would bring them to the next level. Its hello world would be "pub/sub pattern in 5 minutes".

I went off the track a bit, but I am working on similar stuff now so couldn’t resist. :)

But it does seem that it’s either/or:
1. You work with iterators and get the benefits, but have to be aware of how they work.
2. You get used to pre-converting to lists and don’t care.

I think your proposal tries to mix the two, and I am not very positive about it. Maybe optional arguments for whether to return a list or an iterator… But I think the sensible default is an iterator anyway, so that doesn’t solve anything.

In general, I see your point. If it also prevented reuse after partial exhaustion, I might be more positive about it. As your current proposal stands, I would choose to leave things as they are.
...
On 14 Jun 2023, at 00:00, BoppreH <github@boppreh.com> wrote:
...
In close to 10 years of experience with python I have never encountered anything like this.
Note that questions usually get few votes, and "what's wrong with my code" questions are especially poorly received, so getting even a couple of votes is a strong signal. The questions above range from 10 to 124 (!) votes, and have a combined 250k+ views.
These are the people I'd like to help.
...
If you could give a full real-life scenario, then it might expose the problem (if it exists) better.
Open a log file, count the number of lines, then find both the longest and number of unique "error" entries. Implemented in the most obvious way I can, using builtin functions, it has *two* such bugs (reusing the exhausted "f" and "error_lines").
import re
error_regex = re.compile('^ERROR: ')
with open('logs.txt') as f:
    n_lines = len(list(f))
    error_lines = filter(error_regex.match, f)
    longest_error = max(error_lines, key=len, default='')
    n_unique_errors = len(set(error_lines))
print(f'{n_lines=}\n{longest_error=}\n{n_unique_errors=}')
Is it hard to fix? No, not at all, just store "list(f)" and replace "filter" with a longer list comprehension. Is it easy to spot? For an experienced developer, in this short example, with all the parts introduced together, yes. But having a natural solution silently give wrong answers is dangerous. At least having a warning would break the false sense of security.
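
For reference, the fix described above might look like the sketch below. To keep it self-contained, an in-memory `io.StringIO` with made-up sample lines stands in for `open('logs.txt')`; the `lines` variable name is also an assumption.

```python
import io
import re

error_regex = re.compile('^ERROR: ')

# Stand-in for open('logs.txt'); the sample contents are invented.
sample_log = io.StringIO(
    'INFO: started\n'
    'ERROR: disk full\n'
    'ERROR: disk full\n'
    'ERROR: connection reset by peer\n'
)

with sample_log as f:
    lines = list(f)  # read the file once; a list can be iterated repeatedly
    # A list comprehension instead of filter(), so the result is reusable too.
    error_lines = [line for line in lines if error_regex.match(line)]
    n_lines = len(lines)
    longest_error = max(error_lines, key=len, default='')
    n_unique_errors = len(set(error_lines))

print(f'{n_lines=}\n{longest_error=}\n{n_unique_errors=}')
```

The trade-off is memory: the whole file is held in `lines`, which is exactly the cost that iterators avoid.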
I understand that backwards compatibility will probably prevent us from raising a new error. But a warning could help a lot of people.
I'm tempted to patch the Python interpreter and test some popular packages, to verify if doing this on purpose is as rare as I think it is.
On Tue, Jun 13, 2023, at 6:50 PM, Dom Grigonis wrote:
...
In close to 10 years of experience with python I have never encountered anything like this.
If I need to use a list later I never do ANY assignments to it. Why would I?
In the last example I would:
```
strings = ['aa', '', 'bbb', 'c']
longest = max(filter(bool, strings), key=len)
n_unique = len(set(strings))
```
And in the initial example I don’t see why I would ever do this. It is very unclear what the scenario is here:
```???
numbers = (i for i in range(5))
assert 5 not in numbers
sorted(numbers)
```
1. If I wanted sorted numbers, then ValueError wouldn’t help, because I do not get sorted numbers.
2. If I wanted unmodified list and if it was modified then it is an error, your solution doesn’t work either.
3. If sorting is ok only on non-empty iterator, then just `assert sorted` after sorting.
If you could give a full real-life scenario, then it might expose the problem (if it exists) better.
"There should be one-- and preferably only one --obvious way to do it."
Either there is something to be improved, or you are not using that "one obvious" way.
...
On 13 Jun 2023, at 18:05, BoppreH via Python-ideas <python-ideas@python.org> wrote:
@ChrisA: Shadowing "iter()" would only help with Barry's example.
@Jonathan: Updating documentation is helpful, but I find an automated check better. Too often the most obvious way to accomplish something silently triggers this behavior:
strings = ['aa', '', 'bbb', 'c']
strings = filter(bool, strings) # Adding this step makes n_unique always 0.
longest = max(strings, key=len)
n_unique = len(set(strings))
I feel like a warning here would save time and prevent bugs, and that my is_exhausted proposal, if implemented directly in the generators, is an easy way to accomplish this.
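
As a rough illustration of the kind of check meant here, the sketch below wraps an iterator and warns when iteration starts again after exhaustion. `WarnOnReuse` is a hypothetical user-level wrapper and not the proposed built-in behaviour; the choice of `RuntimeWarning` and the message text are assumptions.

```python
import warnings


class WarnOnReuse:
    """Hypothetical wrapper; not the proposed is_exhausted API itself."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.is_exhausted = False

    def __iter__(self):
        # Warn when a new pass starts after the iterator already ran dry.
        if self.is_exhausted:
            warnings.warn('iterating over an already exhausted iterator',
                          RuntimeWarning, stacklevel=2)
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StopIteration:
            self.is_exhausted = True
            raise


strings = ['aa', '', 'bbb', 'c']
strings = WarnOnReuse(filter(bool, strings))
longest = max(strings, key=len)  # first pass consumes everything

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    n_unique = len(set(strings))  # second pass: empty, but now it warns

print(longest, n_unique, len(caught))
```

The result is still wrong (`n_unique` is 0), but the mistake is no longer silent.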
And I have to say I'm surprised by the responses. Does nobody else hit bugs like this and wish they were automatically detected? To be clear, raising ValueError is just an example; logging a warning would already be helpful, like Go's race condition detector.
--
BoppreH
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-leave@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/KWBFRI...
Code of Conduct: http://python.org/psf/codeofconduct/