[Python-ideas] Buffering iterators?

Franklin Lee leewangzhong+python at gmail.com
Tue Dec 15 05:04:17 EST 2015


(This would be a lot of work that I wouldn't know how to do myself,
but is it worth thinking about? Maybe it's already been done at the
necessary level.

Also, fair warning: this is a proposal for the sake of theoretical
optimization, and it would require a lot of work across a lot of
code.

As I see it, even if it's possible and desirable to do this, it
would take years of work and testing before it paid off.)

The move from Python 2 (disclaimer: I barely touched Python 2, so I
have little sentimental attachment) to Python 3 resulted in many
functions returning iterators instead of lists. This saves a lot of
unnecessary memory when iterating over, say, the indices of a large
list, especially if we break out of the loop partway through.

I'm wondering, though, about a step backwards: generating values
before they're needed. The idea is based on buffered file reading and
memory prefetching
(https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#DDR_SDRAM_prefetch_architecture).
In fact, I'm hoping to take advantage of such things.

For example, in `sum(lst[i] * i for i in range(10000))`, `sum` knows
it will exhaust the iterator, so it can ask the generator to return
buffers of values and then read the elements out of each buffer
internally.

It would be the iterator's responsibility to decide whether to honor
the request and what buffer size to use. It would be the consumer's
responsibility to ask, and consumers should only ask if they expect
to consume many elements at a time.

The idea is that, especially for complex nested iterators, instead
of running A B C A B C A B C... (where each letter is the code that
generates the next item from the previous stage), the interpreter
runs A A A A A..., then B B B B B..., then C C C..., which could
mean much better memory locality in both instructions and objects.
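
As a toy illustration, reusing the `chunks` helper sketched above
(`a`, `b`, `c`, and the batch size are placeholders):

    def a(x): return x + 1   # placeholder stage A
    def b(x): return x * 2   # placeholder stage B
    def c(x): return x - 3   # placeholder stage C
    data = range(100000)

    # Item-at-a-time pipeline: runs A B C A B C A B C ...
    out = [c(b(a(x))) for x in data]

    # Buffered pipeline: runs A A A..., then B B B..., then C C C...,
    # one batch at a time, keeping each stage's code and data hot.
    out = []
    for batch in chunks(data, 1024):
        stage = [a(x) for x in batch]
        stage = [b(x) for x in stage]
        out.extend(c(x) for x in stage)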

There's the possibility that a function has side effects, so
buffering would give different semantics from normal iteration.
There's also the possibility that computing the next element is
complex enough that buffering wouldn't help. If the iterator can't
tell, it should just not buffer.

Here's an obnoxious example of where you can't tell:

    def f(i):
        return i

    s = 0
    for i in (f(x) for x in range(100)):
        s += f(i)
        def f(x):  # rebinds the global `f` on every pass
            return x + s

The generator expression looks up the name `f` anew for each
element, so if it buffered ahead, later elements would be computed
with whichever `f` (and `s`) happened to be current at buffering
time rather than at consumption time.

In fact, buffering across any Python function call is probably
unsafe, including operators (whose implementations can legally be
replaced during the iteration). Well, `map` and `filter` are
possible exceptions in special cases, because they bind their
function object once, at the time of the call.
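
For instance, `map` holds on to the function object it was given,
so rebinding the name mid-stream doesn't affect it:

    def f(x):
        return x

    m = map(f, range(5))

    def f(x):            # rebind the global name `f`
        return x + 100

    print(list(m))       # [0, 1, 2, 3, 4] -- m still calls the old f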

It's usually safe if you're just using reversed, enumerate, builtin
operators, dict views, etc. on an existing data structure, as long
as your iteration doesn't modify the entries, unlike so:

    for i, x in enumerate(reversed(lst)):
        lst[i] = x  # overwrites entries reversed() hasn't read yet

Once `i` passes the midpoint, `reversed(lst)` starts yielding values
the loop already overwrote; a buffered `reversed` would have yielded
the original values instead, silently changing the result.

But I'm looking toward the future, where it might be possible for the
interpreter to analyze loops and functions before making such
decisions. Then again, if the interpreter is that smart, it can figure
out where and when to buffer without adding to the API of iterators.

Anyway, there's the idea, how it might be helpful, and how it might
not be.

