[Python-ideas] Caching iterators

Steven D'Aprano steve at pearwood.info
Wed Feb 26 01:01:09 CET 2014


On Tue, Feb 25, 2014 at 05:08:44PM -0600, Ryan Gonzalez wrote:
> Note:
> 
> I PROMISE this is a better idea than my last 20 bad ones (raise_if, shlex
> extra argument, etc.)

Don't make promises you can't keep. (Sorry, that was a cheap shot, but 
you practically begged for somebody to make it :-)


> I'll use an example to illustrate the idea first. Let's use a completely
> non-realistic and contrived example. Say you have a lexer that's really
> slow. Now, this lexer might be a function that uses generators, i.e.:
[...]
> Now, if we have a parser that uses that lexer, it always has to wait for
> the lexer to yield the next token *after* it already parsed the current
> token. That can be somewhat time consuming.

This is the nature of serial processing. Normally responsibility for 
moving to parallel processing (whether threads or processes or something 
else) is handled by the user, not the standard library. Very little in 
the std lib is calculated in parallel.


> Caching iterators are based on the idea: what if the iterator is running at
> the same time as the function getting the iterator elements? Or, better
> yet, it's an iterator wrapper that takes an iterator and continues to take
> its elements while the function that uses the iterator is running? This is
> easily accomplished using multiprocessing and pipes.
> 
> Since that was somewhat vague, here's an example:
> 
> import time
> 
> def my_iterator():
>     for i in range(0, 5):
>         time.sleep(0.2)
>         yield i
> 
> for item in my_iterator():
>     time.sleep(0.5)
>     print(item)

So you have a delay in calculating each value in the iterator, and a 
longer delay using each value. Performed sequentially, each item costs 
0.2 + 0.5 = 0.7 seconds; performed perfectly in parallel, only 
max(0.2, 0.5) = 0.5 seconds.
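
To check the sequential figure, here is a quick timing sketch of the 
example above (exact numbers will vary slightly with scheduling 
overhead):

import time

def my_iterator():
    for i in range(0, 5):
        time.sleep(0.2)
        yield i

start = time.perf_counter()
for item in my_iterator():
    time.sleep(0.5)
    print(item)

# total divided by number of items: roughly 0.7 seconds per item
print((time.perf_counter() - start) / 5)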


> Now with a normal iterator, the flow is like this:
> 
>    - Wait for my_iterator to return an element (0.2s)
>    - Wait 0.5s and print the element (0.5s)
> 
> In total, that takes 0.7s per element. What a waste! What if the iterator
> was yielding elements at the same time as the for loop was using them?
> Well, for every for loop iteration, the iterator could generate ~2.2
> elements. That's what a caching iterator does. It runs both at the same
> time using multiprocessing.

You could use threads if the work is I/O bound rather than CPU 
bound. Since threads are less expensive than processes, surely that 
choice should be up to the user?

In either case, one disadvantage of a cache is that values are no longer 
being calculated lazily (i.e. on request). Because the iterator runs 
faster than the consumer, rather than working in lock-step you end up 
generating more values than you need: each iteration sees the iterator 
generate 2.2 elements and the consumer use 1. By the time the consumer 
has used ten thousand values, the iterator has generated twenty-two 
thousand values, and twelve thousand remain in the cache.

While the idea is tempting, the caller needs to be aware that such a 
caching system:

(1) uses potentially unbounded amounts of memory;

(2) is potentially harmful if calculating the values has side-effects;

(3) can lead to "lost" data if the caller accesses the underlying 
iterator without going through the cache; and

(4) is wasteful if the consumer stops early and never uses all the 
values. (CPU cycles are cheap, but they aren't free.)


None of these invalidate the basic idea, but they do limit the 
applicability of it.

[...]
> Now the flow is like this:
> 
>    - Wait for my_iterator to return the very first element.
>    - While that first element is looped over, continue receiving elements
>    from my_iterator(), storing them in an intermediate space (similar to a
>    deque).

The data structure you want is a Queue. The std lib has a thread-safe 
Queue implementation, in the queue module.
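
For instance, here is a minimal sketch of such a wrapper, using a 
background thread and a Queue (the name "cached", the sentinel object 
and the maxsize parameter are my own illustration, not taken from your 
implementation; a positive maxsize would also bound the cache, which 
answers point (1) above):

import queue
import threading

_SENTINEL = object()  # marks exhaustion of the underlying iterator

def cached(iterator, maxsize=0):
    # A background thread pulls values from the iterator and parks
    # them in a thread-safe queue while the consumer works on earlier
    # ones. maxsize=0 means an unbounded cache; a positive maxsize
    # makes the producer block whenever the cache is full.
    q = queue.Queue(maxsize)

    def producer():
        for item in iterator:
            q.put(item)
        q.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

With that, your example becomes "for item in cached(my_iterator())", 
and the 0.2s production sleeps overlap with the 0.5s consumption 
sleeps, so the cost per item approaches 0.5s instead of 0.7s.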


> I have a working implementation. Although there is a very slight overhead,
> in the above example, about 0.4s is still saved.

I think you should publish this implementation as a recipe on the 
ActiveState website, and see what feedback you get there. Once it is 
proven to be useful in practice, rather than just theoretically useful, 
then it could be considered for the std lib.

http://code.activestate.com/recipes/


-- 
Steven

