[Python-ideas] Caching iterators
Brendan Moloney
moloney at ohsu.edu
Wed Feb 26 00:19:06 CET 2014
I believe the new asyncio package in 3.4 solves this problem (and more!).
Brendan
________________________________
From: Python-ideas [python-ideas-bounces+moloney=ohsu.edu at python.org] on behalf of Ryan Gonzalez [rymg19 at gmail.com]
Sent: Tuesday, February 25, 2014 3:08 PM
To: python-ideas
Subject: [Python-ideas] Caching iterators
Note:
I PROMISE this is a better idea than my last 20 bad ones (raise_if, shlex extra argument, etc.).
I’ll illustrate the idea with a completely non-realistic and contrived example. Say you have a lexer that’s really slow. This lexer might be a generator function, i.e.:
def mylexer(input):
    while input:
        ...
        if xyz:
            yield SomeToken()
Now, if we have a parser that uses that lexer, it always has to wait for the lexer to yield the next token after it has already parsed the current one. That can be somewhat time-consuming.
Caching iterators are based on this idea: what if the iterator were running at the same time as the function consuming its elements? Or, better yet, what if there were an iterator wrapper that takes an iterator and keeps pulling its elements while the function that uses the iterator is still running? This is easily accomplished using multiprocessing and pipes.
Since that was somewhat vague, here’s an example:
import time

def my_iterator():
    for i in range(0, 5):
        time.sleep(0.2)
        yield i

for item in my_iterator():
    time.sleep(0.5)
    print(item)
Now with a normal iterator, the flow is like this:
* Wait for my_iterator to return an element (0.2s)
* Wait 0.5s and print the element (0.5s)
In total, that takes 0.7s per element. What a waste! What if the iterator were yielding elements at the same time as the for loop was using them? Well, during every for loop iteration, the iterator could generate ~2.5 elements. That’s what a caching iterator does: it runs both at the same time using multiprocessing. It’s thread-safe as long as the iterator doesn’t depend on whatever is using it. An example:
def my_iterator():
    for i in range(0, 5):
        time.sleep(0.2)
        yield i

for item in itertools.CachingIterator(my_iterator()):  # this is the only change
    time.sleep(0.5)
    print(item)
Now the flow is like this:
* Wait for my_iterator to return the very first element.
* While that first element is looped over, continue receiving elements from my_iterator(), storing them in an intermediate space (similar to a deque).
* When the loop body is done with the current element, take the next element from the intermediate space and loop over it.
* While that element is looped over, continue receiving elements…
…and so forth. That way, time isn’t wasted waiting for the loop to finish.
I have a working implementation. Although there is a very slight overhead, about 0.4s is still saved in the above example.
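To make the idea concrete, here is a rough sketch of the approach. The real implementation would use multiprocessing and pipes as described above; this thread-and-queue version is just for illustration, and the class name and sentinel detail are only assumptions:

import queue
import threading

_SENTINEL = object()  # marks the end of the wrapped iterator

class CachingIterator:
    """Pull elements from the wrapped iterator in a background thread and
    buffer them, so the consumer only ever waits for the very first element."""

    def __init__(self, iterable):
        self._queue = queue.Queue()
        self._thread = threading.Thread(
            target=self._fill, args=(iter(iterable),), daemon=True)
        self._thread.start()

    def _fill(self, it):
        for item in it:
            self._queue.put(item)
        self._queue.put(_SENTINEL)  # tell the consumer we're done

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()
        if item is _SENTINEL:
            raise StopIteration
        return item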
There could also be an lmap function, which just does this:
def lmap(f, it):
    yield from map(f, CachingIterator(it))
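which could then be used like this (hypothetical usage, assuming the sketch above and the earlier my_iterator example):

for result in lmap(lambda x: x * x, my_iterator()):
    time.sleep(0.5)  # consumer work overlaps with background production
    print(result)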
Thoughts?
--
Ryan
If anybody ever asks me why I prefer C++ to C, my answer will be simple: "It's becauseslejfp23(@#Q*(E*EIdc-SEGFAULT. Wait, I don't think that was nul-terminated."