[Python-Dev] Re: Reiterability

Guido van Rossum guido at python.org
Sun Oct 19 20:40:37 EDT 2003


> > > I have an iterator it whose items, after an arbitrary prefix
> > > terminated by the first empty item, are supposed to be each
> > > 'yes' or 'no'.
> >
> > This is a made-up toy example, right?  Does it correspond with
> > something you've had to do in real life?
> 
> Yes, but I signed an NDA, and thus made irrelevant changes
> sufficient to completely mask the application area &c (how the
> prefix's end is found, how the rest of the stream is analyzed to
> determine how to process it).

OK, but that does make it harder to judge its value for making the
case for iterator cloning, because you're not saying anything about
the (range of) characteristics of the input iterator.

> > But I'm not sure that abstracting this away all the way to an iterator
> 
> Perhaps I over-abstracted it, but I just love abstracting streams as
> iterators whenever I can get away with it -- I love the clean,
> reusable program structure I often get that way, I love the reusable
> functions it promotes.

But when you add more behavior to the iterator protocol, a lot of the
cleanliness goes away; any simple transformation of an iterator using
a generator function loses all the optional functionality.
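Concretely (a toy sketch in today's Python syntax; the __clone__
method here is the hypothetical protocol under discussion, not
anything that exists):

    class CloneableCount:
        # Toy iterator carrying the hypothetical __clone__ method.
        def __init__(self, n=0):
            self.n = n
        def __iter__(self):
            return self
        def __next__(self):
            self.n += 1
            return self.n
        def __clone__(self):
            return CloneableCount(self.n)

    def evens_only(it):
        # An innocent transformation written as a generator function.
        for x in it:
            if x % 2 == 0:
                yield x

    src = CloneableCount()
    out = evens_only(src)
    print(hasattr(src, '__clone__'))   # True
    print(hasattr(out, '__clone__'))   # False: the wrapper is a plain generator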

> I guess I'll just build my iterators by suitable factory functions
> (including "optimized tee-ability" when feasible), tweak Raymond's
> "tee" to use "optimized tee-ability" when supplied, and tell my
> clients to build the iterators with my factories if they need
> memory-optimal tee-ing.  As long as I can't share that code more
> widely, having to use e.g. richiters.iter instead of the built-in
> iter isn't too bad, anyway.

But you can't get the for-loop to use richiters.iter (you'd have to
add an explicit call to it).  And you can't use any third party or
standard library code for manipulating iterators; you'd have to write
your own clone of itertools.
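(For the record, the tweak you have in mind would presumably look
something like the sketch below -- the __tee__ hook name is invented
here, and itertools.tee is the present-day fallback -- but the point
stands: nothing else in the language or the library would ever call
it.)

    import itertools

    def tee(iterable, n=2):
        # Sketch of a tee that honors an optional "optimized
        # tee-ability" hook (the __tee__ name is made up for
        # illustration) and otherwise falls back to the ordinary
        # in-memory buffering done by itertools.tee.
        it = iter(iterable)
        hook = getattr(it, '__tee__', None)
        if hook is not None:
            return hook(n)
        return itertools.tee(it, n)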

> > makes sense.  For one, the generic approach to cloning if the
> > iterator doesn't have __clone__ would be to make a memory copy,
> > but in this app a disk copy is desirable (I can invent something
> > that overflows to
> 
> An iterator that knows it's coming from disk or pipe can provide
> that disk copy (or reuse the existing file) as part of its
> "optimized tee-ability".

At considerable cost.

> > offset), or each clone must keep a file offset, but now you lose
> > the performance effect of a streaming buffer unless you code up
> > something extremely hairy with locks etc.
> 
> ??? when one clone iterates to the end, on a read-only disk file,
> its seeks (which are always to the current offset) don't remove
> the benefits of read-ahead done on its behalf by the OS.  Maybe
> you mean something else by "lose the performance effect"?

I wasn't thinking about the OS read-ahead; I was thinking of stdio
buffering, and the additional buffering done by file.next().  (See
readahead_get_line_skip() in fileobject.c.)  This buffering has made
"for line in file" in 2.3 faster than any previously available way of
iterating over the lines of a file.  Also, on many systems, every
call to fseek() drops the stdio buffer, even if the seek position is
not actually changed by the call.  It could be done, but would
require incredibly hairy code.
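Roughly, the straightforward per-clone bookkeeping would look like
this (a sketch only, with invented names): each clone remembers its
own offset and seeks before every read, so even back-to-back reads
from the same clone pay the seek penalty.

    class FileClone:
        # Naive sketch: every read is preceded by a seek to the
        # clone's own offset, which on many systems flushes the
        # stdio/readahead buffer even when the position doesn't
        # actually change.
        def __init__(self, f, offset=0):
            self.f = f
            self.offset = offset
        def __iter__(self):
            return self
        def __next__(self):
            self.f.seek(self.offset)
            line = self.f.readline()
            if not line:
                raise StopIteration
            self.offset = self.f.tell()
            return line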

> As for locks, why?  An iterator in general is not thread-safe: if
> two threads iterate on the same iterator, without providing their
> own locking, boom.  So why should clones imply stricter
> thread-safety?

I believe I was thinking of something else; the various iterators
iterating over the same file would somehow have to communicate to each
other who used the file last, so that repeated next() calls on the
same iterator could know they wouldn't have to call seek() and hence
lose the readahead buffer.  This doesn't require locking in the thread
sense, but feels similar.
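Continuing the made-up names from the sketch above, that coordination
might look like this: the clones share a record of who read last, and
only a clone that finds someone else has read in between pays for a
seek.

    class SharedFile:
        # Shared state: the underlying file plus a note of who read last.
        def __init__(self, f):
            self.f = f
            self.last_reader = None

    class CoordinatedClone:
        def __init__(self, shared, offset=0):
            self.shared = shared
            self.offset = offset
        def __iter__(self):
            return self
        def __next__(self):
            if self.shared.last_reader is not self:
                # Another clone moved the file position; seek (and most
                # likely lose whatever readahead buffer had built up).
                self.shared.f.seek(self.offset)
                self.shared.last_reader = self
            line = self.shared.f.readline()
            if not line:
                raise StopIteration
            self.offset = self.shared.f.tell()
            return line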

--Guido van Rossum (home page: http://www.python.org/~guido/)


