[Python-Dev] Re: Reiterability

Alex Martelli aleaxit at yahoo.com
Mon Oct 20 03:40:43 EDT 2003


On Monday 20 October 2003 02:40 am, Guido van Rossum wrote:
   ...
> > Perhaps I over-abstracted it, but I just love abstracting streams as
> > iterators whenever I can get away with it -- I love the clean,
> > reusable program structure I often get that way, I love the reusable
> > functions it promotes.
>
> But when you add more behavior to the iterator protocol, a lot of the
> cleanliness goes away; any simple transformation of an iterator using
> a generator function loses all the optional functionality.

It loses only the optimization on clonability, as far as I can see;
i.e., cloning becomes potentially memory-expensive if what I'm cloning
(tee-ing, whatever) can't give me an optimized way.  I can still code
the higher levels based on clean "tee-able streams", and possibly
optimize some iterator factories later if profiling shows they're
needed (yet another case where one dreams of a way of profiling
MEMORY use, oh well :-).

BTW, playing around with some of this, it seems to me that the
inability to just copy.copy (or copy.deepcopy) anything produced
by iter(sequence) is more of a bother -- quite apart from clonability
(a similar but separate concept), couldn't those iterators be
copyable anyway?  I.e., just expose the underlying sequence and
index as their state, for getting and setting?  Otherwise, to get
copyable iterators, I have to reimplement iter "by hand":

class Iter(object):
    # Equivalent to iter(seq), but with the iterator's state (the
    # sequence and the current index) exposed as plain attributes,
    # so copy.copy and copy.deepcopy just work.
    def __init__(self, seq):
        self.seq = seq
        self.idx = 0
    def __iter__(self): return self
    def next(self):
        try: result = self.seq[self.idx]
        except IndexError: raise StopIteration
        self.idx += 1
        return result

and I don't understand the added value of requiring the user to
code this no-added-value, slow-things-down boilerplate.
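
For example, with the state exposed like that, copying just works
(a quick sketch in 2.3-vintage syntax; the sequence is of course
arbitrary):

import copy

it = Iter('abcd')
it.next()                      # consume 'a'
it2 = copy.copy(it)            # shallow copy: shares seq, copies idx
print it.next(), it2.next()    # prints 'b b': they advance independently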


> > I guess I'll just build my iterators by suitable factory functions
> > (including "optimized tee-ability" when feasible), tweak Raymond's
> > "tee" to use "optimized tee-ability" when supplied, and tell my
> > clients to build the iterators with my factories if they need
> > memory-optimal tee-ing.  As long as I can't share that code more
> > widely, having to use e.g. richiters.iter instead of the built-in
> > iter isn't too bad, anyway.
>
> But you can't get the for-loop to use richiters.iter (you'd have to
> add an explicit call to it).  And you can't use any third party or

No problem, as the iterator built by the for loop is not exposed
in a way that would ever let me try to tee it anyway.

> standard library code for manipulating iterators; you'd have to write
> your own clone of itertools.

Only for those itertools functions that may preserve "cheap
tee-ability", yes.
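
Just to make that concrete, the tweaked tee I have in mind would be
roughly the following sketch -- the __copy__ hook as the marker for
"optimized tee-ability" is purely my own convention here, and the
fallback is a buffering scheme along the lines of Raymond's tee:

import copy

def smart_tee(it, n=2):
    # Cheap path: the iterator advertises optimized tee-ability
    # (assumed convention: a __copy__ method), so just clone it.
    if hasattr(it, '__copy__'):
        clones = [it]
        for i in range(n - 1):
            clones.append(copy.copy(it))
        return tuple(clones)
    # Fallback: buffer pending items, one queue per resulting
    # iterator (potentially memory-expensive, as discussed above).
    queues = [[] for i in range(n)]
    def feeder(q):
        while True:
            if not q:
                item = it.next()   # StopIteration ends each feeder
                for each in queues:
                    each.append(item)
            yield q.pop(0)
    return tuple([feeder(q) for q in queues])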


> > > makes sense.  For one, the generic approach to cloning if the
> > > iterator doesn't have __clone__ would be to make a memory copy,
> > > but in this app a disk copy is desirable (I can invent something
> > > that overflows to
> >
> > An iterator that knows it's coming from disk or pipe can provide
> > that disk copy (or reuse the existing file) as part of its
> > "optimized tee-ability".
>
> At considerable cost.

I'm not sure I see that cost, yet.


> > > offset), or each clone must keep a file offset, but now you lose
> > > the performance effect of a streaming buffer unless you code up
> > > something extremely hairy with locks etc.
> >
> > ??? when one clone iterates to the end, on a read-only disk file,
> > its seeks (which happen always to be to the current offset) don't
> > remove the benefits of read-ahead done on its behalf by the OS.
> > Maybe you mean something else by "lose the performance effect"?
>
> I wasn't thinking about the OS read-ahead, I was thinking of stdio
> buffering, and the additional buffering done by file.next().  (See
> readahead_get_line_skip() in fileobject.c.)  This buffering has made
> "for line in file" in 2.3 faster than any way of iterating over the

Ah, if you're iterating by LINE, yes.  I was iterating by fixed-size
blocks on binary files in my tests, so I didn't see that effect.
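
(The fixed-size-block iteration I was timing was something like this
little sketch -- the exact block size is immaterial:)

def blocks(flob, size=8192):
    # Yield successive fixed-size chunks of a binary file; the
    # last chunk may be shorter.
    while True:
        data = flob.read(size)
        if not data:
            break
        yield data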

> lines of a file previously available.  Also, on many systems, every
> call to fseek() drops the stdio buffer, even if the seek position is
> not actually changed by the call.  It could be done, but would require
> incredibly hairy code.

The call to fseek probably SHOULD drop the buffer in a typical
C implementation _on a R/W file_, because it's used as the way
to signal the file that you're moving from reading to writing or
vice versa.  That's what the C standard says: you need a seek
between an input op and an immediately following output op, or
vice versa -- even a seek to the current point -- else the behavior
is undefined.  (Which reminds me, I don't know whether the _Python_
wrapper maintains that "clever" requirement for ITS R/W files, but
I think it does.)  I can well believe that, for simplicity, a
C-library implementor would then drop the buffer on a R/O file too,
needlessly but understandably.
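
In Python terms, the idiom the standard mandates looks like this
(file name and data purely illustrative):

f = open('data.txt', 'r+')   # some existing read/write file
head = f.read(4)             # an input operation...
f.seek(f.tell())             # ...needs a seek (even a no-op one)...
f.write('XYZW')              # ...before an output op is well-defined
f.close()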

So, hmmm, wouldn't it suffice to guard the seek call with a
condition that the current point in the file isn't already what we want...?
[testing, testing...] nope, even just the guard slows things down
a LOT.  Hmmm, I think .tell IS implemented by a "dummy" .seek,
isn't it?  So, yes, quite some hairiness (credible or not;-) would be
needed to make an iterated-by-lines file good for optimized
tee-ability.
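
(The guarded version I timed was essentially this reconstruction --
note the .tell() per line, which is exactly the cost that kills it:)

class GuardedLines(object):
    # Seek only when somebody else has moved the shared file's
    # position; the guard itself costs one tell() per line.
    def __init__(self, f):
        self.f = f
        self.wp = f.tell()
    def __iter__(self): return self
    def next(self):
        if self.f.tell() != self.wp:   # the guard
            self.f.seek(self.wp)
        line = self.f.readline()
        if not line:
            raise StopIteration
        self.wp += len(line)
        return line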


> > As for locks, why?  An iterator in general is not thread-safe: if
> > two threads iterate on the same iterator, without providing their
> > own locking, boom.  So why should clones imply stricter
> > thread-safety?
>
> I believe I was thinking of something else; the various iterators
> iterating over the same file would somehow have to communicate to each
> other who used the file last, so that repeated next() calls on the
> same iterator could know they wouldn't have to call seek() and hence
> lose the readahead buffer.  This doesn't require locking in the thread
> sense, but feels similar.

Interesting intuition.  The "who used this last" code doesn't feel
similar to a lock, to me: i.e., just transforming a plain iterator

class Lines1(object):
    # A plain pass-through iterator over a file's lines, as a
    # baseline for timing comparisons.
    def __init__(self, f):
        self.f = f
    def __iter__(self): return self
    def next(self):
        line = self.f.next()
        return line

into a somewhat more complicated one:

class Lines(object):
    # Several instances can share one file object: a class-level
    # registry records who used each file last, so a seek is issued
    # only when some other iterator has moved the file's position.
    wholast = {}
    def __init__(self, f):
        self.f = f
        self.wp = f.tell()
    def __iter__(self): return self
    def next(self):
        if self.wholast.get(self.f) is not self:
            self.f.seek(self.wp)        # resync to our own position
            self.wholast[self.f] = self
        line = self.f.next()
        self.wp += len(line)            # assumes no newline translation
        return line

(assuming the seek "resyncs" the readahead buffer).  However, a loop
using Lines (over /usr/share/dict/words), though twice as fast as my
previous attempt using tell each time, is over twice as slow as one
with Lines1, which in turn is 3 times as slow as one with a tiny
generator:

def Lines2(flob):
    for line in flob: yield line
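
(The timings come from loops of roughly this shape -- a
reconstruction, not the exact harness I used:)

import time

def time_loop(make_iter, f):
    # Time one full pass over the file through the given wrapper.
    f.seek(0)
    t0 = time.time()
    for line in make_iter(f):
        pass
    return time.time() - t0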

The deuced "for line in flob:" is so deucedly optimized that trying
to compete with it, even with something as apparently trivial as
Lines1, is a lost cause ;-).  OK, then I guess an iterator by lines
on a text file can't easily be optimized for tee-ability by these
"share the file object" strategies; rather, the best way to tee such
a disk file would seem to be:

def tee_diskfile(f):
    # Open a second, independent file object on the same file and
    # position it where f currently stands; each object then keeps
    # its own offset and stdio buffer (fine for read-only use).
    result = file(f.name, f.mode)
    result.seek(f.tell())
    return f, result
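
A quick usage sketch (file name illustrative):

f = file('/usr/share/dict/words')
f.readline()                   # consume the first line
a, b = tee_diskfile(f)
print a.next() == b.next()     # True: each side sees the second line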


Alex



