need help on need help on generator...

Sat Jan 22 04:46:15 EST 2005

On Sat, 2005-01-22 at 10:10 +0100, Alex Martelli wrote:

> The answer for the current implementation, BTW, is "in between" -- some
> buffering, but bounded consumption of memory -- but whether that tidbit
> of pragmatics is part of the file specs, heh, that's anything but clear
> (just as for other important tidbits of Python pragmatics, such as the
> facts that list.sort is wickedly fast, 'x in alist' isn't, 'x in adict'
> IS...).

A particularly great example when it comes to unexpected buffering
effects is the file iterator. Take code that reads a header from a file
using an (implicit) iterator, then tries to read() the rest of the file.
Taking the example of reading an RFC822-like message into a list of
headers and a body blob:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> for line in infile:
....    if not line.strip():
....            break
....    headers.append(tuple(line.split(':',1)))
.>>> body = infile.read()

(By the way, if you ever implement this yourself for real, you should
probably be hurt - use the 'email' or 'rfc822' modules instead. For one
thing, reinventing the wheel is rarely a good idea. For another, the
above code is horrid - in particular it doesn't handle malformed headers
at all, isn't big on readability/comments, etc.)

If you run the above code on a saved email message, you'd expect 'body'
to contain the body of the message, right? Nope. The iterator created
from the file when you use it in that for loop does internal read-ahead
for efficiency, and has already read in the entire file or at least a
chunk more of it than you've read out of the iterator. It doesn't
attempt to hide this from the programmer, so the file position marker is
further into the file (possibly at the end on a smaller file) than you'd
expect given the data you've actually read in your program.

I'd be interested to know if there's a better solution to this than:

.>>> inpath = '/tmp/msg.eml'
.>>> infile = open(inpath)
.>>> initer = iter(infile)
.>>> headers = []
.>>> for line in initer:
....     if not line.strip():
....             break
....     headers.append(tuple(line.split(':',1)))
.>>> data = ''.join(x for x in initer)

because that seems like a pretty ugly hack (and please ignore the
variable names). Perhaps a way to get the file to seek back to the point
last read from the iterator when the iterator is destroyed?

-- 
Craig Ringer