[Python-3000] Comment on iostack library

tomer filiba tomerfiliba at gmail.com
Thu Aug 31 23:43:44 CEST 2006

i haven't been online for the last couple of days, so i'll unify
my replies into one post.

> Right now, a typical
> file handle consists of 3 "layers" - one representing the backing store
> (file, memory, network, etc.), one for adding buffering, and one
> representing the program-level API for reading strings, bytes, decoded
> text, etc.

yes, and it's also good you noted *typical*. the design allows a
virtually unlimited number of such layers, stacked one after the
other, giving you a very fine level of control without having to
write a single line of "procedural" or tailored code.
you just mix in what you want.

> I wonder if it wouldn't be better to cut that down to two. Specifically,
> I would like to suggest eliminating the buffering layer.
> My reasoning is fairly straightforward: Most file system handles,
> network handles and other operating system handles already support
> buffering, and they do a far better job of it than we can.

indeed, but as guido said (and i believe my wiki page says so as
well), stdio cannot be trusted, let alone the way different
OSes implement things. buffering, for one, is a horrible issue.
i remember an old C program i wrote that worked fine on
windows but not on linux, because i didn't print a newline and
stdout was line-buffered... i couldn't see the output, and it was
a nightmare to debug.

> Well, as far as readline goes: In order to split the text into lines,
> you have to decode the text first anyway, which is a layer 3 operation.
> You can't just read bytes until you get a \n, because the file you are
> reading might be encoded in UCS2 or something.

well, the LineBufferingLayer can be "configured" to split on any
"marker", e.g. LineBufferingLayer(stream, marker = "\x00\x0a"),
and of course layer 3, which creates layer 2, can set this marker
to any byte sequence. note that it's a *byte* sequence, not chars,
since it is passed down to layer 1 transparently.


delimiters = {"utf8" : "\x0a", "utf16" : "\x00\x0a"}

def textfile(filename, mode, encoding = "utf8"):
    f = FileStream(filename, mode)
    f = LineBufferingLayer(f, delimiters[encoding])
    f = TextInterface(f, encoding)
    return f

> It seems to me that no matter how you slice it, you can't have an
> abstract "buffering" layer that is independent of both the layer beneath
> and the layer above.

but that's the whole idea! buffering is a complicated task that must
*not* be rewritten for every type of underlying storage. if one wanted
to read or write lines over a socket, one shouldn't need to
reimplement file-like line buffering, the way socket.py does today.

i want to be able to read lines directly from any stream: socket, file,
or memory. how i choose to implement my HTTP parser is my only
concern; i don't want to be limited by the kind of stream my parser
works over.
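to make that concrete, here is a minimal sketch of such a reusable
layer (the class name and the read(n) interface are assumptions for
illustration, not the actual iostack API): it splits any byte stream
on a configurable marker, so the very same code serves sockets, files
and in-memory streams alike.

```python
import io

class LineBufferingLayer:
    """sketch of a reusable line-buffering layer (hypothetical API):
    the wrapped stream only needs a read(n) method returning bytes."""
    def __init__(self, stream, marker=b"\n", chunksize=512):
        self.stream = stream
        self.marker = marker
        self.chunksize = chunksize
        self.buf = b""

    def readline(self):
        # pull chunks until the marker shows up (or the stream ends)
        while self.marker not in self.buf:
            chunk = self.stream.read(self.chunksize)
            if not chunk:                       # EOF: flush what's left
                line, self.buf = self.buf, b""
                return line
            self.buf += chunk
        line, _, self.buf = self.buf.partition(self.marker)
        return line + self.marker

# the same layer, configured with a CRLF marker, over a memory stream
f = LineBufferingLayer(io.BytesIO(b"GET / HTTP/1.0\r\n\r\n"), marker=b"\r\n")
```

the point being that an HTTP parser built on readline() never needs
to know whether the bytes come from a socket or a file.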

> You'd insert a buffering layer at the appropriate point for whatever you're
> trying to do. The advantage of pulling the buffering out into a separate layer
> is that it can be reused with different byte sources & sinks by supplying the
> appropriate configuration parameters, instead of having to reimplement it for
> each different source/sink.


> I think buffering makes sense as the topmost layer, and typically only
> there.
> Encoding conversion and newline conversion should be performed a block
> at a time, below buffering, so not only I/O syscalls, but also
> invocations of the recoding machinery are amortized by buffering.

you have a good point, which i also stumbled upon when implementing
the TextInterface. but how would you suggest solving it?

write()ing is always simpler, because you already have the entire
buffer, which you can encode as a chunk.

when read()ing, you can decode() the entire pre-read buffer first,
but then you have a "tail" of undecodable data (an incomplete
character or record), which would be quite nasty to handle.
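for what it's worth, one way to sidestep that tail problem is an
incremental decoder, which keeps the undecodable remainder internally
between calls. a sketch using the stdlib codecs module (the stream
argument is assumed to be anything with a read(n) returning bytes):

```python
import codecs
import io

def iter_text(stream, encoding="utf-8", chunksize=4096):
    """decode a byte stream chunk by chunk; the incremental decoder
    holds any incomplete trailing character until the next chunk."""
    dec = codecs.getincrementaldecoder(encoding)()
    while True:
        raw = stream.read(chunksize)
        if not raw:
            break
        yield dec.decode(raw)
    yield dec.decode(b"", final=True)  # raises if a tail never completes

# a 3-byte character split across a chunk boundary decodes cleanly
data = io.BytesIO("ab\u20accd".encode("utf-8"))
text = "".join(iter_text(data, chunksize=3))
```

the "tail" never surfaces in user code; the decoder carries it
from one chunk to the next.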

besides, encoding suffers from many issues. suppose you have a
damaged UTF8 file, which you read char-by-char. when you reach the
damaged part, you'll never be able to "skip" it, as the reader will
just keep read()ing bytes, hoping to make a character out of them,
until it reaches EOF, i.e.:

def read_char(self):
    buf = ""
    while not self._stream.eof:
        buf += self._stream.read(1)
        try:
            return buf.decode("utf8")
        except ValueError:
            pass  # keep reading, hoping the next byte completes a char
which leads me to the following thought: maybe we should have
an "enhanced" encoding library for py3k, which would report
*incomplete* data differently from *invalid* data. today both are
just a ValueError: suppose decode() raised IncompleteDataError
when the given data is not sufficient to be decoded successfully,
and ValueError when the data is actually corrupted.

that could aid iostack greatly.
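as a rough illustration of the idea (the classification scheme here is
made up, not a proposed API), the stdlib's incremental decoders already
make that distinction observable: they buffer incomplete input and
raise only on genuinely corrupt bytes.

```python
import codecs

def classify(data, encoding="utf-8"):
    """sketch: tell incomplete byte sequences apart from invalid ones.
    returns ('ok', text), ('incomplete', tail) or ('invalid', position)."""
    dec = codecs.getincrementaldecoder(encoding)()
    try:
        text = dec.decode(data)
    except UnicodeDecodeError as e:
        return ("invalid", e.start)   # corrupt data: a real error
    tail = dec.getstate()[0]          # bytes still waiting for more input
    if tail:
        return ("incomplete", tail)   # the would-be IncompleteDataError case
    return ("ok", text)
```

with this, the read_char() loop above could stop retrying as soon as
it sees "invalid", instead of reading on until EOF.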

