[Python-3000] revamping the io stack, part 2

Sun Apr 30 17:18:41 CEST 2006

> I want
> the 90% case to require hardly any memorizing of what
> readers I need in what order.
...
> See, this is what I am worried about.  I **really** like not having to
> figure out what I need to do to read by lines from a file.  If the
> FileStream object had an __iter__ that did the proper wrapping with
> LineBufferedStream, then great, I'm happy.  But if we do not add some
> reasonable convenience functions or iterators

yes, i totally agree with that: we do need convenience functions.
take a look at this:

def file(filename, mode = "r", bufsize = None):
    # open the file (unbuffered), honoring the requested mode
    f = FileStream(filename, mode)
    # add buffering if requested
    if bufsize is not None:
        f = BufferedStream(f, bufsize)
    return f

def textfile(filename, *args):
    # add "text mode" on top of the (optionally buffered) byte stream,
    # forwarding mode and bufsize to file()
    return TextCodec(file(filename, *args))

so today's file() remains intact, but is accompanied by a textfile()
counterpart that opens in text mode. or we could add a "t" mode to
the file's mode list, but that's ugly.

and the TextCodec adds __iter__ over lines, so you *can* do
for line in textfile("c:\\blah"):
    pass

but not
for line in file("c:\\blah"):
    pass

because only text files have the notion of lines. i think you'd agree
that it's meaningless to iterate by *lines* over arbitrary streams, like
binary files or whatever. so it should be explicit that you want to
treat the data as text. and don't forget that line translation could
corrupt your binary data, etc.
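
to make the layering concrete, here is a minimal runnable sketch of the whole stack. everything in it is an assumption made for illustration, not a proposed implementation: FileStream is faked on top of the builtin open() in unbuffered binary mode, the read()/close() methods, the encoding and chunksize parameters, and the demo() helper are all hypothetical, and the line splitting only works for ASCII-compatible encodings such as UTF-8.

```python
import os
import tempfile


class FileStream:
    """Hypothetical raw byte stream: no buffering, no notion of lines."""

    def __init__(self, filename, mode="r"):
        # assumption: emulate an unbuffered stream with the builtin open()
        self._raw = open(filename, mode + "b", buffering=0)

    def read(self, count):
        return self._raw.read(count)

    def close(self):
        self._raw.close()


class BufferedStream:
    """Hypothetical explicit read buffer wrapped around any stream."""

    def __init__(self, stream, bufsize):
        self._stream = stream
        self._bufsize = bufsize
        self._buffer = b""

    def read(self, count):
        # refill in bufsize-sized chunks until we have enough data or hit EOF
        while len(self._buffer) < count:
            chunk = self._stream.read(self._bufsize)
            if not chunk:
                break
            self._buffer += chunk
        data, self._buffer = self._buffer[:count], self._buffer[count:]
        return data

    def close(self):
        self._stream.close()


class TextCodec:
    """Hypothetical text layer: the only layer that knows about lines."""

    def __init__(self, stream, encoding="utf-8", chunksize=4096):
        self._stream = stream
        self._encoding = encoding
        self._chunksize = chunksize

    def __iter__(self):
        pending = b""
        while True:
            chunk = self._stream.read(self._chunksize)
            if not chunk:
                break
            pending += chunk
            # everything before the last b"\n" is a complete line
            *complete, pending = pending.split(b"\n")
            for line in complete:
                if line.endswith(b"\r"):
                    line = line[:-1]  # line translation: "\r\n" becomes "\n"
                yield line.decode(self._encoding) + "\n"
        if pending:
            # last line of the file, with no trailing newline
            yield pending.decode(self._encoding)

    def close(self):
        self._stream.close()


def file(filename, mode="r", bufsize=None):
    f = FileStream(filename, mode)
    if bufsize is not None:
        f = BufferedStream(f, bufsize)
    return f


def textfile(filename, *args):
    return TextCodec(file(filename, *args))


def demo():
    # write a small file with mixed line endings, then read it back as text
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"first\r\nsecond\nthird")
    try:
        codec = textfile(tmp.name, "r", 4)
        lines = list(codec)
        codec.close()
        return lines
    finally:
        os.remove(tmp.name)
```

in this sketch only TextCodec defines __iter__, so looping over the result of file() fails with a TypeError, while looping over textfile() yields decoded lines. the "\r\n" handling also shows why it has to be opt-in: applied to binary data, the same step would silently drop bytes.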

> this is going to feel
> rather heavy-handed rather quickly.
so i hope it doesn't seem heavy or too complex now.

greetings,
-tomer

On 4/29/06, Brett Cannon <brett at python.org> wrote:
> On 4/29/06, tomer filiba <tomerfiliba at gmail.com> wrote:
> > i first thought of focusing on the socket module, because it's the part that
> > bothers me most, but since people have expressed their thoughts on
> > completely
> > revamping the IO stack, perhaps we should be open to adopting new ideas,
> > mainly from the java/.NET world (keeping the momentum from the previous
> > post).
> >
> > there is an inevitable issue of performance here, since it basically splits
> > what used to be "file" or "socket" into many layers... each adding
> > additional
> > overhead, so many parts should be lowered to C.
> >
> > if we look at java/.NET for guidance, they have come up with two concepts:
>
> I am a little wary of taking too much from Java/.NET since I have
> always found the I/O system way too heavy for the common case.  I
> can't remember what it takes to get a reader in Java in order to read
> by lines.  In Python, I love that I don't have to think about that;
> just pass a file object to 'for' and I am done.
>
> While I am all for allowing for more powerful I/O through stacking a
> stream within various readers (which feels rather functional to me,
> but that must just be because of my latest reading material), I want
> the 90% case to require hardly any memorizing of what
> readers I need in what order.
>
> > * stream - an arbitrary, usually sequential, byte data source
> > * readers and writers - the way data is encoded into/decoded from the
> > stream.
> > we'll use the term "codec" for these readers and writers in general.
> >
> > so "stream" is the "where" and "codec" is the "how", and the concept of
> > codecs is not limited to ASCII vs UTF-8. it can grow into fully-fledged
> > protocols.
> [SNIP - a whole lot of detailed ideas]
> > -----
> >
> > buffering is always *explicit* and implemented at the interpreter level,
> > rather than by libc, so it is consistent between all platforms and streams.
> > all streams, by nature, are *non-buffered* (write the data as soon as
> > possible). buffering wraps an underlying stream, making it explicit.
> >
> > class BufferedStream(Stream):
> >     def __init__(self, stream, bufsize): ...
> >     def flush(self): ...
> >
> > (BufferedStream appears in .NET)
> >
> > class LineBufferedStream(BufferedStream):
> >     def __init__(self, stream, flush_on = b"\n"): ...
> >
> > f = LineBufferedStream(FileStream("c:\\blah"))
> >
> > where flush_on specifies the byte (or sequence of bytes?) to flush upon
> > writing. by default it would be on newline.
> >
>
> See, this is what I am worried about.  I **really** like not having to
> figure out what I need to do to read by lines from a file.  If the
> FileStream object had an __iter__ that did the proper wrapping with
> LineBufferedStream, then great, I'm happy.  But if we do not add some
> reasonable convenience functions or iterators, this is going to feel
> rather heavy-handed rather quickly.
>
> -Brett
>