[Python-3000] Draft PEP for New IO system

Giovanni Bajo rasky at develer.com
Wed Feb 28 09:20:01 CET 2007

[reposting since the first time it didn't get through...]

On 26/02/2007 22.35, Mike Verdone wrote:

 > Daniel Stutzbach and I have prepared a draft PEP for the new IO system
 > for Python 3000. This document is, hopefully, true to the info that
 > Guido wrote on the whiteboards here at PyCon. This is still a draft
 > and there's quite a few decisions that need to be made. Feedback is
 > welcomed.

Thanks for this!

 > Raw I/O
 > The abstract base class for raw I/O is RawIOBase.  It has several
 > methods which are wrappers around the appropriate operating system
 > call.  If one of these functions would not make sense on the object,
 > the implementation must raise an IOError exception.  For example, if a
 > file is opened read-only, the .write() method will raise an IOError.
 > As another example, if the object represents a socket, then .seek(),
 > .tell(), and .truncate() will raise an IOError.
 >    .read(n: int) -> bytes
 >    .readinto(b: bytes) -> int
 >    .write(b: bytes) -> int

What are the requirements here?

- Can read()/readinto() return *fewer* bytes than specified?
- Can read() return a 0-sized byte object (=no data available)?
- Can read() return *more* bytes than specified (think of a datagram socket or
a decompressing stream)?
- Can readinto() read *fewer* bytes than specified?
- Can readinto() read zero bytes?
- Should read()/readinto() raise EOFError?
- Can write() write fewer bytes than specified?
- Can write() write zero bytes?

Please see also the examples at the end of the mail before providing an answer :)
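To make the short-read question concrete, here is a minimal sketch of the loop every caller would need if read() were allowed to return fewer bytes than requested. ShortReader and read_exactly are hypothetical names, and the sketch assumes a zero-sized result means end of stream, which is exactly one of the points the draft leaves open.

```python
import io


class ShortReader:
    """Hypothetical raw reader that never returns more than 3 bytes per call."""

    def __init__(self, data: bytes):
        self._src = io.BytesIO(data)

    def read(self, n: int) -> bytes:
        # Assumption: a short read is legal, and b"" means end of stream.
        return self._src.read(min(n, 3))


def read_exactly(raw, n: int) -> bytes:
    """Keep calling raw.read() until n bytes arrive or the stream ends."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:  # zero-sized result taken to mean EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


raw = ShortReader(b"abcdefghij")
first = read_exactly(raw, 8)  # needs three underlying read() calls
rest = read_exactly(raw, 8)   # stream ends before 8 bytes are available
```

If read() instead guaranteed exact sizes, or raised EOFError, this helper would change shape, which is why the answers matter to every implementation built on top.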

 >    .seek(pos: int, whence: int = 0) -> None
 >    .tell() -> int
 >    .truncate(n: int = None) -> None
 >    .close() -> None

Why should this very low-level basic type define *two* read methods? Assuming 
that readinto() is the most primitive, can we have the ABC RawIOBase provide a 
default read() method that calls readinto?

Consider providing more ABC/mixins to help implementations. 
ReadIOBase/WriteIOBase are pretty obvious:

class RawIOBase:
     def readable(self): return False
     def writeable(self): return False
     def seekable(self): return False

     def read(self,n): raise IOError
     def readinto(self,b): raise IOError
     def write(self,b): raise IOError
     def seek(self,pos,wh): raise IOError
     def tell(self): raise IOError
     def truncate(self,n=None): raise IOError

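class ReadIOBase(RawIOBase):
     def readable(self): return True
     def read(self, n):
         b = bytearray(n)        # mutable buffer of n zero bytes
         got = self.readinto(b)  # subclasses implement only readinto()
         return bytes(b[:got])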

class MySpecialReader(ReadIOBase):
     def readinto(self, b):
         # ....
         # must implement only this and nothing else

class MySpecialReaderWriter(ReadIOBase, WriteIOBase):
     def readinto(self, b):
         # ....
     def write(self, b):
         # ....
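A self-contained sketch of how these mixins could fit together. The WriteIOBase body and the MemoryReaderWriter class are assumptions made for illustration (the mail only names WriteIOBase), and readinto() is assumed to return the number of bytes it stored.

```python
class RawIOBase:
    def readable(self): return False
    def writeable(self): return False
    def readinto(self, b): raise IOError("not readable")
    def write(self, b): raise IOError("not writeable")


class ReadIOBase(RawIOBase):
    def readable(self): return True

    def read(self, n):
        b = bytearray(n)        # mutable scratch buffer
        got = self.readinto(b)  # the only primitive a subclass provides
        return bytes(b[:got])


class WriteIOBase(RawIOBase):
    # Hypothetical: the mixin is named in the mail but not spelled out.
    def writeable(self): return True


class MemoryReaderWriter(ReadIOBase, WriteIOBase):
    """Hypothetical in-memory stream: write() appends, readinto() consumes."""

    def __init__(self):
        self._data = bytearray()

    def readinto(self, b):
        n = min(len(b), len(self._data))
        b[:n] = self._data[:n]
        del self._data[:n]
        return n

    def write(self, b):
        self._data += b
        return len(b)


rw = MemoryReaderWriter()
rw.write(b"hello world")
head = rw.read(5)    # read() comes for free from ReadIOBase
tail = rw.read(100)  # short read: only 6 bytes are left
```

The concrete class writes two methods and inherits read(), readable() and writeable() from the mixins.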

 >     (should these "is_" functions be attributes instead?
 > "file.readable == True")

Yes, I think readable/writeable/seekable/fileno are a *perfect* fit for
attributes/properties. Each one provides a value without any side effects,
and each can be computed without doing O(n)-style computations.
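A small sketch of the attribute style, assuming a hypothetical RawFile class that derives its capabilities from a mode string. Read-only properties keep the cheap, side-effect-free lookup while preventing accidental assignment.

```python
class RawFile:
    """Hypothetical raw object exposing capabilities as read-only properties."""

    def __init__(self, mode: str):
        self._mode = mode

    @property
    def readable(self) -> bool:
        return "r" in self._mode or "+" in self._mode

    @property
    def writeable(self) -> bool:
        return "w" in self._mode or "+" in self._mode


f = RawFile("r")
can_read = f.readable    # attribute access, no call parentheses
can_write = f.writeable
```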

 > Buffered I/O
 > The next layer is the Buffer I/O layer which provides more efficient
 > access to file-like objects. The abstract base class for all Buffered

I think you probably want to let the user optionally specify the buffer size
for the four standard implementations.

 > Q: Do we want to mandate in the specification that switching between
 > reading to writing on a read-write object implies a .flush()?  Or is
 > that an implementation convenience that users should not rely on?

I'd be glad if calling flush() weren't a requirement for users of the class.
It always strikes me as an abstraction leak.
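As a sketch of what hiding the flush could look like, assume a hypothetical AutoFlushRW wrapper that queues writes and pushes them to the raw object itself whenever the caller switches to reading or seeking. io.BytesIO stands in for the raw object here.

```python
import io


class AutoFlushRW:
    """Hypothetical buffered read-write wrapper over a seekable raw object."""

    def __init__(self, raw):
        self._raw = raw
        self._pending = bytearray()  # writes not yet handed to raw

    def write(self, data):
        self._pending += data
        return len(data)

    def _flush_pending(self):
        if self._pending:
            self._raw.write(bytes(self._pending))
            self._pending.clear()

    def seek(self, pos):
        self._flush_pending()  # implicit flush before repositioning
        self._raw.seek(pos)

    def read(self, n):
        self._flush_pending()  # implicit flush on the write -> read switch
        return self._raw.read(n)


backing = io.BytesIO()
rw = AutoFlushRW(backing)
rw.write(b"hello")
before = backing.getvalue()  # nothing has reached the raw object yet
rw.seek(0)
data = rw.read(5)            # the user never called flush()
```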

 > TextIOBase class implementations additionally provide the following methods:
 >     .readline(self)
 >        Read until newline or EOF and return the line.
 >     .readlinesiter()
 >        Returns an iterator that returns lines from the file (which
 > happens to be 'self').
 >     .next()
 >        Same as readline()
 >     .__iter__()
 >        Same as readlinesiter()

Not sure why you need "readlinesiter()" at all. I thought Py3k was disposing
of most of the "fooiter()" functions (thinking of dicts...).
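A sketch of the iterator protocol alone covering what readlinesiter() would offer. LineReader is a hypothetical stand-in for a text file object, and the draft's .next() is spelled `__next__` here so the example runs on Python 3.

```python
class LineReader:
    """Hypothetical text object that iterates over lines without readlinesiter()."""

    def __init__(self, text: str):
        self._lines = text.splitlines(keepends=True)
        self._pos = 0

    def readline(self) -> str:
        # Returns "" at EOF, like the existing file objects do.
        if self._pos >= len(self._lines):
            return ""
        line = self._lines[self._pos]
        self._pos += 1
        return line

    def __iter__(self):
        return self  # the file is its own iterator

    def __next__(self) -> str:
        line = self.readline()
        if not line:
            raise StopIteration
        return line


lines = list(LineReader("a\nb\nc"))
```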

 > Another way to do it is as follows (we should pick one or the other):
 >     .__init__(self, buffer, encoding=None, newline=None)

I think this is clearer. I can't find a good real-world use case for requiring
the two-parameter version.


Now for a real example. Let's say I'm given a readable RawIOBase object.
I'm told that it's a foobar-compressed UTF-8 text file. I have this API available:

     class Foobar:
        # initialize decompressor

        # feed compressed bytes and get uncompressed bytes.
        # The uncompressed data can be smaller, equal or larger
        # than the compressed data
        decompress(bytes) -> bytes

        # finish decompression and get tail
        flush() -> bytes

This is basically similar to the way zlib.decompress/flush works. I would like 
to wrap the readable RawIOBase object in a way that I obtain a textual 
file-like with readline() etc.
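For reference, this is the zlib pattern that the hypothetical Foobar API mirrors: a decompressor object is fed compressed chunks of arbitrary size and flushed once at the end. The 16-byte chunk size and the sample text are arbitrary choices for the sketch.

```python
import zlib

text = "first line\nsecond line\n" * 50
compressed = zlib.compress(text.encode("utf-8"))

d = zlib.decompressobj()
out = bytearray()
# Feed the compressed stream in small pieces, as a raw reader would deliver it.
for i in range(0, len(compressed), 16):
    out += d.decompress(compressed[i:i + 16])
out += d.flush()  # collect whatever the decompressor still holds

result = out.decode("utf-8")
```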

This is pretty hard to do with the current I/O library (you need to write a
lot of code). It'd be good if the new I/O library made it easier to achieve.

Let's see. I start with a raw I/O reader:

class FoobarRaw(RawIOBase):
     def __init__(self, raw):
         self.raw = raw
         self._d = Foobar()
         self._buf = bytes()

     def readable(self):
         return True

     # I assume RawIOBase.read() must return the
     #   exact number of bytes (unless at the end).
     # I assume RawIOBase.read() raises EOFError when done
     # I assume readinto() does not exist...
     def read(self, n):
         try:
             while len(self._buf) < n:
                 b = self.raw.read(n)
                 self._buf += self._d.decompress(b)
         except EOFError:
             self._buf += self._d.flush()

         d = self._buf[:n]
         self._buf = self._buf[n:]  # slicing also works if bytes is immutable
         if not d:
             raise EOFError
         return d

and complete the job:

def foobar_open(raw):
     return TextIOWrapper(BufferedReader(FoobarRaw(raw)), encoding="utf-8")

for L in foobar_open(sock):
     ...  # handle each decoded line here

Uhm, looks great!
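A runnable sketch of the same layering, under two assumptions: the `io` module as it exists in Python 3's standard library (where a raw class implements readinto() and signals EOF by returning 0 rather than raising EOFError), and zlib standing in for the hypothetical Foobar decompressor. ZlibRaw and zlib_open are illustrative names.

```python
import io
import zlib


class ZlibRaw(io.RawIOBase):
    """Raw reader that decompresses another readable object on the fly."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self._d = zlib.decompressobj()
        self._buf = b""
        self._eof = False

    def readable(self):
        return True

    def readinto(self, b):
        # Pull compressed data until some plain bytes are available.
        while not self._buf and not self._eof:
            chunk = self._raw.read(64)
            if chunk:
                self._buf += self._d.decompress(chunk)
            else:
                self._buf += self._d.flush()
                self._eof = True
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n  # 0 means end of stream


def zlib_open(raw):
    return io.TextIOWrapper(io.BufferedReader(ZlibRaw(raw)), encoding="utf-8")


source = io.BytesIO(zlib.compress("héllo\nwörld\n".encode("utf-8")))
lines = list(zlib_open(source))
```

The raw class only knows about bytes; buffering, decoding and line splitting all come from the layers wrapped around it.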


Now, it might be interesting to play with the different possible semantics of
RawIOBase.read() that I listed above, and see how the implementation of
FoobarRaw.read() changes.

For instance (now being radical): why don't we drop the "n" argument 
altogether? We could just define it like this:

     # Returns a block of data, whose size is implementation-defined
     # and may vary between calls. It never returns a zero-sized block.
     # Raises EOFError when done.
     read() -> bytes

After all, there's a BufferedIO layer to handle buffering and exact-size 
reads/writes. If we go this way, the above example is even easier:

     def read(self):
         try:
             b = self.raw.read() # any size!
             return self._d.decompress(b)
         except EOFError:
             b = self._d.flush()
             if not b:
                 raise EOFError
             return b
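To check that a buffering layer really can restore exact-size reads on top of a size-less read(), here is a minimal sketch. BlockSource and ExactReader are hypothetical names, and the sketch assumes the buffering layer reports end of stream with a short or empty result rather than an exception.

```python
class BlockSource:
    """Hypothetical raw object: read() takes no size and raises EOFError."""

    def __init__(self, blocks):
        self._blocks = list(blocks)

    def read(self):
        if not self._blocks:
            raise EOFError
        return self._blocks.pop(0)  # block size varies between calls


class ExactReader:
    """Hypothetical buffering layer that restores read(n) on top of read()."""

    def __init__(self, raw):
        self._raw = raw
        self._buf = b""

    def read(self, n):
        try:
            while len(self._buf) < n:
                self._buf += self._raw.read()
        except EOFError:
            pass  # hand back whatever is left, possibly nothing
        data, self._buf = self._buf[:n], self._buf[n:]
        return data


r = ExactReader(BlockSource([b"ab", b"cdefg", b"h"]))
parts = [r.read(3), r.read(3), r.read(3), r.read(3)]
```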

It would also work well for sockets, since they would return exactly the
buffer of data that arrived from the network, and simply block once if there's
no data available.

Giovanni Bajo
