[Python-3000] Draft PEP for New IO system
Giovanni Bajo
rasky at develer.com
Wed Feb 28 09:20:01 CET 2007
[reposting since the first time it didn't get through...]
On 26/02/2007 22.35, Mike Verdone wrote:
> Daniel Stutzbach and I have prepared a draft PEP for the new IO system
> for Python 3000. This document is, hopefully, true to the info that
> Guido wrote on the whiteboards here at PyCon. This is still a draft
> and there's quite a few decisions that need to be made. Feedback is
> welcomed.
Thanks for this!
> Raw I/O
> The abstract base class for raw I/O is RawIOBase. It has several
> methods which are wrappers around the appropriate operating system
> call. If one of these functions would not make sense on the object,
> the implementation must raise an IOError exception. For example, if a
> file is opened read-only, the .write() method will raise an IOError.
> As another example, if the object represents a socket, then .seek(),
> .tell(), and .truncate() will raise an IOError.
>
> .read(n: int) -> bytes
> .readinto(b: bytes) -> int
> .write(b: bytes) -> int
What are the requirements here?
- Can read()/readinto() return *fewer* bytes than specified?
- Can read() return a 0-sized byte object (=no data available)?
- Can read() return *more* bytes than specified (think of a datagram socket or
  a decompressing stream)?
- Can readinto() read *fewer* bytes than specified?
- Can readinto() read zero bytes?
- Should read()/readinto() raise EOFError?
- Can write() write fewer bytes than specified?
- Can write() write zero bytes?
Please, see also the examples at the end of the mail before providing an answer :)
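To make the questions concrete, here is a rough sketch of what every caller ends up writing if read() is allowed to return fewer bytes than requested and signals EOF with an empty result. Both behaviours are assumptions on my part, not something the draft states, and read_exactly and ChunkyReader are made-up names used only for illustration.

```python
import io


def read_exactly(raw, n):
    """Collect exactly n bytes from raw, or fewer only at EOF.

    Assumes raw.read(k) may return fewer than k bytes and returns an
    empty bytes object at end of stream (an assumption, not the spec).
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:              # assumed EOF signal
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ChunkyReader:
    """Hypothetical reader that never returns more than 3 bytes per call."""

    def __init__(self, data):
        self._src = io.BytesIO(data)

    def read(self, n):
        return self._src.read(min(n, 3))
```

If the answer to the first question is "no", this loop moves into every raw implementation instead of living in one place.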
> .seek(pos: int, whence: int = 0) -> None
> .tell() -> int
> .truncate(n: int = None) -> None
> .close() -> None
Why should this very low-level basic type define *two* read methods? Assuming
that readinto() is the most primitive, can we have the ABC RawIOBase provide a
default read() method that calls readinto?
Consider providing more ABC/mixins to help implementations.
ReadIOBase/WriteIOBase are pretty obvious:
    class RawIOBase:
        def readable(self): return False
        def writeable(self): return False
        def seekable(self): return False
        def read(self, n): raise IOError
        def readinto(self, b): raise IOError
        def write(self, b): raise IOError
        def seek(self, pos, wh): raise IOError
        def tell(self): raise IOError
        def truncate(self, n=None): raise IOError

    class ReadIOBase(RawIOBase):
        def readable(self): return True
        def read(self, n):
            b = bytes(n)  # whatever
            self.readinto(b)
            return b

    class MySpecialReader(ReadIOBase):
        def readinto(self, b):
            # ....
            # must implement only this and nothing else

    class MySpecialReaderWriter(ReadIOBase, WriteIOBase):
        def readinto(self, b):
            # ....
        def write(self, b):
            # ....
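A version of the same idea that actually runs, as a sketch only: it assumes readinto() fills a mutable buffer, returns the number of bytes it stored, and may store fewer than requested. ZeroReader is a toy source invented for the example.

```python
class ReadIOBase:
    """Hypothetical mixin: subclasses implement only readinto()."""

    def readable(self):
        return True

    def readinto(self, b):
        raise IOError("readinto() not implemented")

    def read(self, n):
        buf = bytearray(n)            # mutable scratch buffer
        count = self.readinto(buf)    # assumed to return the bytes filled
        return bytes(buf[:count])     # trim the result after a short read


class ZeroReader(ReadIOBase):
    """Toy source that yields `total` zero bytes, then EOF."""

    def __init__(self, total):
        self._left = total

    def readinto(self, b):
        count = min(len(b), self._left)
        b[:count] = bytes(count)      # store `count` zero bytes
        self._left -= count
        return count
```

The default read() costs one extra copy, which seems an acceptable price for implementers having to write a single method.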
> (should these "is_" functions be attributes instead?
> "file.readable == True")
Yes, I think readable/writeable/seekable/fileno *perfectly* match the good
usage of attributes/properties. They all provide a value without any
side-effect and that can be computed without doing O(n)-style computations.
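Something along these lines is what I have in mind. This is only a sketch; RawThing and its constructor flags are invented for the example.

```python
class RawThing:
    """Hypothetical raw object exposing capabilities as read-only properties."""

    def __init__(self, can_read, can_write):
        self._can_read = can_read
        self._can_write = can_write

    @property
    def readable(self):
        return self._can_read

    @property
    def writeable(self):
        return self._can_write

    def write(self, b):
        # The capability check reads like a plain attribute access.
        if not self.writeable:
            raise IOError("object is not writeable")
        return len(b)
```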
> Buffered I/O
> The next layer is the Buffer I/O layer which provides more efficient
> access to file-like objects. The abstract base class for all Buffered
I think you probably want the buffer size to be optionally specified by the
user, for the four standard implementations.
> Q: Do we want to mandate in the specification that switching between
> reading to writing on a read-write object implies a .flush()? Or is
> that an implementation convenience that users should not rely on?
I'd be glad if using flush() wasn't a requirement for users of the class. It
always strikes me as an abstraction leak.
> TextIOBase class implementations additionally provide the following methods:
>
> .readline(self)
>
> Read until newline or EOF and return the line.
>
> .readlinesiter()
>
> Returns an iterator that returns lines from the file (which
> happens to be 'self').
>
> .next()
>
> Same as readline()
>
> .__iter__()
>
> Same as readlinesiter()
Not sure why you need "readlinesiter()" at all. I thought Py3k was disposing
of most of the "fooiter()" functions (thinking of dicts...).
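The iterator protocol on top of readline() seems to be all that is needed. A minimal sketch, assuming readline() returns an empty string at EOF as the current file API does (LineReader is a made-up name):

```python
import io


class LineReader:
    """Hypothetical text object: readline() plus the iterator protocol."""

    def __init__(self, text):
        self._src = io.StringIO(text)

    def readline(self):
        # Returns "" at EOF, like the existing file API.
        return self._src.readline()

    def __iter__(self):
        return self

    def __next__(self):               # spelled .next() in the draft
        line = self.readline()
        if not line:
            raise StopIteration
        return line
```

With this, "for line in f" works and a separate readlinesiter() adds nothing.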
> Another way to do it is as follows (we should pick one or the other):
>
> .__init__(self, buffer, encoding=None, newline=None)
I think this is clearer. I can't find a good real-world use case for requiring
the two-parameter version.
==========================================================================
Now for a real example. Let's say I'm given a readable RawIOBase object.
I'm told that it's a foobar-compressed UTF-8 text file. I have this API available:
    class Foobar:
        # initialize decompressor
        __init__()
        # feed compressed bytes and get uncompressed bytes.
        # The uncompressed data can be smaller, equal or larger
        # than the compressed data
        decompress(bytes) -> bytes
        # finish decompression and get tail
        flush() -> bytes
This is basically similar to the way zlib.decompress/flush works. I would like
to wrap the readable RawIOBase object so that I obtain a textual
file-like object with readline() etc.
This is pretty hard to do with the current I/O library (you need to write a
lot of code). It'd be good if the new I/O library made it easier to achieve.
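For reference, this is the zlib pattern I am comparing against, using zlib.decompressobj() with its decompress() and flush() methods. The helper name decompress_chunks and the 7-byte chunk size are arbitrary choices for the sketch.

```python
import zlib


def decompress_chunks(chunks):
    """Yield decompressed blocks from an iterable of compressed chunks."""
    d = zlib.decompressobj()
    for chunk in chunks:
        out = d.decompress(chunk)     # may be empty, smaller or larger
        if out:
            yield out
    tail = d.flush()                  # whatever is still buffered
    if tail:
        yield tail


payload = b"hello foobar\n" * 1000
compressed = zlib.compress(payload)
# Feed the compressed stream in small, arbitrary pieces.
pieces = [compressed[i:i + 7] for i in range(0, len(compressed), 7)]
restored = b"".join(decompress_chunks(pieces))
```

Note how the size of each output block has no fixed relation to the size of the input chunk, which is exactly what makes an exact-size read() awkward to implement.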
Let's see. I start with a raw I/O reader:
    class FoobarRaw(RawIOBase):
        def __init__(self, raw):
            self.raw = raw
            self._d = Foobar()
            self._buf = bytes()

        def readable(self):
            return True

        # I assume RawIOBase.read() must return the
        # exact number of bytes (unless at the end).
        # I assume RawIOBase.read() raises EOFError when done
        # I assume readinto() does not exist...
        def read(self, n):
            try:
                while len(self._buf) < n:
                    b = self.raw.read(n)
                    self._buf += self._d.decompress(b)
            except EOFError:
                self._buf += self._d.flush()
            d = self._buf[:n]
            del self._buf[:n]
            if not d:
                raise EOFError
            return d
and complete the job:
    def foobar_open(raw):
        return TextIOWrapper(BufferedReader(FoobarRaw(raw)), encoding="utf-8")

    for L in foobar_open(sock):
        print(L)
Uhm, looks great!
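For anyone who wants to try the stacking idea, here is a sketch that runs under different assumptions from the ones I listed above: readinto() is the primitive, it may return fewer bytes than the buffer holds, and it returns 0 at EOF instead of raising. zlib stands in for the imaginary Foobar codec, and ZlibRaw and zlib_open are invented names.

```python
import io
import zlib


class ZlibRaw(io.RawIOBase):
    """Raw reader that decompresses another byte stream on the fly."""

    def __init__(self, raw):
        self._raw = raw
        self._d = zlib.decompressobj()
        self._buf = b""
        self._eof = False

    def readable(self):
        return True

    def readinto(self, b):
        # Pull compressed data until there is something to hand out.
        while not self._buf and not self._eof:
            chunk = self._raw.read(4096)
            if chunk:
                self._buf = self._d.decompress(chunk)
            else:
                # Empty read from the source is treated as EOF (assumption).
                self._buf = self._d.flush()
                self._eof = True
        n = min(len(b), len(self._buf))
        b[:n] = self._buf[:n]
        self._buf = self._buf[n:]
        return n                      # 0 means EOF


def zlib_open(raw):
    return io.TextIOWrapper(io.BufferedReader(ZlibRaw(raw)), encoding="utf-8")


source = io.BytesIO(zlib.compress("first\nsecond\n".encode("utf-8")))
lines = list(zlib_open(source))
```

The raw class only has to deal with "give me some bytes"; exact sizes, line splitting and decoding are all handled by the layers above it.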
==========================================================================
Now, it might be interesting to play with the different semantics of
RawIOBase.read() which I proposed above, and see how the implementation of
FoobarRaw.read() changes.
For instance (now being radical): why don't we drop the "n" argument
altogether? We could just define it like this:
    # Returns a block of data, whose size is implementation-defined
    # and may vary between calls. It never returns a zero-sized block.
    # Raises EOFError when done.
    read() -> bytes
After all, there's a BufferedIO layer to handle buffering and exact-size
reads/writes. If we go this way, the above example is even easier:
    def read(self):
        try:
            b = self.raw.read()  # any size!
            return self._d.decompress(b)
        except EOFError:
            b = self._d.flush()
            if not b:
                raise EOFError
            return b
It would also work well for sockets, since they would return exactly the
buffer of data that arrived from the network, and simply block once if there's
no data available.
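To show that exact-size reads can live entirely in the buffering layer, here is a toy sketch built on the size-less read() proposed above. BlockSource and ExactReader are made-up names, and returning a short or empty result at EOF from the buffered layer (instead of raising) is my own choice for the example.

```python
class BlockSource:
    """Hypothetical raw object: read() takes no size and returns one block."""

    def __init__(self, blocks):
        self._blocks = iter(blocks)

    def read(self):
        try:
            return next(self._blocks)
        except StopIteration:
            raise EOFError from None


class ExactReader:
    """Toy buffering layer turning variable-size blocks into exact-size reads."""

    def __init__(self, raw):
        self._raw = raw
        self._buf = bytearray()

    def read(self, n):
        try:
            while len(self._buf) < n:
                self._buf += self._raw.read()
        except EOFError:
            pass                      # hand out whatever is left
        data = bytes(self._buf[:n])
        del self._buf[:n]
        return data
```

For example, ExactReader(BlockSource([b"ab", b"cdefg", b"h"])).read(4) gives b"abcd", regardless of how the source happened to split its blocks.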
--
Giovanni Bajo