[Python-3000] Draft PEP for New IO system

Mon Feb 26 22:35:54 CET 2007

Hi all,

Daniel Stutzbach and I have prepared a draft PEP for the new IO system
for Python 3000. This document is, hopefully, true to the info that
Guido wrote on the whiteboards here at PyCon. This is still a draft
and there are quite a few decisions that need to be made. Feedback is
welcomed.

We've published it on Google Docs here:
http://docs.google.com/Doc?id=dfksfvqd_1cn5g5m

What follows is a plaintext version.

Thanks,

Mike.

PEP: XXX
Title: New IO
Version:
Last-Modified:
Authors: Daniel Stutzbach, Mike Verdone
Status: Draft
Type:
Created: 26-Feb-2007

Rationale and Goals
Python allows for a variety of file-like objects that can be worked
with via bare read() and write() calls using duck typing. Anything
that provides read() and write() is stream-like. However, more exotic
and extremely useful functions like readline() or seek() may or may
not be available on a file-like object. Python needs a specification
for basic byte-based IO streams to which we can add buffering and
text-handling features.

Once we have a defined raw byte-based IO interface, we can add
buffering and text-handling layers on top of any byte-based IO class.
The same buffering and text handling logic can be used for files,
sockets, byte arrays, or custom IO classes developed by Python
programmers. Developing a standard definition of a stream lets us
separate stream-based operations like read() and write() from
implementation-specific operations like fileno() and isatty(). It
encourages programmers to write code that uses streams as streams and
does not require that all streams support file-specific or
socket-specific operations.

The new IO spec is intended to be similar to the Java IO libraries,
but generally less confusing. Programmers who don't want to muck about
in the new IO world can expect that the open() factory method will
produce an object backwards-compatible with old-style file objects.
Specification
The Python I/O Library will consist of three layers: a raw I/O layer,
a buffered I/O layer, and a text I/O layer.  Each layer is defined by an
abstract base class, which may have multiple implementations.  The raw
I/O and buffered I/O layers deal with units of bytes, while the text
I/O layer deals with units of characters.
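
As a rough illustration of the three layers, the sketch below stacks
them by hand using the io module found in current Python 3 releases.
That module is an assumption here and not part of this draft: its class
names largely match the ones proposed in this document, but details
differ.  The helper name open_layered is hypothetical.

```python
import io
import os
import tempfile

def open_layered(path, encoding="utf-8"):
    """Hypothetical helper: build the raw -> buffered -> text stack by hand."""
    raw = io.FileIO(path, "r")            # raw layer: unbuffered bytes
    buffered = io.BufferedReader(raw)     # buffered layer: buffered bytes
    return io.TextIOWrapper(buffered, encoding=encoding)  # text layer: characters

# Usage: write a small file, then read it back through all three layers.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("na\u00efve\n".encode("utf-8"))

text = open_layered(demo_path)
layer_types = (type(text), type(text.buffer), type(text.buffer.raw))
content = text.read()
text.close()
os.remove(demo_path)
```

Each layer only talks to the one directly beneath it, which is what
lets the same buffering and text logic be reused over other raw
objects.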
Raw I/O
The abstract base class for raw I/O is RawIOBase.  It has several
methods which are wrappers around the appropriate operating system
call.  If one of these functions would not make sense on the object,
the implementation must raise an IOError exception.  For example, if a
file is opened read-only, the .write() method will raise an IOError.
As another example, if the object represents a socket, then .seek(),
.tell(), and .truncate() will raise an IOError.

    .read()
    .write()
    .seek()
    .tell()
    .truncate()
    .close()

Additionally, it defines a few other methods:

    (Should these "is_" functions be attributes instead?
    "file.readable == True")

    .is_readable()

       Returns True if the object was opened for reading, False
       otherwise.  If False, .read() will raise an IOError if called.

    .is_writable()

       Returns True if the object was opened for writing, False
       otherwise.  If False, .write() and .truncate() will raise an
       IOError if called.

    .is_seekable()  (Should this be called .is_random()?  Or
    .is_sequential() with opposite return values?)

       Returns True if the object supports random access (such as disk
       files), or False if the object only supports sequential access
       (such as sockets, pipes, and ttys).  If False, .seek(), .tell(),
       and .truncate() will raise an IOError if called.
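
To make the contract concrete, here is a minimal sketch of a read-only,
random-access raw object that follows the method names above.  The
class name ReadOnlyMemoryRaw and the argument lists of .read() and
.seek() are assumptions, since this draft does not yet fix the
signatures.

```python
class ReadOnlyMemoryRaw:
    """Hypothetical read-only, random-access raw object over in-memory bytes."""

    def __init__(self, data):
        self._data = bytes(data)
        self._pos = 0

    def is_readable(self):
        return True

    def is_writable(self):
        return False

    def is_seekable(self):
        return True

    def read(self, n=-1):
        # Assumed signature: a negative n means "read to the end".
        if n < 0:
            n = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

    def seek(self, pos):
        # Assumed signature: absolute, non-negative positions only.
        if pos < 0:
            raise IOError("negative seek position")
        self._pos = pos
        return pos

    def tell(self):
        return self._pos

    def write(self, data):
        # Operations that make no sense on this object raise IOError.
        raise IOError("object was not opened for writing")

    def truncate(self):
        raise IOError("object was not opened for writing")


raw = ReadOnlyMemoryRaw(b"hello world")
head = raw.read(5)
raw.seek(6)
tail = raw.read()
try:
    raw.write(b"!")
    write_failed = False
except IOError:
    write_failed = True
```

A caller that checks .is_writable() first can avoid the exception
entirely; the exception remains the backstop for callers that do not.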

Iff a RawIOBase implementation operates on an underlying file
descriptor, it must additionally provide a .fileno() member function.
This could be defined specifically by the implementation, or a mix-in
class could be used (Need to decide about this).

    .fileno()

       Returns the underlying file descriptor (an integer).
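
One way the mix-in option could look is sketched below.  Both class
names (FilenoMixin, PipeReader) and the private attribute _fd are
hypothetical; the draft has not decided between a mix-in and
per-implementation definitions.

```python
import os

class FilenoMixin:
    """Hypothetical mix-in for raw objects backed by a file descriptor."""

    def fileno(self):
        return self._fd


class PipeReader(FilenoMixin):
    """Hypothetical minimal raw reader over an OS-level descriptor."""

    def __init__(self, fd):
        self._fd = fd

    def read(self, n):
        return os.read(self._fd, n)

    def close(self):
        os.close(self._fd)


# Usage: push four bytes through a pipe and read them back.
read_fd, write_fd = os.pipe()
os.write(write_fd, b"ping")
os.close(write_fd)

reader = PipeReader(read_fd)
same_fd = reader.fileno() == read_fd
data = reader.read(4)
reader.close()
```

Objects with no descriptor behind them would simply not inherit the
mix-in, so the absence of .fileno() stays detectable by duck typing.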

Initially, three implementations will be provided that implement the
RawIOBase interface: FileIO, SocketIO, and ByteIO (also MMapIO?).
Each implementation must determine whether the object supports random
access, as the information provided by the user may not be sufficient
(consider open("/dev/tty", "rw") or open("/tmp/named-pipe", "rw")).  As
an example, FileIO can determine this by calling the seek() system
call; if it returns an error, the object does not support random
access.  Each implementation may provide additional methods
appropriate to its type.  The ByteIO object is analogous to Python 2's
cStringIO library, but operates on the new bytes type instead of
strings.
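
The seek-based probe described above can be sketched with os.lseek.
The function name fd_is_seekable is hypothetical, and the pipe result
shown in the comments is what POSIX systems report; other platforms
may behave differently.

```python
import os
import tempfile

def fd_is_seekable(fd):
    """Return True if a no-op seek succeeds on the descriptor."""
    try:
        # Ask for the current position; this moves nothing.
        os.lseek(fd, 0, os.SEEK_CUR)
    except OSError:
        return False
    return True

# A regular file supports random access.
with tempfile.TemporaryFile() as f:
    file_seekable = fd_is_seekable(f.fileno())

# A pipe is sequential; on POSIX the seek fails with ESPIPE.
read_fd, write_fd = os.pipe()
pipe_seekable = fd_is_seekable(read_fd)
os.close(read_fd)
os.close(write_fd)
```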
Buffered I/O
The next layer is the Buffered I/O layer, which provides more efficient
access to file-like objects.  The abstract base class for all Buffered
I/O implementations is BufferedIOBase, which provides similar methods
to RawIOBase:

    .read()
    .write()
    .seek()
    .tell()
    .truncate()
    .close()
    .is_readable()
    .is_writable()
    .is_seekable()

Additionally, the abstract base class provides one member variable:

    .raw

       Provides a reference to the underlying RawIOBase object.

The BufferedIOBase methods' syntax is identical to that of RawIOBase,
but may have different semantics.  In particular, BufferedIOBase
implementations may read more data than requested or delay writing
data using buffers.  For the most part, this will be transparent to
the user (unless, for example, they open the same file through a
different descriptor).
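
The read-ahead behaviour can be observed with the io module of current
Python 3 releases, used here as a stand-in for the proposed classes
(an assumption; its raw interface uses readable() and readinto()
instead of the names in this draft).  CountingRaw is a hypothetical
helper that counts how often the buffered layer asks the raw layer for
data.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw reader over in-memory bytes that counts raw reads."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, buf):
        self.calls += 1
        return self._inner.readinto(buf)


raw = CountingRaw(b"x" * 1000)
buffered = io.BufferedReader(raw, buffer_size=4096)

first = buffered.read(1)                 # triggers a raw read larger than 1 byte
calls_after_first = raw.calls
rest = [buffered.read(1) for _ in range(99)]   # served from the buffer
calls_after_hundred = raw.calls
```

The hundred one-byte reads cost no more raw reads than the first one
did, because the remaining bytes were already sitting in the buffer.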

There are four implementations of the BufferedIOBase abstract base
class, described below.
BufferedReader
The BufferedReader implementation is for sequential-access read-only
objects.  It does not provide a .flush() method, since there is no
sensible circumstance where the user would want to discard the read
buffer.
BufferedWriter
The BufferedWriter implementation is for sequential-access write-only
objects.  It provides a .flush() method, which forces all cached data
to be written to the underlying RawIOBase object.
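
A small sketch of the delayed-write behaviour follows, again using the
io module of current Python 3 releases as a stand-in (an assumption).
RecordingRaw is a hypothetical raw object that remembers every chunk
it is asked to write.

```python
import io

class RecordingRaw(io.RawIOBase):
    """Hypothetical raw writer that records each chunk it receives."""

    def __init__(self):
        self.chunks = []

    def writable(self):
        return True

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


raw = RecordingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)

buffered.write(b"abc")
buffered.write(b"def")
seen_before_flush = list(raw.chunks)     # nothing has reached the raw object yet

buffered.flush()                         # force the cached data down
seen_after_flush = b"".join(raw.chunks)
```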
BufferedRWPair
The BufferedRWPair implementation is for sequential-access read-write
objects such as sockets and ttys.  As the read and write streams of
these objects are completely independent, it could be implemented by
simply incorporating a BufferedReader and BufferedWriter instance.  It
provides a .flush() method that has the same semantics as a
BufferedWriter's .flush() method.
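
The composition idea can be sketched as follows.  SimpleRWPair is a
hypothetical class, not the proposed implementation, and it borrows
BufferedReader and BufferedWriter from the io module of current
Python 3 releases (an assumption).  Two BytesIO objects stand in for
the two independent directions of a socket.

```python
import io

class SimpleRWPair:
    """Hypothetical pair built from one buffered reader and one buffered writer."""

    def __init__(self, raw_reader, raw_writer):
        self._reader = io.BufferedReader(raw_reader)
        self._writer = io.BufferedWriter(raw_writer)

    def read(self, n=-1):
        return self._reader.read(n)

    def write(self, data):
        return self._writer.write(data)

    def flush(self):
        # Same semantics as the writer's flush; the read side is untouched.
        self._writer.flush()


incoming = io.BytesIO(b"ping")   # stand-in for the receive direction
outgoing = io.BytesIO()          # stand-in for the send direction
pair = SimpleRWPair(incoming, outgoing)

pair.write(b"pong")
sent_before_flush = outgoing.getvalue()
pair.flush()
sent_after_flush = outgoing.getvalue()
received = pair.read(4)
```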
BufferedRandom
The BufferedRandom implementation is for all random-access objects,
whether they are read-only, write-only, or read-write.  Compared to
the previous classes that operate on sequential-access objects, the
BufferedRandom class must contend with the user calling .seek() to
reposition the stream.  Therefore, an instance of BufferedRandom must
keep track of both the logical and true position within the object.
It provides a .flush() method that forces all cached write data to be
written to the underlying RawIOBase object and all cached read data to
be forgotten (so that future reads are forced to go back to the disk).
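
The gap between the logical and the true position can be seen with
the BufferedRandom class from the io module of current Python 3
releases, used here as a stand-in (an assumption).  A BytesIO object
plays the role of the random-access raw object.

```python
import io

raw = io.BytesIO(b"0123456789" * 10)   # 100 bytes, random-access stand-in
buffered = io.BufferedRandom(raw)

first = buffered.read(1)
logical_pos = buffered.tell()   # position as the user sees it
true_pos = raw.tell()           # position of the underlying object after read-ahead

buffered.seek(50)               # reposition the logical stream
after_seek = buffered.read(2)
```

After the one-byte read the logical position is 1, while the
underlying object has typically been read further ahead to fill the
buffer.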

Q: Do we want to mandate in the specification that switching between
reading and writing on a read-write object implies a .flush()?  Or is
that an implementation convenience that users should not rely on?

For a read-only BufferedRandom object, .is_writable() returns False and
the .write() and .truncate() methods raise IOError.

For a write-only BufferedRandom object, .is_readable() returns False and
the .read() method raises IOError.
Text I/O
The text I/O layer provides functions to read and write strings from
streams. Some new features include universal newlines and character
set encoding and decoding.  The Text I/O layer is defined by a
TextIOBase abstract base class.  It provides several methods that are
similar to the BufferedIOBase methods, but operate on a per-character
basis instead of a per-byte basis.  These methods are:

    .read()
    .write()
    .seek()
    .tell()
    .truncate()
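
The difference between per-byte and per-character reads shows up as
soon as the data contains a multi-byte character.  The sketch uses
TextIOWrapper from the io module of current Python 3 releases as a
stand-in for the proposed text layer (an assumption).

```python
import io

# The accented letter occupies two bytes in UTF-8.
payload = "caf\u00e9 au lait".encode("utf-8")

byte_stream = io.BytesIO(payload)
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")

four_bytes = byte_stream.read(4)   # stops in the middle of the accented letter
four_chars = text_stream.read(4)   # always returns whole characters
```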

TextIOBase implementations also provide several methods that are
pass-throughs to the underlying BufferedIOBase object:

    .close()
    .is_readable()
    .is_writable()
    .is_seekable()

TextIOBase class implementations additionally provide the following methods:

    .readline()

       Reads until newline or EOF and returns the line.

    .readlinesiter()

       Returns an iterator that returns lines from the file (which
       happens to be 'self').

    .next()

       Same as readline()

    .__iter__()

       Same as readlinesiter()

    .__enter__()

       Context management protocol. Returns self.

    .__exit__()

       Context management protocol. No-op.
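
For comparison, this is how line reading, iteration, and the context
management protocol behave in the io module of current Python 3
releases (an assumption used for illustration).  Two differences from
this draft are worth noting: that module has no .readlinesiter(), and
its __exit__ closes the stream, where this draft specifies a no-op.

```python
import io

data = b"first\nsecond\nthird\n"

with io.TextIOWrapper(io.BytesIO(data), encoding="ascii") as stream:
    first_line = stream.readline()          # up to and including the newline
    iter_is_self = iter(stream) is stream   # the stream is its own iterator
    remaining = [line for line in stream]   # continues where readline() stopped

closed_after_with = stream.closed
```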

Two implementations will be provided by the Python library.  The
primary implementation, TextIOWrapper, wraps a Buffered I/O object.
Each TextIOWrapper object has a property named ".buffer" that provides
a reference to the underlying BufferedIOBase object.  Its initializer
has the following signature:

    .__init__(self, buffer, encoding=None, universal_newlines=True, crlf=None)

       "Buffer" is a reference to the BufferedIOBase object to be
       wrapped with the TextIOWrapper.  "Encoding" refers to an encoding
       to be used for translating between the byte representation and
       the character representation.  If "None", then the system's
       locale setting will be used as the default.  If
       "universal_newlines" is true, then the TextIOWrapper will
       automatically translate the bytes "\r\n" into a single newline
       character during reads.  If "crlf" is True, then a newline will
       be written as "\r\n".  If "crlf" is False, then a newline will be
       written as "\n".  If "crlf" is None, then a system-specific
       default will be used.

Another way to do it is as follows (we should pick one or the other):

    .__init__(self, buffer, encoding=None, newline=None)

       Same as above, but if newline is not None, use that as the
       newline pattern (for reading and writing).  If newline is not
       set, attempt to detect the newline pattern from the file; if that
       fails for some reason, use the system default newline pattern.
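
A single newline argument along these lines exists in the
TextIOWrapper of current Python 3 releases, which is used below purely
as an illustration (an assumption; its exact rules differ from both
options above, for example it does not detect a pattern from the
file).  With newline=None it translates every line-ending style to
"\n" on reads; with an explicit pattern it writes that pattern for
every "\n".

```python
import io

# Reading: newline=None enables universal newline translation.
mixed = io.TextIOWrapper(io.BytesIO(b"a\r\nb\rc\n"),
                         encoding="ascii", newline=None)
lines = mixed.readlines()

# Writing: an explicit pattern replaces each "\n" on output.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="ascii", newline="\r\n")
writer.write("one\ntwo\n")
writer.flush()                 # push the text through to the byte sink
written = sink.getvalue()
```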

Another implementation, StringIO, creates a file-like TextIO
implementation without an underlying Buffered I/O object.  While similar
functionality could be provided by wrapping a ByteIO object in a
Buffered I/O object in a TextIOWrapper, the String I/O object allows
for much greater efficiency as it does not need to actually perform
encoding and decoding.  A String I/O object can just store the encoded
string as-is.  The String I/O object's __init__ signature is similar
to the TextIOWrapper, but without the "buffer" parameter.
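
The two routes can be compared side by side using the io module of
current Python 3 releases (an assumption; its StringIO constructor
takes an initial value and a newline argument, not the signature
proposed above).  Both end up holding the same text, but only the
layered route has to encode on the way in and decode on the way out.

```python
import io

message = "gr\u00fc\u00df dich\n"

# Direct route: characters are stored with no byte-level round trip.
direct = io.StringIO()
direct.write(message)
direct_value = direct.getvalue()

# Layered route: the same text passes through an encoder and a byte buffer.
layered_bytes = io.BytesIO()
layered = io.TextIOWrapper(layered_bytes, encoding="utf-8", newline="\n")
layered.write(message)
layered.flush()
layered_value = layered_bytes.getvalue().decode("utf-8")
```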

END OF PEP