[Python-3000] Last call for PEP 3137: Immutable Bytes and Mutable Buffer

Mon Oct 1 01:25:20 CEST 2007

Thanks all for the focused and helpful discussion on this PEP. Here's
a new posting of the full text of the PEP as it now stands. Most of
the changes since the first posting are fleshing out of some details;
the decision to make the individual elements of bytes and buffer be
ints; and the decision to change bytes/str and buffer/str comparisons
again to just return False instead of raising TypeError.

(I'm not favorable towards the proposal of c'x' style literals or
changes to the I/O APIs to use different names for calls involving
bytes instead of text. If you still disagree, please start a new
thread with a new subject line.)

I plan to accept the PEP within a day or two barring major objections,
and expect to start implementing soon after.

--Guido

PEP: 3137
Title: Immutable Bytes and Mutable Buffer
Version: $Revision: 58290 $
Last-Modified: $Date: 2007-09-30 16:19:14 -0700 (Sun, 30 Sep 2007) $
Author: Guido van Rossum <guido at python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Sep-2007
Python-Version: 3.0
Post-History: 26-Sep-2007, 30-Sep-2007

Introduction
============

After releasing Python 3.0a1 with a mutable bytes type, pressure
mounted to add a way to represent immutable bytes.  Gregory P. Smith
proposed a patch that would allow making a bytes object temporarily
immutable by requesting that the data be locked using the new buffer
API from PEP 3118.  This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
make the bytes type immutable (by crudely removing all mutating APIs)
and fix the fall-out in the test suite.  This showed that there aren't
all that many places that depend on the mutability of bytes, with the
exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array
module as an ersatz mutable bytes type is far from ideal, and
recalling a proposal put forward earlier by Talin, I floated the
suggestion to have both a mutable and an immutable bytes type.  (This
had been brought up before, but until seeing the evidence of Jeffrey's
patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old
PyString implementation, stripped down to remove locale support and
implicit conversions to/from Unicode, for the immutable bytes type,
and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but
needs to be specified more precisely.  Hence this PEP.

Advantages
==========

One advantage of having an immutable bytes type is that code objects
can use these.  It also makes it possible to efficiently create hash
tables using bytes for keys; this may be useful when parsing protocols
like HTTP or SMTP which are based on bytes representing text.
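
As a minimal sketch of the hash-table point, the snippet below keys a
dict on bytes header names while splitting an HTTP-style header block.
The helper name ``parse_headers`` and the sample data are hypothetical
and not part of this PEP.

```python
def parse_headers(raw):
    """Parse b'Name: value' lines into a dict keyed by lower-cased bytes names."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if not line:
            continue  # skip the empty piece after the final CRLF
        name, _, value = line.partition(b":")
        # Immutable bytes objects are hashable, so they can serve as dict keys.
        headers[name.strip().lower()] = value.strip()
    return headers


sample = b"Host: example.org\r\nContent-Length: 42\r\n"
headers = parse_headers(sample)
print(headers[b"host"])
```

No decoding step is needed to look up a header, which is the efficiency
argument made above.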

Porting code that manipulates binary data (or encoded text) in Python
2.x will be easier using the new design than using the original 3.0
design with mutable bytes; simply replace ``str`` with ``bytes`` and
change '...' literals into b'...' literals.

Naming
======

I propose the following type names at the Python level:

  - ``bytes`` is an immutable array of bytes (PyString)

  - ``buffer`` is a mutable array of bytes (PyBytes)

  - ``memoryview`` is a bytes view on another object (PyMemory)

The old type named ``buffer`` is so similar to the new type
``memoryview``, introduced by PEP 3118, that it is redundant.  The rest
of this PEP doesn't discuss the functionality of ``memoryview``; it is
just mentioned here to justify getting rid of the old ``buffer`` type
so we can reuse its name for the mutable bytes type.

While eventually it makes sense to change the C API names, this PEP
maintains the old C API names, which should be familiar to all.

Literal Notations
=================

The b'...' notation introduced in Python 3.0a1 returns an immutable
bytes object, whatever variation is used.  To create a mutable bytes
buffer object, use buffer(b'...') or buffer([...]).  The latter may
use a list of integers in range(256).
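
A small sketch of the literal rules, assuming the mutable ``buffer``
type of this PEP behaves like the type spelled ``bytearray`` in
released Python 3 (the alias below is only there to keep the PEP's
spelling):

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

immutable = b"abc"                # a b'...' literal is always immutable bytes
from_literal = buffer(b"abc")     # mutable copy of a bytes literal
from_ints = buffer([97, 98, 99])  # mutable object from ints in range(256)

from_literal[0] = ord("x")        # in-place mutation works on the mutable type
```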

Functionality
=============

PEP 3118 Buffer API
-------------------

Both bytes and buffer implement the PEP 3118 buffer API.  The bytes
type only implements read-only requests; the buffer type allows
writable and data-locked requests as well.  The element data type is
always 'B' (i.e. unsigned byte).
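
The read-only versus writable distinction can be observed through
``memoryview``.  This is a sketch only; it assumes the PEP's ``buffer``
type corresponds to ``bytearray`` in released Python 3, and it does not
try to show data-locked requests.

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

ro_view = memoryview(b"abc")   # bytes exports a read-only view

target = buffer(b"abc")
rw_view = memoryview(target)   # the mutable type exports a writable view
rw_view[0] = ord("x")          # writes go through to the underlying object

print(ro_view.readonly, rw_view.readonly)
print(ro_view.format, rw_view.format)
```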

Constructors
------------

There are four forms of constructors applicable to both bytes and
buffer, plus a fifth form specific to buffer:

  - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``,
    ``buffer(<buffer>)``: simple copying constructors, with the note
    that ``bytes(<bytes>)`` might return its (immutable) argument.

  - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>,
    <encoding>[, <errors>])``: encode a text string.  Note that the
    ``str.encode()`` method returns an *immutable* bytes object.
    The <encoding> argument is mandatory; <errors> is optional.

  - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a
    bytes or buffer object from anything implementing the PEP 3118
    buffer API.

  - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``:
    construct an immutable bytes or mutable buffer object from a
    stream of integers in range(256).

  - ``buffer(<int>)``: construct a zero-initialized buffer of a given
    length.
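
The constructor forms listed above might look as follows in practice.
This is a sketch, assuming the PEP's ``buffer`` type corresponds to
``bytearray`` in released Python 3; the sample values are arbitrary.

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

copied = buffer(b"abc")                # copying constructor
encoded = bytes("caf\u00e9", "utf-8")  # encode text; the encoding is mandatory
from_view = bytes(memoryview(b"abc"))  # from a PEP 3118 buffer exporter
from_ints = bytes(range(65, 68))       # from an iterable of ints in range(256)
zeroed = buffer(4)                     # zero-initialized, length 4

try:
    bytes("abc")                       # text without an encoding is rejected
    missing_encoding_rejected = False
except TypeError:
    missing_encoding_rejected = True
```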

Comparisons
-----------

The bytes and buffer types are comparable with each other and
orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.

Comparing either type to a str object for equality returns False
regardless of the contents of either operand.  Ordering comparisons
with str raise TypeError.  This is all conformant to the standard
rules for comparison and ordering between objects of incompatible
types.

(**Note:** in Python 3.0a1, comparing a bytes instance with a str
instance would raise TypeError, on the premise that this would catch
the occasional mistake quicker, especially in code ported from Python
2.x.  However, a long discussion on the python-3000 list pointed out
so many problems with this that it is clearly a bad idea, to be rolled
back in 3.0a2 regardless of the fate of the rest of this PEP.)
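
The comparison rules of this section can be checked with a short
sketch, again assuming the PEP's ``buffer`` type corresponds to
``bytearray`` in released Python 3:

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

cross_type_equal = (b"abc" == buffer(b"abc"))
chained = (b"abc" == buffer(b"abc") < b"abd")  # the example from the text

# Equality against str is False whatever the contents.  (Assumes the
# interpreter is not started with -b or -bb, which warn or raise here.)
equal_to_str = (b"abc" == "abc")

try:
    b"abc" < "abc"
    ordering_raises = False
except TypeError:
    ordering_raises = True
```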

Slicing
-------

Slicing a bytes object returns a bytes object.  Slicing a buffer
object returns a buffer object.

Slice assignment to a mutable buffer object accepts anything that
implements the PEP 3118 buffer API, or an iterable of integers in
range(256).
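
A sketch of the slicing rules, assuming the PEP's ``buffer`` type
corresponds to ``bytearray`` in released Python 3:

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

data = buffer(b"hello world")
head = data[:5]                  # slicing the mutable type gives a mutable copy
tail = b"hello world"[6:]        # slicing bytes gives bytes

data[:5] = b"HELLO"              # assign from a buffer-API object
data[6:] = [87, 79, 82, 76, 68]  # assign from an iterable of ints in range(256)
data[5:6] = memoryview(b"__")    # a slice assignment may change the length
```

Note that ``head`` is an independent copy, so the later assignments to
``data`` do not affect it.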

Indexing
--------

Indexing bytes and buffer returns small ints (like the bytes type in
3.0a1, and like lists or array.array('B')).

Assignment to an item of a mutable buffer object accepts an int in
range(256).  (To assign from a bytes sequence, use a slice
assignment.)
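
The indexing rules in a short sketch, assuming the PEP's ``buffer``
type corresponds to ``bytearray`` in released Python 3.  The PEP does
not say which exception a bad item assignment raises, so the code only
records that one occurs.

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

first = b"abc"[0]     # an int, not a length-1 bytes object

data = buffer(b"abc")
data[0] = 65          # item assignment takes an int in range(256)
data[1:2] = b"B"      # assigning from a bytes sequence needs a slice

rejected = []
for bad in (256, b"x"):   # out-of-range int, then a bytes object
    try:
        data[2] = bad
    except (ValueError, TypeError):
        rejected.append(bad)
```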

Str() and Repr()
----------------

The str() and repr() functions return the same thing for these
objects.  The repr() of a bytes object returns a b'...' style literal.
The repr() of a buffer returns a string of the form "buffer(b'...')".
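
A quick check of these rules.  It assumes the PEP's ``buffer`` type
corresponds to ``bytearray`` in released Python 3, where the repr reads
``bytearray(b'...')`` rather than ``buffer(b'...')``.

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3, so its repr uses that name.
buffer = bytearray

b = b"abc"
bytes_repr = repr(b)
bytes_str_matches_repr = (str(b) == repr(b))

buf = buffer(b"abc")
buf_repr = repr(buf)
buf_str_matches_repr = (str(buf) == repr(buf))
```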

Operators
---------

The following operators are implemented by the bytes and buffer types,
except where mentioned:

  - ``b1 + b2``: concatenation.  With mixed bytes/buffer operands,
    the return type is that of the first argument (this seems arbitrary
    until you consider how ``+=`` works).

  - ``b1 += b2``: mutates b1 if it is a buffer object.

  - ``b * n``, ``n * b``: repetition; n must be an integer.

  - ``b *= n``: mutates b if it is a buffer object.

  - ``b1 in b2``, ``b1 not in b2``: substring test; b1 can be any
    object implementing the PEP 3118 buffer API.

  - ``i in b``, ``i not in b``: single-byte membership test; i must
    be an integer (if it is a length-1 bytes array, it is considered
    to be a substring test, with the same outcome).

  - ``len(b)``: the number of bytes.

  - ``hash(b)``: the hash value; only implemented by the bytes type.

Note that the % operator is *not* implemented.  It does not appear
worth the complexity.
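
The operator rules listed above, sketched under the assumption that
the PEP's ``buffer`` type corresponds to ``bytearray`` in released
Python 3:

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

mixed_a = b"ab" + buffer(b"cd")  # first operand is bytes, result is bytes
mixed_b = buffer(b"ab") + b"cd"  # first operand is mutable, result is mutable

acc = buffer(b"ab")
alias = acc
acc += b"cd"                     # in place: alias sees the change
acc *= 2

has_sub = b"bc" in b"abcd"       # substring test
has_byte = 98 in b"abcd"         # single-byte membership test with an int

try:
    hash(buffer(b"ab"))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False     # only the immutable type is hashable
```

The ``+=`` case shows why the result type of ``+`` follows the first
operand: ``acc`` stays the same mutable object throughout.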

Methods
-------

The following methods are implemented by bytes as well as buffer, with
similar semantics.  They accept anything that implements the PEP 3118
buffer API for bytes arguments, and return the same type as the object
whose method is called ("self")::

  .capitalize(), .center(), .count(), .decode(), .endswith(),
  .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
  .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
  .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
  .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
  .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
  .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python
2.x, with the exclusion of .encode().  The signatures and semantics
are the same too.  However, whenever character classes like letter,
whitespace, lower case are used, the ASCII definitions of these
classes are used.  (The Python 2.x str type uses the definitions from
the current locale, settable through the locale module.)  The
.encode() method is left out because of the more strict definitions of
encoding and decoding in Python 3000: encoding always takes a Unicode
string and returns a bytes sequence, and decoding always takes a bytes
sequence and returns a Unicode string.
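
A few of these method rules in a sketch, assuming the PEP's ``buffer``
type corresponds to ``bytearray`` in released Python 3.  The byte
``0xE9`` is chosen because it is a letter in Latin-1 but not in ASCII.

```python
# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

upper_bytes = b"abc".upper()               # bytes in, bytes out
upper_buf = buffer(b"abc").upper()         # mutable in, mutable out
joined = b", ".join([b"a", buffer(b"b")])  # arguments may be buffer-API objects
found = b"hello".find(buffer(b"ll"))

# Character classes are ASCII-only, regardless of the locale.
latin1_e_acute = b"\xe9"
is_alpha = latin1_e_acute.isalpha()
unchanged = latin1_e_acute.upper()

has_encode = hasattr(b"", "encode")        # .encode() is deliberately absent
```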

In addition, both types implement the class method ``.fromhex()``,
which constructs an object from a string containing hexadecimal values
(with or without spaces between the bytes).
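
For example (a sketch, with ``bytearray`` standing in for the PEP's
``buffer`` type as spelled in released Python 3):

```python
# Spaces between the bytes are optional.
spaced = bytes.fromhex("de ad be ef")

# Assumption: bytearray is the released spelling of the PEP's "buffer".
packed = bytearray.fromhex("deadbeef")
```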

The buffer type implements these additional methods from the
MutableSequence ABC (see PEP 3119):

  .extend(), .insert(), .append(), .reverse(), .pop(), .remove().
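
A sketch of these mutating methods, assuming the PEP's ``buffer`` type
corresponds to ``bytearray`` in released Python 3.  The comments track
the contents after each call.

```python
from collections.abc import MutableSequence

# Assumption: the PEP's mutable "buffer" type is spelled bytearray
# in released Python 3.
buffer = bytearray

data = buffer(b"bc")
data.append(100)     # b"bcd"
data.insert(0, 97)   # b"abcd"
data.extend(b"ef")   # b"abcdef"
last = data.pop()    # returns 102, leaving b"abcde"
data.remove(99)      # drops the first byte equal to 99, leaving b"abde"
data.reverse()       # b"edba"

is_mutable_sequence = isinstance(data, MutableSequence)
```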

Bytes and the Str Type
----------------------

Like the bytes type in Python 3.0a1, and unlike the relationship
between str and unicode in Python 2.x, any attempt to mix bytes (or
buffer) objects and str objects without specifying an encoding will
raise a TypeError exception.  (Comparisons follow the rules in the
Comparisons section above: comparing a bytes or buffer object to a str
object for equality simply returns False, and ordering comparisons
raise TypeError.)

Conversions between bytes or buffer objects and str objects must
always be explicit, using an encoding.  There are two equivalent APIs:
``str(b, <encoding>[, <errors>])`` is equivalent to
``b.decode(<encoding>[, <errors>])``, and
``bytes(s, <encoding>[, <errors>])`` is equivalent to
``s.encode(<encoding>[, <errors>])``.
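
The two pairs of equivalent spellings can be checked directly.  This
is a sketch; the sample text and the choice of UTF-8 are arbitrary.

```python
text = "na\u00efve"
raw = text.encode("utf-8")

encode_forms_agree = (bytes(text, "utf-8") == raw)
decode_forms_agree = (str(raw, "utf-8") == raw.decode("utf-8") == text)

# The optional <errors> argument works with either spelling.
lossy = str(b"\xff", "ascii", "replace")

# Mixing the two types without an encoding is an error.
try:
    b"abc" + "def"
    mixing_raises = False
except TypeError:
    mixing_raises = True
```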

There is one exception: we can convert from bytes (or buffer) to str
without specifying an encoding by writing ``str(b)``.  This produces
the same result as ``repr(b)``.  This exception is necessary because
of the general promise that *any* object can be printed, and printing
is just a special case of conversion to str.  There is however no
promise that printing a bytes object interprets the individual bytes
as characters (unlike in Python 2.x).

The str type currently implements the PEP 3118 buffer API.  While this
is perhaps occasionally convenient, it is also potentially confusing,
because the bytes accessed via the buffer API represent a
platform-dependent encoding: depending on the platform byte order and
a compile-time configuration option, the encoding could be UTF-16-BE,
UTF-16-LE, UTF-32-BE, or UTF-32-LE.  Worse, a different implementation
of the str type might completely change the bytes representation,
e.g. to UTF-8, or even make it impossible to access the data as a
contiguous array of bytes at all.  Therefore, the PEP 3118 buffer API
will be removed from the str type.
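
The effect of this removal can be seen with ``memoryview``, used here
only as a convenient way to request the buffer API.  A sketch: code
that wants the bytes of a string has to encode it first.

```python
try:
    memoryview("abc")               # str does not export the buffer API
    str_exports_buffer = True
except TypeError:
    str_exports_buffer = False

# Choosing the encoding explicitly makes the byte layout well defined.
view = memoryview("abc".encode("utf-8"))
```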

Pickling
--------

Left as an exercise for the reader.

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)