[Python-Dev] Pre-PEP: The "bytes" object

Neil Schemenauer nas at arctrix.com
Thu Feb 16 03:55:16 CET 2006


This could be a replacement for PEP 332.  At least I hope it can
serve to summarize the previous discussion and help focus on the
currently undecided issues.

I'm too tired to dig up the rules for assigning it a PEP number.
Also, there are probably silly typos, etc.   Sorry.

  Neil
-------------- next part --------------
PEP: XXX
Title: The "bytes" object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas at arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract
========

This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode based
str objects.


Motivation
==========

Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This overloading
of purpose leads to confusion and bugs. In future versions of Python,
string objects will be used for holding character data. The bytes object
will fulfil the role of a byte container. Eventually the unicode
built-in will be renamed to str and the str object will be removed.


Specification
=============

A bytes object stores a mutable sequence of integers that are in the
range 0 to 255.  Unlike string objects, indexing a bytes object returns
an integer.  Assigning an element using an object that is not an integer
causes a TypeError exception.  Assigning an element to a value outside
the range 0 to 255 causes a ValueError exception.  The __len__ method of
bytes returns the number of integers stored in the sequence (i.e. the
number of bytes).
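
As a rough illustration of these indexing and assignment rules, the
sketch below uses Python 3's bytearray type as a stand-in.  That type
postdates this draft and is not the object specified here; it is
assumed only because it happens to follow the same conventions.

```python
# Stand-in only: Python 3's bytearray, not the object specified here.
buf = bytearray([72, 105, 33])

first = buf[0]       # indexing yields an int, not a 1-character string
length = len(buf)    # number of stored bytes

buf[2] = 63          # an integer in the range 0 to 255 is accepted

errors = []
try:
    buf[0] = 256     # integer outside the range 0 to 255
except ValueError as exc:
    errors.append(type(exc).__name__)
try:
    buf[0] = "H"     # not an integer
except TypeError as exc:
    errors.append(type(exc).__name__)

print(first, length, list(buf), errors)
```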

The constructor of the bytes object has the following signature:

    bytes([initialiser[, encoding]])

If no arguments are provided then an object containing zero elements is
created and returned.  The initialiser argument can be a string or a
sequence of integers.  The pseudo-code for the constructor is:

    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if encoding is None or encoding.lower() == 'ascii':
                # raises UnicodeDecodeError if the string contains
                # non-ASCII characters
                initialiser = initialiser.encode('ascii')
            elif isinstance(initialiser, unicode):
                initialiser = initialiser.encode(encoding)
            else:
                # silently ignore the encoding argument if the
                # initialiser is a str object
                pass
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object
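
For readers who want something executable, here is one possible
Python 3 rendering of the pseudo-code.  It is a sketch under stated
assumptions, not the proposed implementation: make_bytes is a
hypothetical name, Python 3's str stands in for unicode, Python 3's
bytes stands in for the 8-bit str, and bytearray stands in for the
new object.

```python
def make_bytes(initialiser=(), encoding=None):
    """Hypothetical Python 3 analogue of the constructor pseudo-code."""
    if isinstance(initialiser, (str, bytes)):
        if encoding is None or encoding.lower() == "ascii":
            if isinstance(initialiser, bytes):
                # 8-bit string: reject non-ASCII bytes
                # (raises UnicodeDecodeError).
                initialiser.decode("ascii")
                data = initialiser
            else:
                # Text: raises UnicodeEncodeError on non-ASCII characters.
                data = initialiser.encode("ascii")
        elif isinstance(initialiser, str):
            data = initialiser.encode(encoding)
        else:
            # Encoding is silently ignored for an 8-bit string.
            data = initialiser
        return bytearray(data)
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # bytearray itself rejects non-integers and out-of-range values.
    return bytearray(initialiser)


print(list(make_bytes("abc")))
print(list(make_bytes("\u00e9", "utf-8")))
```

Returning early for string initialisers avoids the intermediate list
of ordinals that the pseudo-code builds; that is a convenience of the
sketch, not a statement about how the real object should be built.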

The __repr__ method returns a string that can be evaluated to generate a
new bytes object containing the same sequence of integers.  The sequence
is represented by a list of ints.  For example:

    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'

The object has a decode method equivalent to the decode method of the
str object.  The object has a classmethod fromhex that takes a string of
characters from the set [0-9a-fA-F ] and returns a bytes object (similar
to binascii.unhexlify).  For example:

    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])

The object has a hex method that does the reverse conversion (similar to
binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'
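
The round trip can be tried with existing tools.  The sketch below
assumes Python 3, where bytearray offers fromhex and hex methods of
the same shape, and compares them with the binascii functions
mentioned above.  It illustrates the intended behaviour and is not
the proposed object itself.

```python
import binascii

# Spaces between byte pairs are accepted, as in the examples above.
data = bytearray.fromhex("5c 53 50 ff")
as_ints = list(data)
as_hex = data.hex()

# The binascii functions perform the same conversions.
unhexlified = binascii.unhexlify("5c5350ff")
hexlified = binascii.hexlify(data)

print(as_ints, as_hex, unhexlified == data, hexlified)
```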

The bytes object has methods similar to the list object:

    __add__
    __contains__
    __delitem__
    __delslice__
    __eq__
    __ge__
    __getitem__
    __getslice__
    __gt__
    __hash__
    __iadd__
    __imul__
    __iter__
    __le__
    __len__
    __lt__
    __mul__
    __ne__
    __reduce__
    __reduce_ex__
    __repr__
    __rmul__
    __setitem__
    __setslice__
    append
    count
    extend
    index
    insert
    pop
    remove
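
To give a feel for the list-like interface, the sketch below runs a
few of these operations on Python 3's bytearray, again assumed only
as a stand-in for the object described here.

```python
buf = bytearray([10, 20, 30])

buf.append(40)              # 10 20 30 40
buf.extend([50, 60])        # 10 20 30 40 50 60
buf.insert(0, 5)            # 5 10 20 30 40 50 60
last = buf.pop()            # removes and returns the final element
buf.remove(20)              # 5 10 30 40 50
del buf[0]                  # 10 30 40 50
buf[1:3] = [31, 41]         # slice assignment: 10 31 41 50

position = buf.index(41)
occurrences = buf.count(10)
present = 31 in buf
doubled = buf * 2
joined = buf + bytearray([255])

print(list(buf), last, position, occurrences, present)
```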


Out of scope issues
===================

* If we provide a literal syntax for bytes then it should look distinctly
  different from the syntax for literal strings.  Also, a new type, even
  built-in, is much less drastic than a new literal (which requires
  lexer and parser support in addition to everything else).  Since there
  appears to be no immediate need for a literal representation,
  designing and implementing one is out of the scope of this PEP.

* Python 3k will have a much different I/O subsystem.  Deciding how that
  I/O subsystem will work and interact with the bytes object is out of
  the scope of this PEP.

* It has been suggested that a special method named __bytes__ be added
  to the language to allow objects to be converted into byte arrays.  This
  decision is out of scope.


Unresolved issues
=================

* Perhaps the bytes object should be implemented as an extension module
  until we are more sure of the design (similar to how the set object
  was prototyped).

* Should the bytes object implement the buffer interface?  Probably, but
  we need to look into the implications of that (e.g. regex operations
  on byte arrays).

* Should the object implement __reversed__ and reverse?  Should it
  implement sort?

* Need to clarify what some of the methods do.  How are comparisons
  done?  Hashing?  Pickling and marshalling?
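
To make the buffer-interface and reversal questions concrete, the
sketch below shows what such support looks like on Python 3's
bytearray.  This is an assumption about one possible outcome and not
a decision of this PEP: regex matching over the raw bytes, a writable
memoryview and reversed iteration.

```python
import re

buf = bytearray(b"id=1234;")

# Regex operations work directly on an object exposing its buffer.
match = re.search(rb"\d+", buf)
digits = bytes(match.group())
span = match.span()

# A memoryview shares storage, so writes go through to buf.
view = memoryview(buf)
view[3] = ord("9")

# Reversed iteration yields integers, like forward iteration.
backwards = bytearray(reversed(buf))

print(digits, span, bytes(buf), bytes(backwards))
```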


Questions and answers
=====================

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code.  The construct
   bytes(s.encode(...)) is expensive because it has to copy the byte
   sequence multiple times.  Also, Python generally provides two ways of
   converting an object of type A into an object of type B: ask an A
   instance to convert itself to a B, or ask the type B to create a new
   instance from an A. Depending on what A and B are, both APIs make
   sense; sometimes reasons of decoupling require that A can't know
   about B, in which case you have to use the latter approach; sometimes
   B can't know about A, in which case you have to use the former.
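
   The two conversion styles can be seen side by side in Python 3,
   which is used here only as an illustration of the pattern and not
   as the API proposed in this PEP:

```python
text = "caf\u00e9"

# Ask the A instance to convert itself to a B.
via_method = text.encode("utf-8")

# Ask the type B to create a new instance from an A.
via_constructor = bytes(text, "utf-8")

print(via_method == via_constructor, list(via_constructor))
```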


Q: Why does bytes ignore the encoding argument if the initialiser is a
   str?

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain.  We need to assume that the
   programmer has provided a str object that already uses the desired
   encoding. If you need something other than a pure copy of the bytes
   then you need to first decode the string.  For example:

       bytes(s.decode(encoding1), encoding2)
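
   A sketch of the same decode-then-encode step in Python 3 terms,
   where the 8-bit string is a bytes object.  The sample data and the
   choice of Latin-1 and UTF-8 are assumptions made for illustration:

```python
raw_latin1 = b"caf\xe9"          # "café" encoded as Latin-1

# A pure copy keeps the bytes, and hence the original encoding.
copied = bytes(raw_latin1)

# Changing the encoding requires going through text first.
recoded = bytes(raw_latin1.decode("latin-1"), "utf-8")

print(copied, recoded)
```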


Q: Why not have the encoding argument default to Latin-1 (or some other
   encoding that covers the entire byte range) rather than ASCII?

A: The system default encoding for Python is ASCII.  It seems least
   confusing to use that default.  Also, in Py3k, using Latin-1 as
   the default might not be what users expect.  For example, they might
   prefer a Unicode encoding.  Any default will not always work as
   expected.  At least ASCII will complain loudly if you try to encode
   non-ASCII data.
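
   The difference in failure behaviour is easy to observe.  The sketch
   below uses Python 3 codecs with an arbitrary sample string: Latin-1
   accepts the text and any byte value without complaint, while ASCII
   raises an error as soon as non-ASCII data appears.

```python
text = "na\u00efve"

# Latin-1 encodes the text silently.
latin1_bytes = text.encode("latin-1")

# Latin-1 also decodes every possible byte value without error.
all_bytes_decoded = bytes(range(256)).decode("latin-1")

# ASCII refuses the same text.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(latin1_bytes, len(all_bytes_decoded), ascii_failed)
```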


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End:

