[Python-checkins] r42554 - peps/trunk/pep-0358.txt

Wed Feb 22 21:40:05 CET 2006

Author: neil.schemenauer
Date: Wed Feb 22 21:40:03 2006
New Revision: 42554

Added:
   peps/trunk/pep-0358.txt
Log:
Add 'The "bytes" object' PEP.


Added: peps/trunk/pep-0358.txt
==============================================================================

--- (empty file)
+++ peps/trunk/pep-0358.txt	Wed Feb 22 21:40:03 2006
@@ -0,0 +1,215 @@
+PEP: 358
+Title: The "bytes" object
+Version: $Revision$
+Last-Modified: $Date$
+Author: Neil Schemenauer <nas at arctrix.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 15-Feb-2006
+Python-Version: 2.5
+Post-History:
+
+
+Abstract
+========
+
+This PEP outlines the introduction of a raw bytes sequence object.
+Adding the bytes object is one step in the transition to Unicode based
+str objects.
+
+
+Motivation
+==========
+
+Python's current string objects are overloaded. They serve to hold
+both sequences of characters and sequences of bytes. This overloading
+of purpose leads to confusion and bugs. In future versions of Python,
+string objects will be used for holding character data. The bytes object
+will fulfil the role of a byte container. Eventually the unicode
+built-in will be renamed to str and the str object will be removed.
+
+
+Specification
+=============
+
+A bytes object stores a mutable sequence of integers that are in the
+range 0 to 255.  Unlike string objects, indexing a bytes object returns
+an integer.  Assigning an element using an object that is not an integer
+causes a TypeError exception.  Assigning an element to a value outside
+the range 0 to 255 causes a ValueError exception.  The __len__ method of
+bytes returns the number of integers stored in the sequence (i.e. the
+number of bytes).
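+
+The element rules above can be sketched in pure Python.  This is only
+an illustration under stated assumptions: the class name BytesSketch
+and its _check helper are hypothetical, it stores its data in a plain
+list, and it handles integer indexes only (no slices).
+

```python
class BytesSketch(object):
    """Hypothetical stand-in for the proposed type; not the real API."""

    def __init__(self, values=()):
        # Validate every initial element with the same rules as assignment.
        self._data = [self._check(value) for value in values]

    @staticmethod
    def _check(value):
        # A non-integer element is a TypeError.
        if not isinstance(value, int):
            raise TypeError("an integer is required")
        # An integer outside the range 0 to 255 is a ValueError.
        if not 0 <= value <= 255:
            raise ValueError("value must be in the range 0 to 255")
        return value

    def __getitem__(self, index):
        # Indexing returns an integer, not a one-character string.
        return self._data[index]

    def __setitem__(self, index, value):
        self._data[index] = self._check(value)

    def __len__(self):
        # The number of integers stored, i.e. the number of bytes.
        return len(self._data)
```

+
+With this sketch, BytesSketch([10, 20, 30])[0] is the integer 10, and
+assigning 256 or 'a' to an element raises ValueError or TypeError
+respectively.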
+
+The constructor of the bytes object has the following signature:
+
+    bytes([initialiser[, encoding]])
+
+If no arguments are provided then an object containing zero elements is
+created and returned.  The initialiser argument can be a string or a
+sequence of integers.  The pseudo-code for the constructor is:
+
+    def bytes(initialiser=[], encoding=None):
+        if isinstance(initialiser, basestring):
+            if isinstance(initialiser, unicode):
+                if encoding is None:
+                    encoding = sys.getdefaultencoding()
+                initialiser = initialiser.encode(encoding)
+            initialiser = [ord(c) for c in initialiser]
+        elif encoding is not None:
+            raise TypeError("explicit encoding invalid for non-string "
+                            "initialiser")
+        create bytes object and fill with integers from initialiser
+        return bytes object
+
+The __repr__ method returns a string that can be evaluated to generate a
+new bytes object containing the same sequence of integers.  The sequence
+is represented by a list of ints.  For example:
+
+    >>> repr(bytes([10, 20, 30]))
+    'bytes([10, 20, 30])'
+
+The object has a decode method equivalent to the decode method of the
+str object.  The object has a classmethod fromhex that takes a string of
+characters from the set [0-9a-fA-F ] and returns a bytes object (similar
+to binascii.unhexlify).  For example:
+
+    >>> bytes.fromhex('5c5350ff')
+    bytes([92, 83, 80, 255])
+    >>> bytes.fromhex('5c 53 50 ff')
+    bytes([92, 83, 80, 255])
+
+The object has a hex method that does the reverse conversion (similar to
+binascii.hexlify):
+
+    >>> bytes([92, 83, 80, 255]).hex()
+    '5c5350ff'
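+
+The two hex conversions can be approximated with the binascii
+functions mentioned above.  This is a rough sketch, not the proposed
+implementation: the names fromhex_sketch and hex_sketch are
+hypothetical, plain lists of integers stand in for the bytes object,
+and it assumes an interpreter that provides bytearray and whose
+binascii.unhexlify accepts a text string.
+

```python
import binascii


def fromhex_sketch(hexstr):
    """Turn a string of hex digits (spaces allowed) into a list of ints."""
    compact = hexstr.replace(" ", "")   # spaces are permitted separators
    raw = binascii.unhexlify(compact)   # raises on odd length or non-hex
    return list(bytearray(raw))         # one integer per byte


def hex_sketch(values):
    """Reverse conversion: a sequence of ints to a lowercase hex string."""
    return binascii.hexlify(bytearray(values)).decode("ascii")
```

+
+The round trip hex_sketch(fromhex_sketch(s)) gives back s with the
+spaces removed and the digits in lower case.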
+
+The bytes object has methods similar to the list object:
+
+    __add__
+    __contains__
+    __delitem__
+    __delslice__
+    __eq__
+    __ge__
+    __getitem__
+    __getslice__
+    __gt__
+    __hash__
+    __iadd__
+    __imul__
+    __iter__
+    __le__
+    __len__
+    __lt__
+    __mul__
+    __ne__
+    __reduce__
+    __reduce_ex__
+    __repr__
+    __rmul__
+    __setitem__
+    __setslice__
+    append
+    count
+    extend
+    index
+    insert
+    pop
+    remove
+
+
+Out of scope issues
+===================
+
+* If we provide a literal syntax for bytes then it should look distinctly
+  different from the syntax for literal strings.  Also, a new type, even
+  built-in, is much less drastic than a new literal (which requires
+  lexer and parser support in addition to everything else).  Since there
+  appears to be no immediate need for a literal representation,
+  designing and implementing one is out of the scope of this PEP.
+
+* Python 3k will have a much different I/O subsystem.  Deciding how that
+  I/O subsystem will work and interact with the bytes object is out of
+  the scope of this PEP.
+
+* It has been suggested that a special method named __bytes__ be added
+  to the language to allow objects to be converted into byte arrays.
+  This decision is out of scope.
+
+
+Unresolved issues
+=================
+
+* Perhaps the bytes object should be implemented as an extension module
+  until we are more sure of the design (similar to how the set object
+  was prototyped).
+
+* Should the bytes object implement the buffer interface?  Probably, but
+  we need to look into the implications of that (e.g. regex operations
+  on byte arrays).
+
+* Should the object implement __reversed__ and reverse?  Should it
+  implement sort?
+
+* Need to clarify what some of the methods do.  How are comparisons
+  done?  Hashing?  Pickling and marshalling?
+
+
+Questions and answers
+=====================
+
+Q: Why have the optional encoding argument when the encode method of
+   Unicode objects does the same thing?
+
+A: In the current version of Python, the encode method returns a str
+   object and we cannot change that without breaking code.  The construct
+   bytes(s.encode(...)) is expensive because it has to copy the byte
+   sequence multiple times.  Also, Python generally provides two ways of
+   converting an object of type A into an object of type B: ask an A
+   instance to convert itself to a B, or ask the type B to create a new
+   instance from an A. Depending on what A and B are, both APIs make
+   sense; sometimes reasons of decoupling require that A can't know
+   about B, in which case you have to use the latter approach; sometimes
+   B can't know about A, in which case you have to use the former.
+
+
+Q: Why does bytes ignore the encoding argument if the initialiser is a
+   str?
+
+A: There is no sane meaning that the encoding can have in that case.
+   str objects *are* byte arrays and they know nothing about the
+   encoding of character data they contain.  We need to assume that the
+   programmer has provided a str object that already uses the desired
+   encoding. If you need something other than a pure copy of the bytes
+   then you need to first decode the string.  For example:
+
+       bytes(s.decode(encoding1), encoding2)
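+
+   The decode-then-encode step can be tried with ordinary byte
+   strings.  In this sketch the function name transcode is
+   hypothetical and a byte string stands in for the bytes object that
+   the constructor call above would return.
+

```python
def transcode(raw, source_encoding, target_encoding):
    """Re-encode a byte string: decode to characters, then encode again."""
    text = raw.decode(source_encoding)     # bytes -> characters
    return text.encode(target_encoding)    # characters -> bytes
```

+
+   For example, transcode(b'\xe9', 'latin-1', 'utf-8') turns the
+   single Latin-1 byte for e-acute into the two-byte UTF-8 sequence
+   b'\xc3\xa9'.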
+
+
+Q: Why not have the encoding argument default to Latin-1 (or some other
+   encoding that covers the entire byte range) rather than ASCII?
+
+A: The system default encoding for Python is ASCII.  It seems least
+   confusing to use that default.  Also, in Py3k, using Latin-1 as
+   the default might not be what users expect.  For example, they might
+   prefer a Unicode encoding.  Any default will not always work as
+   expected.  At least ASCII will complain loudly if you try to encode
+   non-ASCII data.
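+
+   The "complain loudly" behaviour is the UnicodeEncodeError raised
+   by the codec.  The helper below is a hypothetical illustration
+   (can_encode is not a proposed API) that reports whether a given
+   Unicode string survives a given encoding.
+

```python
def can_encode(text, encoding="ascii"):
    """Return True if text encodes cleanly, False if the codec objects."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # This is the loud failure: the data is not silently mangled.
        return False
    return True
```

+
+   A string containing e-acute fails with the ASCII default but
+   succeeds once an encoding such as Latin-1 or UTF-8 is named
+   explicitly.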
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   End: