[Python-checkins] r42554 - peps/trunk/pep-0358.txt
neil.schemenauer
python-checkins at python.org
Wed Feb 22 21:40:05 CET 2006
Author: neil.schemenauer
Date: Wed Feb 22 21:40:03 2006
New Revision: 42554
Added:
peps/trunk/pep-0358.txt
Log:
Add 'The "bytes" object' PEP.
Added: peps/trunk/pep-0358.txt
==============================================================================
--- (empty file)
+++ peps/trunk/pep-0358.txt Wed Feb 22 21:40:03 2006
@@ -0,0 +1,215 @@
+PEP: 358
+Title: The "bytes" object
+Version: $Revision$
+Last-Modified: $Date$
+Author: Neil Schemenauer <nas at arctrix.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 15-Feb-2006
+Python-Version: 2.5
+Post-History:
+
+
+Abstract
+========
+
+This PEP outlines the introduction of a raw bytes sequence object.
+Adding the bytes object is one step in the transition to Unicode-based
+str objects.
+
+
+Motivation
+==========
+
+Python's current string objects are overloaded. They serve to hold
+both sequences of characters and sequences of bytes. This overloading
+of purpose leads to confusion and bugs. In future versions of Python,
+string objects will be used for holding character data. The bytes object
+will fulfil the role of a byte container. Eventually the unicode
+built-in will be renamed to str and the str object will be removed.
+
+
+Specification
+=============
+
+A bytes object stores a mutable sequence of integers that are in the
+range 0 to 255. Unlike string objects, indexing a bytes object returns
+an integer. Assigning an element using an object that is not an integer
+causes a TypeError exception. Assigning an element to a value outside
+the range 0 to 255 causes a ValueError exception. The __len__ method of
+bytes returns the number of integers stored in the sequence (i.e. the
+number of bytes).
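+
+As a rough illustration of these rules, the sketch below uses the
+bytearray type of later Python 3 releases as a stand-in. That
+substitution is an assumption made for this example only; the object
+specified here may differ in detail.
+
```python
# Assumption: Python 3's bytearray stands in for the mutable bytes
# object described above; the PEP's own type may differ in detail.
data = bytearray([10, 20, 30])

first = data[0]        # indexing yields an int, not a 1-character string
length = len(data)     # number of integers (bytes) stored

data[1] = 255          # assigning an in-range integer succeeds

try:
    data[0] = "a"      # element that is not an integer
except TypeError as exc:
    type_error = exc

try:
    data[0] = 256      # integer outside the range 0 to 255
except ValueError as exc:
    value_error = exc
```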
+
+The constructor of the bytes object has the following signature:
+
+ bytes([initialiser[, encoding]])
+
+If no arguments are provided then an object containing zero elements is
+created and returned. The initialiser argument can be a string or a
+sequence of integers. The pseudo-code for the constructor is:
+
+ def bytes(initialiser=[], encoding=None):
+     if isinstance(initialiser, basestring):
+         if isinstance(initialiser, unicode):
+             if encoding is None:
+                 encoding = sys.getdefaultencoding()
+             initialiser = initialiser.encode(encoding)
+         initialiser = [ord(c) for c in initialiser]
+     elif encoding is not None:
+         raise TypeError("explicit encoding invalid for non-string "
+                         "initialiser")
+     create bytes object and fill with integers from initialiser
+     return bytes object
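+
+The constructor logic can also be tried out as runnable code. The
+sketch below is one possible rendering in Python 3, not part of this
+proposal: the name make_bytes is hypothetical, Python 3's str takes the
+place of unicode, bytes takes the place of the old str, and bytearray
+is assumed as the container. Note that sys.getdefaultencoding()
+returns 'utf-8' on Python 3 rather than 'ascii'.
+
```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical Python 3 rendering of the constructor pseudo-code."""
    if isinstance(initialiser, str):
        # Character data (Python 2 "unicode"): encode it first.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        initialiser = initialiser.encode(encoding)
    elif isinstance(initialiser, (bytes, bytearray)):
        # Byte string (Python 2 "str"): copied as-is, encoding ignored.
        pass
    elif encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # bytearray itself checks that every item is an int in range(256).
    return bytearray(initialiser)


empty = make_bytes()
from_text = make_bytes("abc", "ascii")
from_ints = make_bytes([10, 20, 30])
from_raw = make_bytes(b"\xff", "ascii")   # encoding is ignored here
```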
+
+The __repr__ method returns a string that can be evaluated to generate a
+new bytes object containing the same sequence of integers. The sequence
+is represented by a list of ints. For example:
+
+ >>> repr(bytes([10, 20, 30]))
+ 'bytes([10, 20, 30])'
+
+The object has a decode method equivalent to the decode method of the
+str object. The object has a classmethod fromhex that takes a string of
+characters from the set [0-9a-fA-F ] and returns a bytes object (similar
+to binascii.unhexlify). For example:
+
+ >>> bytes.fromhex('5c5350ff')
+ bytes([92, 83, 80, 255])
+ >>> bytes.fromhex('5c 53 50 ff')
+ bytes([92, 83, 80, 255])
+
+The object has a hex method that does the reverse conversion (similar to
+binascii.hexlify):
+
+ >>> bytes([92, 83, 80, 255]).hex()
+ '5c5350ff'
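+
+The round trip between hex text and byte values can be checked with a
+short sketch. It assumes Python 3.5 or later, where bytearray offers
+fromhex and hex methods of the same names; the binascii calls are
+included only for comparison.
+
```python
import binascii

# Assumption: bytearray stands in for the proposed bytes object.
packed = bytearray.fromhex("5c5350ff")
spaced = bytearray.fromhex("5c 53 50 ff")   # spaces between pairs are allowed

values = list(packed)        # the stored integers
hex_text = packed.hex()      # reverse conversion back to hex digits

# The binascii equivalents work on immutable byte strings.
via_binascii = binascii.unhexlify("5c5350ff")
back_again = binascii.hexlify(via_binascii)
```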
+
+The bytes object has methods similar to the list object:
+
+ __add__
+ __contains__
+ __delitem__
+ __delslice__
+ __eq__
+ __ge__
+ __getitem__
+ __getslice__
+ __gt__
+ __hash__
+ __iadd__
+ __imul__
+ __iter__
+ __le__
+ __len__
+ __lt__
+ __mul__
+ __ne__
+ __reduce__
+ __reduce_ex__
+ __repr__
+ __rmul__
+ __setitem__
+ __setslice__
+ append
+ count
+ extend
+ index
+ insert
+ pop
+ remove
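+
+A few of the list-like methods are exercised below. This is only a
+sketch against Python 3's bytearray, assumed here as a stand-in; the
+name buf is illustrative.
+
```python
# Assumption: bytearray stands in for the proposed bytes object.
buf = bytearray([1, 2, 3])

buf.append(4)            # add one integer at the end
buf.extend([5, 2])       # add several integers
buf.insert(0, 0)         # insert the value 0 at position 0
last = buf.pop()         # remove and return the final element
buf.remove(3)            # delete the first occurrence of the value 3

twos = buf.count(2)      # occurrences of the value 2
where = buf.index(4)     # position of the value 4
has_five = 5 in buf      # __contains__
doubled = buf * 2        # __mul__
joined = buf + bytearray([9])   # __add__
```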
+
+
+Out of scope issues
+===================
+
+* If we provide a literal syntax for bytes then it should look distinctly
+ different from the syntax for literal strings. Also, a new type, even
+ built-in, is much less drastic than a new literal (which requires
+ lexer and parser support in addition to everything else). Since there
+ appears to be no immediate need for a literal representation,
+ designing and implementing one is out of the scope of this PEP.
+
+* Python 3k will have a much different I/O subsystem. Deciding how that
+ I/O subsystem will work and interact with the bytes object is out of
+ the scope of this PEP.
+
+* It has been suggested that a special method named __bytes__ be added
+ to the language to allow objects to be converted into byte arrays. This
+ decision is out of scope.
+
+
+Unresolved issues
+=================
+
+* Perhaps the bytes object should be implemented as an extension module
+ until we are more sure of the design (similar to how the set object
+ was prototyped).
+
+* Should the bytes object implement the buffer interface? Probably, but
+ we need to look into the implications of that (e.g. regex operations
+ on byte arrays).
+
+* Should the object implement __reversed__ and reverse? Should it
+ implement sort?
+
+* Need to clarify what some of the methods do. How are comparisons
+ done? Hashing? Pickling and marshalling?
+
+
+Questions and answers
+=====================
+
+Q: Why have the optional encoding argument when the encode method of
+ Unicode objects does the same thing?
+
+A: In the current version of Python, the encode method returns a str
+ object and we cannot change that without breaking code. The construct
+ bytes(s.encode(...)) is expensive because it has to copy the byte
+ sequence multiple times. Also, Python generally provides two ways of
+ converting an object of type A into an object of type B: ask an A
+ instance to convert itself to a B, or ask the type B to create a new
+ instance from an A. Depending on what A and B are, both APIs make
+ sense; sometimes reasons of decoupling require that A can't know
+ about B, in which case you have to use the latter approach; sometimes
+ B can't know about A, in which case you have to use the former.
+
+
+Q: Why does bytes ignore the encoding argument if the initialiser is a
+ str?
+
+A: There is no sane meaning that the encoding can have in that case.
+ str objects *are* byte arrays and they know nothing about the
+ encoding of character data they contain. We need to assume that the
+ programmer has provided a str object that already uses the desired
+ encoding. If you need something other than a pure copy of the bytes
+ then you need to first decode the string. For example:
+
+ bytes(s.decode(encoding1), encoding2)
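+
+ The difference between a pure copy and a decode-then-encode step can
+ be seen in the sketch below. It assumes Python 3 names: a bytes
+ literal plays the role of the str object, and bytearray plays the
+ role of the proposed type. The two encodings are example choices.
+
```python
# Assumption: Python 3 names. "raw" plays the role of the old str (a
# byte string) and bytearray plays the role of the proposed bytes type.
encoding1 = "latin-1"
encoding2 = "utf-8"

raw = b"caf\xe9"                  # "cafe" with e-acute, encoded as Latin-1

# Pure copy: the byte values are taken over unchanged.
plain_copy = bytearray(raw)

# Transcoding: decode to character data first, then encode again.
transcoded = bytearray(raw.decode(encoding1), encoding2)
```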
+
+
+Q: Why not have the encoding argument default to Latin-1 (or some other
+ encoding that covers the entire byte range) rather than ASCII?
+
+A: The system default encoding for Python is ASCII. It seems least
+ confusing to use that default. Also, in Py3k, using Latin-1 as
+ the default might not be what users expect. For example, they might
+ prefer a Unicode encoding. Any default will not always work as
+ expected. At least ASCII will complain loudly if you try to encode
+ non-ASCII data.
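+
+ The "complain loudly" behaviour is easy to observe. The sketch below
+ uses Python 3's str.encode with an example string; it only shows
+ how the three codecs treat one non-ASCII character and says nothing
+ about which default is best.
+
```python
text = "caf\u00e9"                # contains one non-ASCII character

# ASCII refuses the character instead of guessing.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Latin-1 and UTF-8 both succeed, but produce different bytes.
latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")
```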
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+ Local Variables:
+ mode: indented-text
+ indent-tabs-mode: nil
+ sentence-end-double-space: t
+ fill-column: 70
+ End: