[Python-checkins] r53862 - peps/trunk/pep-0358.txt
guido.van.rossum
python-checkins at python.org
Fri Feb 23 05:31:19 CET 2007
Author: guido.van.rossum
Date: Fri Feb 23 05:31:15 2007
New Revision: 53862
Modified:
peps/trunk/pep-0358.txt
Log:
Another update, clarifying (I hope) the method signatures and mentioning
other stuff that came up over dinner.
Modified: peps/trunk/pep-0358.txt
==============================================================================
--- peps/trunk/pep-0358.txt (original)
+++ peps/trunk/pep-0358.txt Fri Feb 23 05:31:15 2007
@@ -13,9 +13,16 @@
Abstract
- This PEP outlines the introduction of a raw bytes sequence object.
- Adding the bytes object is one step in the transition to Unicode
- based str objects.
+ This PEP outlines the introduction of a raw bytes sequence type.
+ Adding the bytes type is one step in the transition to
+ Unicode-based str objects, which will be introduced in Python 3.0.
+
+ The PEP describes how the bytes type should work in Python 2.6, as
+ well as how it should work in Python 3.0. (Occasionally there are
+ differences because in Python 2.6, we have two string types, str
+ and unicode, while in Python 3.0 we will only have one string
+ type, whose name will be str but whose semantics will be like the
+ 2.6 unicode type.)
Motivation
@@ -33,39 +40,48 @@
A bytes object stores a mutable sequence of integers that are in
the range 0 to 255. Unlike string objects, indexing a bytes
- object returns an integer. Assigning an element using a object
- that is not an integer causes a TypeError exception. Assigning an
- element to a value outside the range 0 to 255 causes a ValueError
- exception. The .__len__() method of bytes returns the number of
- integers stored in the sequence (i.e. the number of bytes).
+ object returns an integer. Assigning or comparing an object that
+ is not an integer to an element causes a TypeError exception.
+ Assigning an element to a value outside the range 0 to 255 causes
+ a ValueError exception. The .__len__() method of bytes returns
+ the number of integers stored in the sequence (i.e. the number of
+ bytes).
The constructor of the bytes object has the following signature:
- bytes([initialiser[, [encoding]])
+ bytes([initializer[, encoding]])
- If no arguments are provided then an object containing zero elements
- is created and returned. The initialiser argument can be a string,
- a sequence of integers, or a single integer. The pseudo-code for the
- constructor is:
-
-     def bytes(initialiser=[], encoding=None):
-         if isinstance(initialiser, int): # In 2.6, (int, long)
-             initialiser = [0]*initialiser
-         elif isinstance(initialiser, basestring):
-             if isinstance(initialiser, unicode): # In 3.0, always
+ If no arguments are provided then a bytes object containing zero
+ elements is created and returned. The initializer argument can be
+ a string (in 2.6, either str or unicode), an iterable of integers,
+ or a single integer. The pseudo-code for the constructor
+ (optimized for clear semantics, not for speed) is:
+
+     def bytes(initializer=0, encoding=None):
+         if isinstance(initializer, int): # In 2.6, (int, long)
+             initializer = [0]*initializer
+         elif isinstance(initializer, basestring):
+             if isinstance(initializer, unicode): # In 3.0, always
                  if encoding is None:
                      # In 3.0, raise TypeError("explicit encoding required")
                      encoding = sys.getdefaultencoding()
-                 initialiser = initialiser.encode(encoding)
-             initialiser = [ord(c) for c in initialiser]
+                 initializer = initializer.encode(encoding)
+             initializer = [ord(c) for c in initializer]
          else:
              if encoding is not None:
-                 raise TypeError("explicit encoding invalid for non-string "
-                                 "initialiser")
-         # Create bytes object and fill with integers from initialiser
-         # while ensuring each integer is in range(256); initialiser
-         # can be any iterable
-         return bytes object
+                 raise TypeError("no encoding allowed for this initializer")
+             tmp = []
+             for c in initializer:
+                 if not isinstance(c, int):
+                     raise TypeError("initializer must be iterable of ints")
+                 if not 0 <= c < 256:
+                     raise ValueError("initializer element out of range")
+                 tmp.append(c)
+             initializer = tmp
+         new = <new bytes object of length len(initializer)>
+         for i, c in enumerate(initializer):
+             new[i] = c
+         return new
The .__repr__() method returns a string that can be evaluated to
generate a new bytes object containing the same sequence of
@@ -76,13 +92,10 @@
'bytes([0x0a, 0x14, 0x1e])'
The object has a .decode() method equivalent to the .decode()
- method of the str object. (This is redundant since it can also be
- decoded by calling unicode(b, <encoding>) (in 2.6) or str(b,
- <encoding>) (in 3.0); do we need encode/decode methods? In a
- sense the spelling using a constructor is cleaner.) The object
- has a classmethod .fromhex() that takes a string of characters
- from the set [0-9a-zA-Z ] and returns a bytes object (similar to
- binascii.unhexlify). For example:
+ method of the str object. The object has a classmethod .fromhex()
+ that takes a string of characters from the set [0-9a-fA-F ] and
+ returns a bytes object (similar to binascii.unhexlify). For
+ example:
>>> bytes.fromhex('5c5350ff')
bytes([92, 83, 80, 255])
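
The .fromhex() behaviour described in the hunk above can be sketched in present-day Python. This is only an illustration, not the PEP's implementation: the helper name fromhex_sketch is hypothetical, and bytearray is assumed here as the closest existing stand-in for the mutable type the PEP proposes.

```python
import string


def fromhex_sketch(text):
    """Hypothetical helper: turn pairs of hex digits into a bytearray.

    Spaces are ignored, mirroring the "[0-9a-fA-F ]" character set.
    """
    digits = text.replace(" ", "")
    # Reject anything that is not a hex digit once spaces are removed.
    if any(ch not in string.hexdigits for ch in digits):
        raise ValueError("non-hexadecimal character in input")
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    # Each pair of digits becomes one integer in range(256).
    return bytearray(int(digits[i:i + 2], 16)
                     for i in range(0, len(digits), 2))


result = fromhex_sketch("5c5350ff")
print(list(result))  # [92, 83, 80, 255]
```

The explicit character check is a design choice of this sketch: int(..., 16) alone would also accept a leading sign, which the character set in the text does not allow.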
@@ -96,102 +109,118 @@
'5c5350ff'
The bytes object has some methods similar to list methods, and
- others similar to str methods:
+ others similar to str methods. Here is a complete list of
+ methods, with their approximate signatures:
- __add__
- __contains__ (with int arg, like list; with bytes arg, like str)
- __delitem__
- __delslice__
- __eq__
- __ge__
- __getitem__
- __getslice__
- __gt__
- __iadd__
- __imul__
- __iter__
- __le__
- __len__
- __lt__
- __mul__
- __ne__
- __reduce__
- __reduce_ex__
- __repr__
- __reversed__
- __rmul__
- __setitem__
- __setslice__
- append
- count
- decode
- endswith
- extend
- find
- index
- insert
- join
- partition
- pop
- remove
- replace
- rindex
- rpartition
- split
- startswith
- reverse
- rfind
- rindex
- rsplit
- translate
+ .__add__(bytes) -> bytes
+ .__contains__(int | bytes) -> bool
+ .__delitem__(int | slice) -> None
+ .__delslice__(int, int) -> None
+ .__eq__(bytes) -> bool
+ .__ge__(bytes) -> bool
+ .__getitem__(int | slice) -> int | bytes
+ .__getslice__(int, int) -> bytes
+ .__gt__(bytes) -> bool
+ .__iadd__(bytes) -> bytes
+ .__imul__(int) -> bytes
+ .__iter__() -> iterator
+ .__le__(bytes) -> bool
+ .__len__() -> int
+ .__lt__(bytes) -> bool
+ .__mul__(int) -> bytes
+ .__ne__(bytes) -> bool
+ .__reduce__(...) -> ...
+ .__reduce_ex__(...) -> ...
+ .__repr__() -> str
+ .__reversed__() -> bytes
+ .__rmul__(int) -> bytes
+ .__setitem__(int | slice, int | iterable[int]) -> None
+ .__setslice__(int, int, iterable[int]) -> None
+ .append(int) -> None
+ .count(int) -> int
+ .decode(str) -> str | unicode # in 3.0, only str
+ .endswith(bytes) -> bool
+ .extend(iterable[int]) -> None
+ .find(bytes) -> int
+ .index(bytes | int) -> int
+ .insert(int, int) -> None
+ .join(iterable[bytes]) -> bytes
+ .partition(bytes) -> (bytes, bytes, bytes)
+ .pop([int]) -> int
+ .remove(int) -> None
+ .replace(bytes, bytes) -> bytes
+ .rindex(bytes | int) -> int
+ .rpartition(bytes) -> (bytes, bytes, bytes)
+ .split(bytes) -> list[bytes]
+ .startswith(bytes) -> bool
+ .reverse() -> None
+ .rfind(bytes) -> int
+ .rindex(bytes | int) -> int
+ .rsplit(bytes) -> list[bytes]
+ .translate(bytes, [bytes]) -> bytes
Note the conspicuous absence of .isupper(), .upper(), and friends.
- There is no __hash__ because the object is mutable. There is no
- usecase for a .sort() method.
+ (But see "Open Issues" below.) There is no .__hash__() because
+ the object is mutable. There is no use case for a .sort() method.
- The bytes also supports the buffer interface, supporting reading
- and writing binary (but not character) data.
+ The bytes type also supports the buffer interface, supporting
+ reading and writing binary (but not character) data.
-Out of scope issues
+Out of Scope Issues
- * If we provide a literal syntax for bytes then it should look
- distinctly different than the syntax for literal strings. Also, a
- new type, even built-in, is much less drastic than a new literal
- (which requires lexer and parser support in addition to everything
- else). Since there appears to be no immediate need for a literal
- representation, designing and implementing one is out of the scope
- of this PEP. (Hmm... A b"..." literal accepting only ASCII
- values is likely to be added to 3.0; not clear about 2.6. This
- needs a PEP.)
-
- * Python 3k will have a much different I/O subsystem. Deciding how
- that I/O subsystem will work and interact with the bytes object is
- out of the scope of this PEP.
+ * Python 3k will have a much different I/O subsystem. Deciding
+ how that I/O subsystem will work and interact with the bytes
+ object is out of the scope of this PEP. The expectation however
+ is that binary I/O will read and write bytes, while text I/O
+ will read strings. Since the bytes type supports the buffer
+ interface, the existing binary I/O operations in Python 2.6 will
+ support bytes objects.
- * It has been suggested that a special method named __bytes__ be
- added to language to allow objects to be converted into byte
+ * It has been suggested that a special method named .__bytes__()
+ be added to the language to allow objects to be converted into byte
arrays. This decision is out of scope.
-Unresolved issues
+Open Issues
- * Need to specify the methods more carefully.
+ * The .decode() method is redundant since a bytes object b can
+ also be decoded by calling unicode(b, <encoding>) (in 2.6) or
+ str(b, <encoding>) (in 3.0). Do we need encode/decode methods
+ at all? In a sense the spelling using a constructor is cleaner.
+
+ * Need to specify the methods still more carefully.
+
+ * Pickling and marshalling support needs to be specified.
* Should all those list methods really be implemented?
+ * There is growing support for a b"..." literal. Here's a brief
+ spec. Each invocation of b"..." produces a new bytes object
+ (this is unlike "..." but similar to [...] and {...}). Inside
+ the literal, only ASCII characters and non-Unicode backslash
+ escapes are allowed; non-ASCII characters not specified as
+ escapes are rejected by the compiler regardless of the source
+ encoding. The resulting object's value is the same as if
+ bytes(map(ord, "...")) were called.
+
* A case could be made for supporting .ljust(), .rjust(),
.center() with a mandatory second argument.
* A case could be made for supporting .split() with a mandatory
argument.
- * How should pickling and marshalling work?
-
- * I probably forgot a few things.
+ * A case could even be made for supporting .islower(), .isupper(),
+ .isspace(), .isalpha(), .isalnum(), .isdigit() and the
+ corresponding conversions (.lower() etc.), using the ASCII
+ definitions for letters, digits and whitespace. If this is
+ accepted, the cases for .ljust(), .rjust(), .center() and
+ .split() become much stronger, and they should have default
+ arguments as well, using an ASCII space or all ASCII whitespace
+ (for .split()).
-Questions and answers
+Frequently Asked Questions
Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?
@@ -209,7 +238,7 @@
in which case you have to use the former.
- Q: Why does bytes ignore the encoding argument if the initialiser is
+ Q: Why does bytes ignore the encoding argument if the initializer is
a str? (This only applies to 2.6.)
A: There is no sane meaning that the encoding can have in that case.
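
For readers who want to try the constructor rules from the pseudo-code in this diff, here is a minimal runnable sketch that follows the Python 3.0 branch only (a string initializer always requires an explicit encoding). The function name make_bytes is hypothetical, and bytearray is assumed as a stand-in for the proposed mutable type. Like the pseudo-code, it is written for clear semantics, not speed.

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical stand-in for the proposed constructor (3.0 rules)."""
    if isinstance(initializer, int):
        # A single integer means "this many zero bytes".
        values = [0] * initializer
    elif isinstance(initializer, str):
        if encoding is None:
            raise TypeError("explicit encoding required")
        # Iterating over encoded data yields integers in Python 3.
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # Allocate a zero-filled buffer, then copy the validated integers in.
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new


print(list(make_bytes([10, 20, 30])))  # [10, 20, 30]
print(len(make_bytes(4)))              # 4
```

As in the pseudo-code, an encoding passed with an integer initializer is silently ignored, and only the generic-iterable branch rejects it.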