[Python-ideas] Python 3.x and bytes
Terry Reedy
tjreedy at udel.edu
Thu May 19 05:10:01 CEST 2011
On 5/18/2011 4:10 PM, Ethan Furman wrote:
> As those who have to work with byte strings know, when retrieving a
> single character from a byte string, what you get back is not a byte
> string, but an int -- a rather important distinction from unicode
> strings (str).
For all sequences, slicing (if it works at all) returns a subsequence
(possibly of length 0, which is why slicing can work with out-of-bounds
slice points). For all (built-in) sequences except for strings, indexing
returns a member of the sequence (which is why it raises an exception
for out-of-bounds indexes). Leaving aside extension and user-defined
sequences, strings are unique in instead returning a length-1
subsequence. So bytes are normal while strings are anomalous!
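
A minimal sketch of the distinction, assuming Python 3; the variable names are only illustrative. Slicing gives back a subsequence for every type here, while indexing gives back a member for bytes and lists but a length-1 string for str:

```python
# Indexing vs. slicing for a few built-in sequences (Python 3).
data = b"abc"
text = "abc"
items = [1, 2, 3]

byte_item = data[0]       # an int: a member of the sequence
byte_slice = data[0:1]    # a bytes object of length 1
char_item = text[0]       # a str of length 1, not a separate char type
char_slice = text[0:1]    # also a str of length 1
list_item = items[0]      # a member of the list
list_slice = items[0:1]   # a list of length 1

# Out-of-bounds slice points are allowed and give an empty subsequence.
empty_slice = data[10:20]

# An out-of-bounds index raises instead.
try:
    data[10]
    index_error_raised = False
except IndexError:
    index_error_raised = True

print(byte_item, byte_slice, char_item, list_item, list_slice)
print(empty_slice, index_error_raised)
```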
Why that anomaly? The immediate reason is that Python does not have a
separate character type. Why not? Guido might best answer (but he might
say 'my gut instinct'), but I can think of a few reasons.
1. That is how it is in the (math) theory of strings. 'A' is both a char
and a string of length one. There is no separate 'char' type that cannot
be added (concatenated) to other strings of whatever length.
2. (Related) This pragmatically works best for Python.
3. Python follows Occam's principle by not introducing types without
necessity. And a separate char type is not *necessary*.
4. Text strings are homogeneous arrays (like the arrays in the array
module), unlike heterogeneous tuples and lists. So they need not be
sequences of Python objects, and for efficiency, would not be even if
there were a character type. Like other arrays, they contain the
information needed to produce Python objects on demand without actually
containing such objects in the way tuples, lists, and dicts do.
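
Reasons 1 and 4 can be illustrated roughly as follows. This is only a sketch using the standard array module, with made-up variable names, and it says nothing about how str is actually implemented internally:

```python
from array import array

# Reason 1: indexing a str gives a str, so the result concatenates
# with other strings of any length.
ch = "ABC"[0]
joined = ch + "BC"

# Indexing bytes gives an int, which cannot be concatenated to bytes.
try:
    b"ABC"[0] + b"BC"
    concat_failed = False
except TypeError:
    concat_failed = True

# Reason 4: an array stores raw homogeneous values, not Python objects.
# A float object is produced on demand when an item is requested.
values = array("d", [1.5, 2.5, 3.5])
first = values[0]
raw_size = values.itemsize * len(values)   # bytes of raw storage

print(joined, concat_failed, first, raw_size)
```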
I do, however, understand the tendency to think of bytes as strings
because of both Python's history and the remnant string interface.
For people using non-Latin (non-ASCII) alphabets, the 'convenience' of
replacing some bytes with ASCII chars might be less convenient.
--
Terry Jan Reedy