
On 5/18/2011 4:10 PM, Ethan Furman wrote:
> As those who have to work with byte strings know, when retrieving a
> single character from a byte string, what you get back is not a byte
> string, but an int -- a rather important distinction from unicode
> strings (str).
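
A quick illustration of the quoted behavior, in a Python 3 session:

>>> b'abc'[0]       # indexing bytes gives an int
97
>>> 'abc'[0]        # indexing str gives a length-1 str
'a'
>>> b'abc'[0:1]     # slicing bytes gives a bytes object
b'a'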
For all sequences, slicing (if it works at all) returns a subsequence, possibly of length 0, which is why slicing can work with out-of-bounds slice points. For all (built-in) sequences except strings, indexing returns a member of the sequence, which is why indexing raises an exception for out-of-bounds indexes. Leaving aside extension and user-defined sequences, strings are unique in instead returning a length-1 subsequence. So bytes are normal, while strings are anomalous! (The interpreter session at the end of this message illustrates both behaviors.)

Why that anomaly? The immediate reason is that Python does not have a separate character type. Why not? Guido might best answer (though he might say 'my gut instinct'), but I can think of a few reasons.

1. That is how it is in the (math) theory of strings. 'A' is both a char and a string of length one. There is no separate 'char' type that cannot be added (concatenated) to other strings of whatever length.

2. (Related) This pragmatically works best for Python.

3. Python follows Occam's principle by not introducing types without necessity. And a separate char type is not *necessary*.

4. Text strings are homogeneous arrays (like the arrays in the array module), unlike heterogeneous tuples and lists. So they need not be sequences of Python objects, and for efficiency, they would not be even if there were a character type. Like other arrays, they contain the information needed to produce Python objects on demand, without actually containing such objects in the way tuples, lists, and dicts do.

I do, however, understand the tendency to think of bytes as strings, given both Python's history and the remnant string interface. For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii chars might be less convenient.

--
Terry Jan Reedy
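
A short interpreter session (Python 3) showing the two points above: out-of-bounds slice points are tolerated while out-of-bounds indexes are not, and arrays, like strings, produce Python objects on demand from packed storage:

>>> b'abc'[5:10]          # out-of-bounds slice points: empty subsequence
b''
>>> 'abc'[5:10]
''
>>> b'abc'[5]             # out-of-bounds index: exception
Traceback (most recent call last):
  ...
IndexError: index out of range
>>> from array import array
>>> a = array('i', [1, 2, 3])
>>> a[0]                  # an int object created on demand from the buffer
1
>>> a[1:3]                # slicing returns a subsequence (another array)
array('i', [2, 3])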