
On Wed, May 18, 2011 at 11:10 PM, Terry Reedy <tjreedy@udel.edu> wrote:
For all sequences, slicing (if it works at all) returns a subsequence (possibly of length 0, which is why slicing can work with out-of-bounds slice points). For all (built-in) sequences except for strings, indexing returns a member of the sequence (which is why it raises an exception for out-of-bounds indexes). Leaving aside extension and user-defined sequences, strings are unique in instead returning a length-1 subsequence. So bytes are normal while strings are anomalous!
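The slicing and indexing behaviour described above can be checked with a short sketch. It assumes Python 3; the `samples` and `results` names are made up for illustration.

```python
# Compare indexing and slicing on three built-in sequence types (Python 3).
samples = {"list": [10, 20, 30], "bytes": b"abc", "str": "abc"}
results = {}

for name, seq in samples.items():
    # Out-of-bounds slice points yield an empty sequence of the same type.
    assert seq[10:20] == type(seq)()

    # Out-of-bounds indexes raise IndexError.
    try:
        seq[10]
    except IndexError:
        raised = True
    else:
        raised = False
    assert raised

    # Record (indexed item, length-1 slice) for comparison.
    results[name] = (seq[0], seq[0:1])

# list and bytes: the indexed item is a member, not a subsequence.
assert results["list"] == (10, [10])
assert results["bytes"] == (97, b"a")
# str: the indexed item equals the length-1 slice.
assert results["str"] == ("a", "a")
```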
I don't see the necessity of saying that length-1 strings aren't members of strings. For all definitions I can think of for "member of the sequence", they are. You get them when you iterate over them, you get them when you use index access, they work with .index(). They have a sort of infinite regress / cycle to them ("it's strings all the way down"), but you can get that with lists too (x = []; x.append(x); y = x + x -- compare with x = 'a'; y = x + x).
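As a rough illustration of the points above (Python 3 assumed; the variable names other than `x` and `y` are mine):

```python
s = "abc"

# Iteration, index access, and .index() all deal in length-1 strings.
assert list(s) == ["a", "b", "c"]
assert s[1] == "b"
assert s.index("b") == 1

# "Strings all the way down": indexing a length-1 string gives an equal string.
c = "a"
assert c[0] == c
assert c[0][0][0] == c

# The list analogue: a list that contains itself.
x = []
x.append(x)
y = x + x
assert x[0] is x
assert len(y) == 2 and y[0] is x and y[1] is x

# The string counterpart.
x_str = "a"
y_str = x_str + x_str
assert y_str == "aa" and y_str[0] == x_str
```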
1. That is how it is in the (math) theory of strings. 'A' is both a char and a string of length one. There is no separate 'char' type that cannot be added (concatenated) to other strings of whatever length.
At least in the context of formal language theory (e.g. Sipser's Introduction to the Theory of Computation), characters (symbols) are a separate thing from strings. You have your alphabet, Sigma, which is an arbitrary set, and strings are finite sequences of elements from Sigma. Python, in effect, has chosen an alphabet whose elements are themselves length-1 strings over that same alphabet. I don't think that's really well-formed using this definition of string and ZFC, and the usual definitions of finite sequences (functions or linked lists). It doesn't really matter; you can model it in something else.
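To make the symbol/string distinction concrete, here is a toy model of my own (not anything from Sipser or from Python itself): the alphabet `SIGMA` is an arbitrary set of ints, and strings are tuples over it. `SIGMA` and `is_string` are hypothetical names.

```python
# Toy model: an alphabet of arbitrary symbols, with strings as tuples over it.
SIGMA = frozenset({0, 1})  # arbitrary choice of symbols for illustration


def is_string(w, sigma=SIGMA):
    """Return True if w is a finite sequence (tuple) of symbols from sigma."""
    return isinstance(w, tuple) and all(sym in sigma for sym in w)


w = (0, 1, 1)
assert is_string(w)
assert is_string(())          # the empty string

symbol = w[0]                 # indexing gives a symbol
substring = w[0:1]            # slicing gives a string
assert symbol == 0
assert substring == (0,)
assert symbol != substring    # a symbol is not a length-1 string here
assert not is_string(symbol)

# Python's str differs: indexing returns another str, equal to the slice.
assert isinstance("abc"[0], str)
assert "abc"[0] == "abc"[0:1]
```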
I do, however, understand the tendency to think of bytes as strings because of both Python's history and the remnant string interface.
I would add the syntax of bytes literals to the list of similarities. br'\foo' versus r'\foo' makes them very similar.
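The literal syntax in question, sketched for Python 3 (variable names are mine):

```python
raw_text = r'\foo'    # raw str literal: backslash, f, o, o
raw_bytes = br'\foo'  # raw bytes literal: the same four ASCII characters as bytes

assert len(raw_text) == 4
assert len(raw_bytes) == 4

# Without the r prefix, \f is a form-feed escape in both kinds of literal.
assert len('\foo') == 3
assert len(b'\foo') == 3

# Same spelling, different element types.
assert raw_text[0] == '\\'    # a length-1 str
assert raw_bytes[0] == 0x5C   # an int (the backslash byte)
assert raw_bytes == raw_text.encode('ascii')
```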
For people using non-Latin (non-ASCII) alphabets, the 'convenience' of replacing some bytes with ASCII characters might be less convenient.
Eh, actually I think what was suggested was having e.g. b'\x42' == 0x42 by making singleton bytes objects equal to the appropriate integer. This would work for all bytes, not just those smaller than 128.

Devin Jeanpierre
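For reference, the singleton-bytes suggestion could be mocked up roughly as follows. This is only a sketch: `IntLikeBytes` is a hypothetical subclass, not an existing or proposed API. In current Python 3, b'\x42' == 0x42 is False while b'\x42'[0] == 0x42 is True.

```python
class IntLikeBytes(bytes):
    """Hypothetical bytes subclass: a length-1 instance equals its byte value."""

    def __eq__(self, other):
        if isinstance(other, int) and len(self) == 1:
            return self[0] == other
        return super().__eq__(other)

    # Defining __eq__ disables the inherited hash, so restore it.
    # Caveat: hash(IntLikeBytes(b'B')) still differs from hash(0x42), so
    # equal objects have unequal hashes; a real design would need to fix that.
    __hash__ = bytes.__hash__


# Current behaviour: indexing a bytes object already gives the integer.
assert b'\x42'[0] == 0x42

# With the hypothetical subclass, the singleton itself compares equal.
assert IntLikeBytes(b'\x42') == 0x42
assert IntLikeBytes(b'\xff') == 0xFF       # works above 127 too
assert not (IntLikeBytes(b'ab') == 0x61)   # only length-1 objects qualify
assert IntLikeBytes(b'ab') == b'ab'        # ordinary bytes equality is kept
```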