[Python-ideas] Python 3.x and bytes

Thu May 19 07:02:32 CEST 2011

On Wed, May 18, 2011 at 11:10 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> For all sequences, slicing (if it works at all) returns a subsequence
> (possibly of length 0, which is why slicing can work with out-of-bounds
> slice points). For all (built-in) sequences except for strings, indexing
> returns a member of the sequence (which is why it raises an exception for
> out-of-bounds indexes). Leaving aside extension and user-defined sequences,
> strings are unique in instead returning a length-1 subsequence. So bytes are
> normal while strings are anomalous!

I don't see the necessity of saying that length-1 strings aren't
members of strings. For all definitions I can think of for "member of
the sequence", they are. You get them when you iterate over a string, you
get them when you use index access, and they work with .index(). They have
a sort of infinite regress / cycle to them ("it's strings all the way
down"), but you can get that with lists too (x = []; x.append(x); y =
x + x -- compare with x = 'a'; y = x + x).
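
To make the comparison concrete, here is a small sketch of the behaviour
being discussed, as it stands in Python 3. The variable names are only
for illustration.

```python
s = "abc"
b = b"abc"

# Indexing a str gives a length-1 str; indexing bytes gives an int.
str_item = s[0]        # 'a'
bytes_item = b[0]      # 97

# Slicing returns a subsequence of the same type in both cases.
str_slice = s[0:1]     # 'a'
bytes_slice = b[0:1]   # b'a'

# Length-1 strings act as members under iteration, index access and .index().
members = list(s)              # ['a', 'b', 'c']
position = s.index(s[1])       # 1
is_member = s[0] in s          # True

# "Strings all the way down": indexing a length-1 string gives an equal string.
regress = s[0][0][0] == s[0]   # True

# A list can contain itself, which gives a comparable cycle.
x = []
x.append(x)
y = x + x
cycle = x[0] is x and y[0] is x and y[1] is x   # True
```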

> 1. That is how it is in the (math) theory of strings. 'A' is both a char and
> a string of length one. There is no separate 'char' type that cannot be
> added (concatenated) to other strings of whatever length.

At least in the context of formal language theory (e.g. Sipser's
Introduction to the Theory of Computation), characters (symbols) are a
separate thing from strings. You have your alphabet, Sigma, which is
an arbitrary set, and strings are finite sequences of elements from
Sigma.

In Python's case, the alphabet chosen is one whose elements are
themselves length-1 strings over that same alphabet. I don't think
that's really well-formed using this definition of string in ZFC with
the usual definitions of finite sequences (functions or linked lists).
It doesn't really matter; you can model it in something else.

> I do, however, understand the tendency to think of bytes as strings because
> of both Python's history and the remnant string interface.

I would add the syntax of bytes literals to the list of similarities.
br'\foo' versus r'\foo' makes them very similar.
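
As a rough illustration of how close the two literal syntaxes are, the
sketch below compares the raw and non-raw forms in Python 3. The
variable names are made up for the example.

```python
raw_text = r'\foo'     # str: backslash, f, o, o
raw_bytes = br'\foo'   # bytes: the same four ASCII code points

# Both raw literals keep the backslash, so both have length 4.
lengths = (len(raw_text), len(raw_bytes))

# Encoding the str as ASCII yields exactly the bytes literal.
encoded_match = raw_text.encode('ascii') == raw_bytes

# Without the r prefix, \f is a form-feed escape in both kinds of literal.
cooked_lengths = (len('\foo'), len(b'\foo'))
```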

> For people using non-Latin (non-ascii) alphabets, the 'convenience' of
> replacing some bytes with ascii-chars might be less convenient.

Eh, actually I think what was suggested was having e.g. b'\x42' ==
0x42 by making singleton bytes objects equal to the appropriate
integer. This would work for all bytes, not just those smaller than
128.
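
For reference, b'\x42' == 0x42 is False in Python 3 as it stands. A
minimal sketch of what the suggestion might look like, using a
hypothetical subclass (SingletonEqBytes is my own name and not part of
Python, and this is not necessarily how it would be implemented in the
core):

```python
class SingletonEqBytes(bytes):
    """Hypothetical bytes subclass; a length-1 instance equals its int value."""

    def __eq__(self, other):
        # bool is a subclass of int; exclude it to keep the sketch narrow.
        if isinstance(other, int) and not isinstance(other, bool):
            return len(self) == 1 and self[0] == other
        return super().__eq__(other)

    def __ne__(self, other):
        # bytes defines its own __ne__, so it has to be overridden as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    # Defining __eq__ disables the inherited hash, so restore it. Note that
    # hash(b'B') != hash(66), so dicts and sets would still treat the two
    # as different keys; a real design would have to address that.
    __hash__ = bytes.__hash__


plain_result = (b'\x42' == 0x42)                  # current behaviour
low_byte = SingletonEqBytes(b'\x42') == 0x42      # byte below 128
high_byte = SingletonEqBytes(b'\xff') == 0xff     # byte above 127
too_long = SingletonEqBytes(b'ab') == 0x61        # not a singleton
reflected = 0x42 == SingletonEqBytes(b'\x42')     # int on the left
```

The same comparison works for every byte value from 0 to 255, which is
the point about it not being limited to ASCII.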

Devin Jeanpierre