[Python-Dev] bytes type discussion

Wed Feb 15 01:24:46 CET 2006

On Tue, Feb 14, 2006 at 03:13:25PM -0800, Guido van Rossum wrote:

> Martin von Loewis's alternative for the "very controversial" set is to
> disallow an encoding argument and (I believe) also to disallow Unicode
> arguments. In 3.0 this would leave us with s.encode(<encoding>) as the
> only way to convert a string (which is always unicode) to bytes. The
> problem with this is that there's no code that works in both 2.x and
> 3.0.

Unless you only ever create (byte)strings by doing s.encode(), and only send
them to code that is either byte/string-agnostic or -aware. Oh, and don't
use indexing, only slicing (length-1 if you have to.) I guess it depends on
howmuch code will accept a bytes-string where currently a string is the norm
(and a unicode object is default-encoded.)

I'm still worried that all this is quite a big leap. Very few people
understand the intricacies of unicode encodings. (Almost everyone
understands unicode, except they don't know it yet; it's the encodings that
are the problem.) By forcing everything to be unicode without a uniform
encoding-detection scheme, we're forcing every programmer who opens a file
or reads from the network to think about encodings. This will be a pretty
big step for newbie programmers.

And it's not just that. The encoding of network streams or files may be
entirely unknown beforehand, and depend on the content: a content-encoding,
a <META EQUIV> HTML tag. Will bytes-strings get string methods for easy
searching of content descriptors? Will the 're' module accept bytes-strings?
What would the literals you want to search for, look like? Do I really do
'if bytes("Content-Type:") in data:' and such? Should data perhaps get read
using the opentext() equivalent of 'decode('ascii', 'replace')' and then
parsed the 'normal' way? What about data gotten from an extension? And
nevermind what the 'right way' for that is; what will *programmers* do? The
'right way' often escapes them.

It may well be that I'm thinking too conservatively, too stuck in the old
ways, but I think we're being too hasty in dismissing the ol' string. Don't
get me wrong, I really like the idea of as much of Python doing unicode as
possible, and the idea of a mutable bytes type sounds good to me too. I just
don't like the wide gap between the troublesome-to-get unicode object and
the unreadable-repr, weird-indexing, hard-to-work-with bytes-string. I don't
think adding something inbetween is going to work (we basically have that
now, the normal string), so I suggest the bytes-string becomes a bit more
'string' and a bit less 'sequence of bytes'. Perhaps in the form of:

 - A bytes type that repr()'s to something readable

 - A way to write byte literals that doesn't bleed the eyes, and isn't so
   fragile in the face of source-encoding (all the suggestions so far have
   you explicitly re-stating the source-encoding at each bytes("".encode()))
   If you have to wonder why that's fragile, just think about a recoding
   editor. Alternatively, get a short way to say 'encode in source-encoding'

 (I can't think of anything better than b"..." for the above two...
  Except... hmm... didn't `` become available in Py3k? Too little visual
  distinction?)

 - A way to manipulation the bytes as character-strings. Pattern matching,
   splitting, finding, slicing, etc. Quite like current strings.

 - Disallowing any interaction between bytes and real (meaning 'unicode')
   strings. Not "oh, let's assume ascii or the default encoding", either. If
   the user wants to explicitly decode using 'ascii', that's their choice,
   but they should consciously make it.

 - Mutable or immutable, I don't know. I fear that if the bytes type was
   easy enough to handle and mutable, and the normal (unicode) strings were
   immutable, people may end up using bytes all the time. In fact, they may
   do that anyway; I'm sure Python will grow entire subcults that prefer
   doing 'string("\xa1Python!")' where 'string' is
   'bytes(arg.encode("iso-8859-1"))'

Bytes should be easy enough to manipulate 'as strings' to do the basic
tasks, but not easy enough to encourage people to forget about that whole
annoying 'encoding' business and just use them instead (which is basically
what we have now.) On the other hand, if people don't want to deal with that
whole encoding business, we should allow them to -- consciously. We can
offer a variety of hints and tips on how to figure out the encoding of
something, but we can't do the thinking for them (trust me, I've tried.)

When a file's encoding is specified in file metadata, that's great, really
great. When a network connection is handled by a library that knows how to
deal with the content (*cough*Twisted*cough*) and can decode it for you,
that's really great too. But we're not there yet, not by a long shot. And
explaining encodings to a ADHD-infested teenager high on adrenalin and
creative inspiration who just wants to connect to an IRC server to make his
bot say "Hi!", well, that's hard. I'd rather they don't go and do PHP
instead. Doing it right is hard, but it's even harder to do it all right the
first time, and Python never really worried about that ;P

-- 
Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!