[Python-Dev] PEP 332 revival in coordination with pep 349? [ Was:Re: release plan for 2.5 ?]

Tue Feb 14 09:09:22 CET 2006

On Mon, Feb 13, 2006 at 03:44:27PM -0800, Guido van Rossum wrote:

> But adding an encoding doesn't help. The str.encode() method always
> assumes that the string itself is ASCII-encoded, and that's not good
> enough:

> >>> "abc".encode("latin-1")
> 'abc'
> >>> "abc".decode("latin-1")
> u'abc'
> >>> "abc\xf0".decode("latin-1")
> u'abc\xf0'
> >>> "abc\xf0".encode("latin-1")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)

These comments disturb me. I never really understood why (byte) strings grew
the 'encode' method, since 8-bit strings *are already encoded*, by their
very nature. I mean, I understand it's useful because Python does
non-unicode encodings like 'hex', but I don't really understand *why*. The
benefits don't seem to outweigh the cost (but that's hindsight.)

Directly encoding a (byte) string into a unicode encoding is mostly useless,
as you've shown. The only use-case I can think of is translating ASCII into,
for instance, EBCDIC. Encoding anything into an ASCII superset is a no-op,
unless the system encoding isn't 'ascii' (and that's pretty rare, and not
something a Python programmer should depend on.) On the other hand, the fact
that (byte) strings have an 'encode' method creates a lot of confusion among
unicode newbies, and causes programs to break only when input is non-ASCII.
And non-ASCII input just happens too often and too unpredictably in
'real-world' code, and not enough in European programmers' tests ;P
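
To make the failure mode concrete, here is a minimal sketch of what calling
'encode' on a byte string amounts to: a hidden decode with the system
encoding, followed by the requested encode. The helper name
'encode_bytestring' and its 'system_encoding' parameter are made up for
illustration, and the byte strings are written as explicit b"" literals so
that both steps are visible.

```python
def encode_bytestring(data, encoding, system_encoding="ascii"):
    """Hypothetical helper: mimic calling encode() on a byte string.

    Step 1 decodes the bytes with the system encoding (assumed 'ascii').
    Step 2 encodes the resulting unicode text with the requested encoding.
    """
    text = data.decode(system_encoding)  # raises on any byte >= 0x80
    return text.encode(encoding)


# Pure ASCII into an ASCII superset: a no-op.
latin1_result = encode_bytestring(b"abc", "latin-1")

# ASCII into EBCDIC (cp500): the one conversion that actually changes bytes.
ebcdic_result = encode_bytestring(b"abc", "cp500")

# Non-ASCII input fails in the hidden decode step, not in the encode step.
try:
    encode_bytestring(b"abc\xf0", "latin-1")
except UnicodeDecodeError as exc:
    failure = exc.reason
```

The error only shows up once a byte outside ASCII arrives, which is why code
like this tends to pass tests written with plain ASCII input.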

Unicode objects and strings are not the same thing. We shouldn't treat them
as the same thing. They share an interface (like lists and tuples do), and
if you only use that interface, treating them as the same kind of object is
mostly ok. They actually share *less* of an interface than lists and tuples,
though, as comparing strings to unicode objects can raise an exception,
whereas comparing lists to tuples is not expected to. For anything less
trivial than indexing, slicing and most of the string methods, and anything
whatsoever involving non-ASCII (or, rather, non-system-encoding), unicode
objects and strings *must* be treated separately. For instance, there is no
correct way to do:

  s.split("\x80")

unless you know the type of 's'. If it's unicode, you want u"\x80" instead
of "\x80". If it's not unicode, splitting "\x80" may not even be sensible,
but you wouldn't know from looking at the code -- maybe it expects a
specific encoding (or encoding family), maybe not. As soon as you deal with
unicode, you need to really understand the concept, and too many programmers
don't. And it's very hard to tell from someone's comments whether they fail
to understand or just get some of the terminology wrong; that's why Guido's
comments about 'encoding a byte string' and 'what if the file encoding is
Unicode' scare me. The unicode/string mixup almost makes me wish Python
was statically typed.
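
As a rough sketch of the split example above, the only safe way to write it
is to branch on the type of 's' and pick a separator of the same type. The
function name 'split_on_0x80' is hypothetical, and the byte-string branch
assumes that splitting on the raw byte 0x80 is what the caller wants.

```python
def split_on_0x80(s):
    """Hypothetical helper: choose the separator that matches the type of s."""
    if isinstance(s, bytes):
        # Byte string: split on the raw byte 0x80, whatever that byte
        # means in the (unstated) encoding of s.
        return s.split(b"\x80")
    # Unicode object: split on the code point U+0080.
    return s.split(u"\x80")


byte_parts = split_on_0x80(b"left\x80right")
text_parts = split_on_0x80(u"left\x80right")
```

Even this only settles the type question. Whether splitting a byte string on
0x80 makes sense still depends on an encoding the code never states; in
UTF-8, for instance, 0x80 only ever appears as part of a multi-byte sequence.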

So please, please, please don't make the mistake of 'doing something' with
the 'encoding' argument to 'bytes(s, encoding)' when 's' is a (byte) string.
It wouldn't actually be usable except for the same things as 'str.encode':
to convert from ASCII to non-ASCII-supersets, or to convert to non-unicode
encodings (such as 'hex'.) You can achieve those two by doing, e.g.,
'bytes(s.encode('hex'))' if you really want to. Ignoring the encoding
(rather than raising an exception) would also allow code to be trivially
portable between Python 2.x and Py3K, when "" is actually a unicode object.

Not that I'm happy with ignoring anything, but not ignoring would be a bigger
crime here.
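
A minimal sketch of the behaviour argued for here, assuming a stand-in
function named 'to_bytes' (hypothetical; the real 'bytes' constructor was
still under discussion): the encoding is applied to unicode objects and
silently ignored for byte strings, which are already encoded.

```python
def to_bytes(s, encoding):
    """Hypothetical stand-in for the proposed bytes(s, encoding)."""
    if isinstance(s, bytes):
        # Already encoded: the encoding argument is deliberately ignored,
        # so no hidden decode step can fail on non-ASCII input.
        return s
    # Unicode object: the only case where the encoding applies.
    return s.encode(encoding)


from_bytes = to_bytes(b"abc\xf0", "utf-8")      # returned unchanged
from_latin1 = to_bytes(u"abc\xf0", "latin-1")   # encoded as one byte
from_utf8 = to_bytes(u"abc\xf0", "utf-8")       # encoded as two bytes
```

The same call then works whether the literal passed in happens to be a byte
string or a unicode object.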

Oh, and while on the subject, I'm not convinced going all-unicode in Py3K is
a good idea either, but maybe I should save that discussion for PyCon. I'm
not thinking "why do we need unicode" anymore (which I did two years ago ;)
but I *am* thinking it'll be a big step for 90% of the programmers if they
have to grasp unicode and encodings to be able to even do 'raw_input()'
sensibly. I know I spend an inordinate amount of time trying to explain the
basics on #python on irc.freenode.net already.

-- 
Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!