[Tutor] weird error in comment string

Magnus Lyckå magnus at thinkware.se
Sun Apr 25 19:38:20 EDT 2004


At 00:25 2004-04-26 +0200, denis wrote:
>Denis :
>Could you say more about this? Why doesn't Python accept the iso8859-1
>character set anymore?

It does! It just wants you to be explicit, which is very Pythonic... ;)

>I just did a test on weird French characters, which were previously
>processed and, as you suggest:
>'éèàùëêç'
>'\xe9\xe8\xe0\xf9\xeb\xea\xe7'
>they're not anymore :-(

I'm not sure what you did. What is the problem? This is on my Python 2.3
installation on Windows 2000:

 >>> print '\xe9\xe8\xe0\xf9\xeb\xea\xe7'
éèàùëêç
 >>> '\xe9\xe8\xe0\xf9\xeb\xea\xe7'
'\xe9\xe8\xe0\xf9\xeb\xea\xe7'

I think it's always been like this. repr() shows hex escapes for
non-ASCII bytes, while print writes the bytes as they are and leaves it
to the console to display them.
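
The difference between the escaped form and the displayed form can be
sketched as follows. This is only an illustration written in Python 3
syntax, where byte strings and text are separate types; the session above
is from Python 2.3. The variable names are made up for the example.

```python
# Python 3 sketch; not the Python 2.3 session quoted above.
raw = b'\xe9\xe8\xe0\xf9\xeb\xea\xe7'   # the same seven bytes

escaped = repr(raw)              # ASCII-only form with \x escapes
text = raw.decode('iso8859_1')   # explicit: read the bytes as ISO 8859-1

print(escaped)
print(len(raw), len(text))       # one character per byte in ISO 8859-1
# print(text) shows the accented letters on a console that can display them.
```

The explicit decode step is the part that older Python versions left to
the programmer and the console.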

>The standard 8-bit extended ASCII / ANSI set holds all of them, as well as
>the iso 8859-1 set and the first unicode page of 256 characters.

I don't know all the details around this. I think that Marc-André Lemburg
and Martin von Löwis might be the experts in this field (but I don't think
they follow tutor). Perhaps Tim Peters, who is sometimes seen here, knows
more about how the Python developers discussed these issues.

It seems to me that in versions up to and including 2.2, Python just
let you as a programmer take responsibility for the interpretation of
characters in the 128-255 range of normal strings. It would just output
bytes as provided, and if you wrote a program assuming ISO 8859-1, and
someone ran that script on a computer with Japanese or Russian settings,
they would see garbage on the screen.
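
To make the "garbage" concrete, here is a small sketch of the same bytes
read under two different code pages. It is written in Python 3 syntax as
an illustration only; cp1251 is picked here as one example of a Russian
code page, not something named in the thread.

```python
# Python 3 sketch; cp1251 is an assumed example of a Russian code page.
raw = b'\xe9\xe8\xe0\xf9\xeb\xea\xe7'

as_latin1 = raw.decode('iso8859_1')   # the accented letters 'éèàùëêç'
as_cp1251 = raw.decode('cp1251')      # Cyrillic letters 'йиащлкз'

# ascii() keeps the output printable on any console.
print(ascii(as_latin1))
print(ascii(as_cp1251))
print(as_latin1 == as_cp1251)         # same bytes, different text
```

The bytes never change; only the assumed code page does.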

From version 2.3, it seems Python is more and more adopting Unicode, along
with an awareness that not all the world uses the same 8-bit code page, and
that the 128-255 range for strings is ambiguous unless you declare what
code page you are using.

In the long term, I hope that all strings will be Unicode at some point,
and that this will all be transparent, but right now we can't really
ignore Unicode any longer if we use anything but US-ASCII, and it's not
as transparent as one might wish.

Of course, the big disadvantage is that Python programs have to get a
little more cluttered to work right now, but it has some advantages, for
instance, something like...

# -*- coding: iso8859_1 -*-
print u'Magnus Lyckå'

...will print my name right both in Linux, in a Windows GUI window and
in a Windows command line prompt. Without unicode strings, I had to
convert it to cp850 (or possibly cp437) to get it to display right at a
"DOS prompt".

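The manual conversion mentioned above amounts to picking a different byte
for the same character. A minimal sketch in Python 3 syntax (the thread
itself is about Python 2.3; the dictionary name is just for illustration):

```python
# Python 3 sketch: one text string, three byte encodings.
name = 'Magnus Lyckå'

encoded = {codec: name.encode(codec)
           for codec in ('iso8859_1', 'cp850', 'cp437')}

for codec, data in encoded.items():
    print(codec, data)
# 'å' is byte 0xe5 in ISO 8859-1 but 0x86 in both DOS code pages.
```

With a unicode string, the conversion to the console's code page happens
at output time instead of being hard-coded in the program.
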
It's not without problems though. I've had plenty of problems at my
current client, which still uses Windows NT 4.0. For instance, I get
an exception if I try something like raw_input(u'åäö'). It works in
Windows 2000 though, so I guess it's just a problem in ancient Windows
versions. (Microsoft is dropping support for NT 4.0, right?)


--
Magnus Lycka (It's really Lyckå), magnus at thinkware.se
Thinkware AB, Sweden, www.thinkware.se
I code Python ~ The Agile Programming Language 



