Python and Jython inconsistencies when encoding strings

Martin v. Löwis loewis at informatik.hu-berlin.de
Fri Sep 6 09:11:22 EDT 2002


"Andre Michel Descombes" <amdescombes at qualicontrol.com> writes:

> I've noticed the following inconsistency between Python 2.1 and
> Jython 2.1 : when I do the following in Python :

> >>> s = u'£' # Alt-0163, pound sign
> >>> print s

Please try

>>> s
u'\x9c'

Notice that Unicode U+009C is a control character, STRING TERMINATOR,
and not U+00A3, POUND SIGN. You may wonder where the 9C value comes
from: in a console window, Windows (probably) uses code page 850,
which contains the mapping

<U00A3>     /x9c         POUND SIGN

So when you enter Alt-0163, the terminal reports the byte 0x9c, not
the byte 0xa3. In Python 2.x, you can use a non-ASCII character in a
Unicode literal only if it is encoded in Latin-1. However, your pound
sign was not entered in Latin-1, but in cp850. To really get a pound
sign into a Unicode literal, you need to write

>>> s = u"\u00A3"
>>> print s.encode("cp850")

You should use cp850 here, since your terminal uses cp850, not
Latin-1.
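
The difference between the two encodings can be checked directly. The
following is only a sketch, written in Python 3 syntax rather than the
Python 2.1 used in this thread; the helper name console_bytes and the
default of cp850 are assumptions made for illustration.

```python
# Sketch in Python 3 syntax; the thread itself used Python 2.1.
POUND = "\u00a3"  # U+00A3 POUND SIGN, written as an escape to avoid input issues


def console_bytes(text, console_encoding="cp850"):
    """Encode text for a console using the given code page (cp850 is assumed)."""
    return text.encode(console_encoding)


oem_bytes = console_bytes(POUND)                # bytes a cp850 console expects
latin1_bytes = console_bytes(POUND, "latin-1")  # bytes a Latin-1 terminal expects
misread = oem_bytes.decode("latin-1")           # the cp850 byte misread as Latin-1

print(repr(oem_bytes), repr(latin1_bytes), repr(misread))
```

The cp850 result is the single byte 0x9c, and reading that byte back as
Latin-1 gives U+009C, which matches the u'\x9c' shown above.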

> now when I do the same thing in Jython 2.1 I get this:
> 
> >>> s = u'£' # Alt-0163, pound sign
> >>> print s
> £
> >>>
> 
> and if I do:
> >>> print s.encode('latin-1')

This is interesting, since you also get

>>> s
u"\u0153"

Now, U+0153 is LATIN SMALL LIGATURE OE. It so happens that \x9c (what
the terminal sends) is U+0153 in CP 1252 (which is the ANSI code page
on your Windows installation). This might be a bug in Java, which
assumes that bytes sent by the terminal are in the ANSI code page,
when they are really in the OEM code page.
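
To see how one byte yields two different characters, one can decode it
under both code pages. Again this is a sketch in Python 3 syntax, not
what Jython does internally; the helper name interpret is hypothetical,
and cp850 and cp1252 are the code pages assumed above.

```python
import unicodedata

RAW = b"\x9c"  # the byte a cp850 console sends for the pound sign


def interpret(raw, encoding):
    """Decode a single byte and return its code point and Unicode name."""
    ch = raw.decode(encoding)
    # Control characters have no name, so supply a fallback string.
    return "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")


as_oem = interpret(RAW, "cp850")    # treating the byte as OEM code page data
as_ansi = interpret(RAW, "cp1252")  # treating the byte as ANSI code page data

print(as_oem)
print(as_ansi)
```

Decoding as cp850 gives the pound sign, while decoding as cp1252 gives
the OE ligature, consistent with the u"\u0153" that Jython reports.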

> Does anybody know what is causing this inconsistency? Is there any way to
> avoid it?

Yes. Don't use the console.

Regards,
Martin


