[Python-Dev] Python code.interact() and UTF-8 locale

Victor STINNER victor.stinner-linux at haypocalc.com
Sun Sep 11 05:16:23 CEST 2005


Hi,

I found a bug in the Python interactive command line (the python program
run alone; it looks to be the code.interact() function in code.py). With
a UTF-8 locale, the command << u"é" >> returns << u'\xc3\xa9' >> and not
<< u'\xe9' >>. Remember: the French e with acute accent is Unicode code
point 233 (U+00E9), encoded as \xC3 \xA9 in UTF-8.
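
To check the numbers above, here is a small sketch. It uses explicit
escapes so the result does not depend on the source or terminal
encoding; the variable names are only for illustration.

```python
# U+00E9 written as an escape, so no source encoding is involved.
e_acute = u"\xe9"

# Code point of the character (decimal).
code_point = ord(e_acute)

# UTF-8 encodes this single character as two bytes.
# bytearray yields integers on both Python 2.6+ and Python 3.
utf8_bytes = bytearray(e_acute.encode("utf-8"))
hex_bytes = ["%02X" % b for b in utf8_bytes]

print(code_point)
print(hex_bytes)
```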

Another example of the bug:
  #-*- coding: UTF-8 -*-
  code = "u\"%s\"" % "\xc3\xa9"
  compiled = compile(code,'<string>',"single")
  exec compiled
Result:
  u'\xc3\xa9'
Expected result:
  u'\xe9'

After long hours of debugging (reading the Python documentation,
debugging Python with gdb, reading the Python C source code, ...) I found
the origin of the bug: the function parsestr() in Python/compile.c. This
function translates a string literal into a unicode string (or a classic
string). The problem arises when there is no encoding declaration: the
bytes of the string aren't converted.
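
The behaviour described here can be modelled roughly in Python. This is
only a sketch, not the C code of parsestr(): the function name
literal_to_unicode and its parameter are hypothetical, and Latin-1
decoding is an assumption used as a stand-in for "each byte is kept as
one character".

```python
def literal_to_unicode(raw, declared_encoding=None):
    """Hypothetical model of how the bytes of a u"..." literal are handled.

    raw is the byte content of the literal as read from the input.
    """
    if declared_encoding is not None:
        # An encoding declaration exists: the bytes are decoded with it.
        return raw.decode(declared_encoding)
    # No declaration: the bytes are not converted, so each byte ends up
    # as one code point (assumption: equivalent to a Latin-1 decode).
    return raw.decode("latin-1")


raw = b"\xc3\xa9"                              # "é" as typed in a UTF-8 terminal
without_decl = literal_to_unicode(raw)         # two characters, the bug
with_decl = literal_to_unicode(raw, "utf-8")   # one character, as expected

print(len(without_decl))
print(len(with_decl))
```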

Workaround for the code above (add an encoding declaration to the
compiled source):
  #-*- coding: ascii -*-
  code = """#-*- coding: UTF-8 -*-
  u\"%s\"""" % "\xc3\xa9"
  compiled = compile(code,'<string>',"single")
  exec compiled

Proposal: u"..." and unicode("...") should use sys.stdin.encoding by
default, so that they work like unicode("...", sys.stdin.encoding). Or,
more simply, the compiler should use sys.stdin.encoding instead of ascii
as its default encoding.
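
As an illustration of the idea, decoding with the console encoding could
look like the sketch below. The helper to_unicode is hypothetical and
not an existing Python function; the fallback to ascii when
sys.stdin.encoding is missing or None is also an assumption.

```python
import sys


def to_unicode(raw, encoding=None):
    """Hypothetical helper: decode bytes with the console encoding by default."""
    if encoding is None:
        # sys.stdin.encoding may be absent or None (for example when stdin
        # is not a terminal), so fall back to ascii in that case.
        encoding = getattr(sys.stdin, "encoding", None) or "ascii"
    return raw.decode(encoding)


# With a UTF-8 console, the two bytes typed for "é" become one character.
text = to_unicode(b"\xc3\xa9", "utf-8")
print(len(text))
```

Passing the encoding explicitly, as in the last lines, gives the same
result regardless of the locale; leaving it out makes the result depend
on the terminal, which is the point of the proposal.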

Sorry if someone has already reported this bug. And is it a bug or a
feature? ;-)

Bye, Haypo (who has just subscribed to the mailing list)


