[IPython-dev] IPython handles code input as latin1 instead of the system encoding

Thu Jun 17 17:28:40 EDT 2010

Hi,

I just noticed a problem with non-ASCII input in IPython, which can be
seen below:

Python behavior (expected):

---------
$ python
Python 2.6 (r26:66714, Nov  3 2009, 17:33:38)
[GCC 4.4.1 20090725 (Red Hat 4.4.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, locale
>>> print sys.stdin.encoding,locale.getdefaultlocale()
UTF-8 ('en_US', 'UTF8')
>>> print repr(u'áé')
u'\xe1\xe9'
-------------
(two unicode characters as result, as expected)


IPython behavior:

------------------
$ ipython
Python 2.6 (r26:66714, Nov  3 2009, 17:33:38)
Type "copyright", "credits" or "license" for more information.

IPython 0.11.alpha1.git -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]: import sys, locale

In [2]: print sys.stdin.encoding,locale.getdefaultlocale()
UTF-8 ('en_US', 'UTF8')

In [3]: print repr(u'áé')
u'\xc3\xa1\xc3\xa9'

In [4]: 
----------------------
(four unicode characters as result, because the UTF-8 source code was
evaluated as if it were encoded as latin1)
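
The same mix-up can be reproduced outside any shell. This is only a minimal
sketch of what I think is happening, not code taken from IPython. It assumes
Python 3 so it runs on a current interpreter, and the variable names are
made up for illustration:

```python
# Sketch only: reproduce the suspected mis-decoding by hand (assumes Python 3).
text = u'\xe1\xe9'                      # the two characters typed at the prompt
utf8_bytes = text.encode('utf-8')       # what a UTF-8 terminal delivers: 4 bytes
misread = utf8_bytes.decode('latin-1')  # every byte treated as one character
correct = utf8_bytes.decode('utf-8')    # decoded with the real input encoding

print(len(utf8_bytes), len(misread), len(correct))
print(misread == u'\xc3\xa1\xc3\xa9')   # same four characters IPython reports
print(correct == text)                  # the two characters plain Python reports
```

Decoding the four UTF-8 bytes as latin1 gives exactly the four-character
string from the IPython session, and decoding them as UTF-8 gives the
two-character string from the plain Python session.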


System info:
Fedora 11
python-2.6-12.fc11.i586
IPython compiled from git (commit 01435682f2751a31eb3dbe0eaae9fc6dc960f8fb)


Looking at the code, the problem seems to be that all input is converted
back to a byte string before being passed to the Python compiler. The compiler
then has to guess the encoding of the source string. Passing a unicode object
to CommandCompiler() instead makes Python behave as expected:


>>> import codeop
>>> cp = codeop.CommandCompiler()
>>> exec cp('print repr(u"áé")')   # not the expected behavior
u'\xc3\xa1\xc3\xa9'
>>> exec cp(u'print repr(u"áé")')  # the expected behavior
u'\xe1\xe9'

To show that this is not a problem with my local terminal encoding, the same
issue can be reproduced using ASCII-only '\x' escapes in the source:

>>> code = 'print repr(u"\xc3\xa1\xc3\xa9")' # utf-8 source code
>>> print code
print repr(u"áé")
>>> exec cp(code)
u'\xc3\xa1\xc3\xa9'
>>> exec cp(unicode(code, 'utf-8'))
u'\xe1\xe9'
>>> 


I believe the cause of the problem is in IPython/core/iplib.py, in
InteractiveShell.runsource():

    def runsource(self, source, filename='<input>', symbol='single'):
       [...]
       source=source.encode(self.stdin_encoding)


I don't know why IPython re-encodes the unicode object to a byte string before
passing it to the Python compiler. The Python interactive shell doesn't do that
with the unicode objects it receives as input. It just passes the unicode object
directly to CommandCompiler, as can be seen at:

http://svn.python.org/projects/python/trunk/Lib/code.py

(there are no calls to .encode() in the Python code at the URL above).
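
For comparison, here is a minimal sketch of handing the text straight to
CommandCompiler with no intermediate .encode() step. It assumes Python 3,
where every str is already unicode, so it illustrates the approach rather
than the Python 2.6 session above. The names compiler, source and namespace
are only illustrative:

```python
# Sketch only: compile unicode text directly, without encoding it first
# (assumes Python 3; this is not the IPython code path itself).
import codeop

compiler = codeop.CommandCompiler()

# The source is kept as text all the way to the compiler.
source = u'chars = u"\xe1\xe9"'
code = compiler(source, '<input>', 'single')

namespace = {}
exec(code, namespace)
chars = namespace['chars']

print(len(chars))               # number of characters in the literal
print(chars == u'\xe1\xe9')     # True when the literal was read correctly
```

Because the compiler receives text rather than bytes, it has no encoding to
guess, and the literal keeps its two characters.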

The patch below solves the problem for me, but I am not completely sure
it won't cause trouble on other platforms or older Python versions.

---

diff --git a/IPython/core/iplib.py b/IPython/core/iplib.py
index ae56cfb..477a45c 100644
--- a/IPython/core/iplib.py
+++ b/IPython/core/iplib.py
@@ -2146,9 +2146,8 @@ class InteractiveShell(Component, Magic):
         # this allows execution of indented pasted code. It is tempting
         # to add '\n' at the end of source to run commands like ' a=1'
         # directly, but this fails for more complicated scenarios
-        source=source.encode(self.stdin_encoding)
-        if source[:1] in [' ', '\t']:
-            source = 'if 1:\n%s' % source
+        if source[:1] in [u' ', u'\t']:
+            source = u'if 1:\n%s' % source
 
         try:
             code = self.compile(source,filename,symbol)

-- 
Eduardo