[IPython-dev] ASCII Terminal IPython re-encodes bytes greater than 127

Fri Jul 25 17:34:51 EDT 2014

Hi Tom,

It's been a couple of years since I investigated this, but from what I
remember, the trouble is that one of two things is always ambiguous: the
representation of bytes literals in a piece of code stored as unicode, or
the representation of unicode literals in a piece of code stored as bytes.
That is, either of these can do the wrong thing, depending on your locale:

exec(b"a = u'þ'")
exec(u"a = b'þ'")

The first one will be wrong in a UTF-8 terminal (e.g. modern Linux or Mac),
and the second will be wrong in, IIRC, a non-UTF-8 terminal (e.g. Windows).
We decided that non-ASCII characters in unicode strings were more important
than non-ASCII characters in byte strings, so we compile the code as a
unicode string, so that unicode literals are handled correctly. The fact
that Python 3 throws a syntax error for non-ASCII characters in a bytes
literal supports this choice: the case we get 'wrong' is not even allowed
on Python 3.
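
To make the two failure modes concrete, here is a rough sketch that models
them with explicit encode/decode calls, so it also runs on Python 3. It rests
on two assumptions about Python 2 that are not spelled out above: that a
bytes source with no coding declaration has its unicode literals read as
Latin-1, and that a unicode source has its bytes literals stored as UTF-8.
The function names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: simulates the two Python 2 code paths with explicit
# encode/decode calls instead of calling exec() on Python 2 itself.

THORN = u"\u00fe"  # the character þ typed at the prompt


def unicode_literal_from_bytes_source(terminal_encoding):
    """Model exec(b"a = u'þ'"): assumes the compiler reads the raw
    bytes inside the unicode literal as Latin-1."""
    typed_bytes = THORN.encode(terminal_encoding)
    return typed_bytes.decode("latin-1")


def bytes_literal_from_unicode_source(terminal_encoding):
    """Model exec(u"a = b'þ'"): assumes the compiler stores the
    literal as the UTF-8 encoding of the character."""
    typed_bytes = THORN.encode(terminal_encoding)
    source_char = typed_bytes.decode(terminal_encoding)  # shell decodes input
    return source_char.encode("utf-8")


for encoding in ("utf-8", "latin-1"):
    typed = THORN.encode(encoding)
    unicode_ok = unicode_literal_from_bytes_source(encoding) == THORN
    bytes_ok = bytes_literal_from_unicode_source(encoding) == typed
    print(encoding, "| unicode literal ok:", unicode_ok,
          "| bytes literal ok:", bytes_ok)
```

Under those assumptions, the unicode literal only survives the bytes path in
a Latin-1 terminal, and the bytes literal only survives the unicode path in a
UTF-8 terminal, which is the trade-off described above.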

I have thought about how we could get this 'right', i.e. matching the plain
Python shell, for non-ASCII bytes literals in Python 2 in non-UTF-8
terminals, but all of the options seem worse than the current situation:
- The Python shell itself seems to interface with the parser/compiler in a
way that is not possible from pure Python programs
- We could prepend a "# coding: foo" comment to each line of code before
parsing it, inserting the terminal's encoding. But this messes up line
numbers and tracebacks.
- We could parse each piece of code *twice*, once as bytes and once as
unicode, then walk the ASTs and copy bytes literals from the bytes-parsed
tree to the unicode-parsed tree (or unicode literals the other way). That's
filed under "ideas too clever for their own good".
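
As a rough illustration of the "# coding" option and its line-number cost,
the sketch below prepends a coding comment to a cell before parsing it. It
uses Python 3's ast module purely for demonstration (the problem above is
about Python 2), and parse_with_coding_comment is a hypothetical helper, not
anything IPython provides.

```python
import ast


def parse_with_coding_comment(source_bytes, terminal_encoding):
    """Hypothetical helper: prepend a coding comment naming the
    terminal's encoding, then parse the resulting bytes."""
    header = ("# coding: %s\n" % terminal_encoding).encode("ascii")
    return ast.parse(header + source_bytes)


# A one-line cell as a Latin-1 terminal would deliver it: þ is the byte 0xfe.
cell = u"a = u'\u00fe'\n".encode("latin-1")

tree = parse_with_coding_comment(cell, "latin-1")
assign = tree.body[0]

# The literal is decoded using the declared encoding...
print(ast.literal_eval(assign.value) == u"\u00fe")
# ...but the statement is now reported on line 2 instead of line 1.
print(assign.lineno)
```

Every line number reported for the cell would be off by one like this, so
tracebacks would need correcting afterwards.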

So it's certainly not a feature, but it's a flaw that very few people seem
to have run into. Before IPython 0.11, we treated code the other way,
breaking unicode literals, which did result in bug reports and patches that
appeared to fix the issue without really working out the details.

I hope that helps. Feel free to ask if you have any more questions about
this.
Thomas

On 25 July 2014 14:04, Thomas Ballinger <tom at hackerschool.com> wrote:

> I'm interested in why IPython (Python 2) handles ASCII encoding the way
> it does.
>
> When I run IPython in an ASCII terminal and enter a byte greater than 127,
> it appears to be decoded using Latin-1 and re-encoded with UTF-8.
>
>     In [1]: '<meta-a, or byte \xe1>'
>     Out [1]: '\xef\xbf\xbd'
>
> In an ASCII-encoded Python file, this would be an error. In an ASCII
> vanilla Python interpreter, this would be just the byte entered, '\xe1'. In
> terminal IPython, just entering the byte (without quotes) gives "ERROR -
> failed to write data to stream". In vanilla ASCII Python, this would be a
> syntax error.
>
> I'm interested in why this decision was made. I'm in the process of
> choosing a behavior for bpython, and so far can think of:
>
> 1) vanilla Python 2 behavior - run source code as bytes when the terminal
> is ASCII-encoded
> 2) vanilla Python 3 behavior - syntax error on finding this character
> 3) IPython behavior - somehow figure out (guess? what happens in IPython?)
> which character is being represented on the user's terminal by this byte
> and decode it to unicode, then re-encode it (all assuming it's in a string).
> I don't understand the specifics of this.
>
> I'm particularly interested in whether it's important functionality to
> users (maybe for localization? do some people's terminals say ASCII but
> really represent important characters they can enter with their keyboards?)
>
> Thanks very much for any thoughts,
>
> -Tom