[IPython-dev] ASCII Terminal IPython re-encodes bytes greater than 127
tom at hackerschool.com
Sat Jul 26 14:31:09 EDT 2014
I've been grappling with just this dichotomy of ambiguous representations,
so it's great to read it expressed so clearly.
If I understand correctly, IPython is something like
repr(eval(raw_input('>>> ').decode(sys.stdin.encoding, 'replace')))
and therefore b'þ' in an ascii encoded terminal will end up being the
unicode replacement character \ufffd because it can't be encoded in ascii,
the reported encoding. When the code is evaluated, if it's not in a string
literal it will be a syntax error (though in an ascii terminal this
traceback can't be written to stdout). If it appears in a unicode literal,
it's \ufffd, and if it's a bytestring literal it's \xef\xbf\xbd, the utf8
encoding of the previous character.
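In Python 3 terms that replacement step checks out like this (a minimal
sketch of the decode above, using 0xFE, the latin-1 byte for þ):

```python
# What a terminal reporting encoding 'ascii' does to the byte 0xFE,
# per the repr(eval(input.decode(...))) sketch above:
raw = b"u'\xfe'"                          # what the terminal sends for u'þ'
decoded = raw.decode("ascii", "replace")  # errors='replace' substitutes U+FFFD
assert decoded == "u'\ufffd'"
# and the replacement character, UTF-8 encoded, is the three bytes EF BF BD:
assert "\ufffd".encode("utf-8") == b"\xef\xbf\xbd"
```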
This is simpler than the behavior I guessed was happening because I didn't
look up what \ufffd was
(http://en.wikipedia.org/wiki/Specials_(Unicode_block)) - I had wrongly
assumed ipython was decoding this byte with latin-1 and then re-encoding it
with utf8.
If one was in a position to reject keys on a byte-by-byte basis (as bpython
is) might it make sense to simply reject these bytes? If they come from the
keyboard, they're funny meta key presses (you pressed meta-a; it doesn't do
anything) and if they come from a paste event, the terminal emulator is
doing a terrible job of encoding into the reported encoding. However, a few
missing bytes would be more confusing than a few characters being replaced
with \ufffd.
I think I want to ignore these bytes individually, but replace them with
\ufffd when they happen in paste events. I'd love to hear comments on this
(I can take this off the list if it's off topic). Thanks very much for the
input (and for IPython, which is obviously awesome).
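A minimal sketch of that policy (the filter_input name and the is_paste
flag are hypothetical, not bpython's actual API):

```python
def filter_input(data: bytes, is_paste: bool) -> str:
    """Handle bytes > 127 from an ascii-reporting terminal."""
    if is_paste:
        # keep alignment with what was pasted: mark each bad byte with U+FFFD
        return data.decode("ascii", "replace")
    # interactive keypress (e.g. a meta-chord): silently ignore the byte
    return bytes(b for b in data if b < 128).decode("ascii")

assert filter_input(b"abc\xe1d", is_paste=False) == "abcd"
assert filter_input(b"abc\xe1d", is_paste=True) == "abc\ufffdd"
```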
On Fri, Jul 25, 2014 at 5:34 PM, Thomas Kluyver <takowl at gmail.com> wrote:
> Hi Tom,
> It's been a couple of years since I investigated this, but from what I
> remember, the trouble is that either the representation of bytes literals
> in a piece of code stored as unicode, or the representation of unicode
> literals in a piece of code stored as bytes, is ambiguous. That is, either
> of these can do the wrong thing, depending on your locale:
> exec(b"a = u'þ'")
> exec(u"a = b'þ'")
> The first one will be wrong in a UTF-8 terminal (i.e. modern Linux or
> Mac), and the second will be wrong in, IIRC, a non-UTF8 terminal (i.e.
> Windows). We decided that non-ascii characters in unicode strings were more
> important than non-ascii characters in byte strings, so we compile the code
> as a unicode string, so that unicode literals are handled correctly. The
> fact that Python 3 throws a syntax error for non-ascii characters in a
> bytes literal supports this choice: the case we get 'wrong' is not even
> allowed on Python 3.
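That Python 3 behaviour is easy to check (a minimal sketch):

```python
# Python 3 rejects non-ASCII characters in a bytes literal at compile time,
# while the same character in a unicode literal is accepted:
try:
    compile("a = b'\u00fe'", "<cell>", "exec")
    raised = False
except SyntaxError:
    raised = True
assert raised
compile("a = u'\u00fe'", "<cell>", "exec")   # compiles fine
```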
> I have thought about how we could get this 'right', i.e. matching the
> plain Python shell, for non-ascii bytes literals in Python 2 in non-UTF-8
> terminals, but all of the options seem worse than the current situation:
> - The Python shell itself seems to interface with the parser/compiler in a
> way that is not possible from pure Python programs
> - We could prepend a "# coding: foo" comment to each line of code before
> parsing it, inserting the terminal's encoding. But this messes up line
> numbers and tracebacks.
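A sketch of why the prepended coding comment shifts line numbers,
demonstrated with Python 3, where compile() honours the declaration for
bytes input (the "<cell>" filename is arbitrary):

```python
src = b"# coding: latin-1\nraise ValueError('\xe1')\n"
try:
    exec(compile(src, "<cell>", "exec"))
except ValueError as err:
    message = str(err)            # 'á': the 0xE1 byte was decoded as latin-1
    tb = err.__traceback__
    while tb.tb_next is not None:
        tb = tb.tb_next
    lineno = tb.tb_lineno         # the prepended line shifted the traceback
assert message == "\xe1"
assert lineno == 2                # the user typed the raise on line 1
```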
> - We could parse each piece of code *twice*, once as bytes and once as
> unicode, then walk the ASTs and copy bytes literals from the bytes-parsed
> tree to the unicode-parsed tree (or unicode literals the other way). That's
> filed under "ideas too clever for their own good".
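Purely as an illustration of the shape of that "parse twice" idea, in
Python 3 syntax (where, unlike Python 2, the two parses happen to agree):

```python
import ast

source_text = "a = u'\u00fe'\nb = b'abc'\n"

tree_unicode = ast.parse(source_text)                # parsed as unicode
tree_bytes = ast.parse(source_text.encode("utf-8"))  # parsed as bytes

# walk both (structurally identical) trees in lockstep and copy bytes
# literals from the bytes-parsed tree into the unicode-parsed one
for node_u, node_b in zip(ast.walk(tree_unicode), ast.walk(tree_bytes)):
    if isinstance(node_u, ast.Constant) and isinstance(node_u.value, bytes):
        node_u.value = node_b.value

ns = {}
exec(compile(tree_unicode, "<cell>", "exec"), ns)
assert ns["a"] == "\u00fe"
assert ns["b"] == b"abc"
```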
> So it's certainly not a feature, but it's a flaw that very few people have
> seemed to run into. Before IPython 0.11, we treated code the other way,
> breaking unicode literals, which did result in bug reports and patches that
> appeared to fix the issue without really working out the details.
> I hope that helps, feel free to ask if you have any more questions about it.
> On 25 July 2014 14:04, Thomas Ballinger <tom at hackerschool.com> wrote:
>> I'm interested about why IPython (python 2) does ascii encoding the way
>> it does.
>> When I run ipython in an ASCII terminal and enter a byte greater than 127,
>> it appears to be decoded using latin-1 and re-encoded with utf8.
>> In : '<meta-a, or byte \xe1>'
>> Out : '\xef\xbf\xbd'
>> In an ASCII-encoded python file, this would be an error. In an ASCII
>> vanilla Python interpreter, this would be just the byte entered, '\xe1'. In
>> terminal ipython, just entering the byte (without quotes) gives ERROR -
>> failed to write data to stream. In vanilla ascii python, this would be a
>> syntax error.
>> I'm interested in why this decision was made. I'm in the process of
>> choosing a behavior for bpython, and so far can think of:
>> 1) vanilla python 2 behavior - run source code as bytes when terminal is
>> ascii encoded
>> 2) vanilla python 3 behavior - syntax error on finding this character
>> 3) ipython behavior - somehow figure out (guess? what happens in
>> ipython?) which character is being represented on the user's terminal by
>> this byte and decode it to unicode, then re-encode it (all assuming it's in a
>> string). I don't understand the specifics of this.
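For what it's worth, the observed Out value matches decoding with
errors='replace' followed by UTF-8 encoding, rather than a latin-1 round
trip - a quick Python 3 check:

```python
byte = b"\xe1"   # the meta-a byte from the example above
# errors='replace' then UTF-8 gives the observed b'\xef\xbf\xbd':
assert byte.decode("ascii", "replace").encode("utf-8") == b"\xef\xbf\xbd"
# a latin-1 round trip would instead give b'\xc3\xa1' (UTF-8 for 'á'):
assert byte.decode("latin-1").encode("utf-8") == b"\xc3\xa1"
```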
>> I'm particularly interested in whether it's important functionality to
>> users (maybe for localization? do some people's terminals say ASCII but
>> really represent important characters they can enter with their keyboards?)
>> Thanks very much for any thoughts,
>> IPython-dev mailing list
>> IPython-dev at scipy.org